51
|
Leschziner GD, Jorgensen AL, Andrew T, Williamson PR, Marson AG, Coffey AJ, Middleditch C, Balding DJ, Rogers J, Bentley DR, Chadwick D, Johnson MR, Pirmohamed M. The association between polymorphisms in RLIP76 and drug response in epilepsy. Pharmacogenomics 2008; 8:1715-22. [PMID: 18086001 DOI: 10.2217/14622416.8.12.1715] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/21/2022]
Abstract
INTRODUCTION Approximately 30% of patients with epilepsy are resistant to treatment with anti-epileptic drugs (AEDs). The ABC drug transporter proteins are hypothesized to mediate drug resistance in epilepsy. More recently, a non-ABC putative transporter, RLIP76, has also been proposed to be involved in the mechanism of pharmacoresistance. One previous association study of six polymorphisms in RLIP76 failed to find any association with drug resistance in a retrospective cohort of epilepsy patients. We aimed to look for an association with outcomes reflecting drug response in a larger prospective cohort, with gene-wide coverage. PATIENTS AND METHODS We investigated the role of common polymorphisms in RLIP76 in epilepsy pharmacoresistance by genotyping 23 common RLIP76 polymorphisms in a prospective cohort of 503 epilepsy patients, from the standard and new anti-epileptic drugs (SANAD) prospective study of new and old AEDs. A total of 13 of these were tested for association with four outcomes reflecting response to drugs: time to first seizure, time to 12-month remission, time to withdrawal due to inadequate seizure control, and time to withdrawal due to unacceptable adverse drug events. RESULTS No significant associations, allowing for multiple testing, were found in the whole cohort. There was also no effect in a subgroup of patients on carbamazepine, which is thought to be a RLIP76 substrate, although two polymorphisms were associated with time to first seizure (p = 0.007). DISCUSSION We failed to demonstrate any association between RLIP76 polymorphisms and four different measures of drug response in the larger cohort, but a subgroup analysis of patients receiving carbamazepine suggested an association that should be investigated further. CONCLUSIONS Our data suggest that common variants in RLIP76 are unlikely to contribute to epilepsy drug response.
|
52
|
Su SY, Balding DJ, Coin LJM. Disease association tests by inferring ancestral haplotypes using a hidden Markov model. Bioinformatics 2008; 24:972-8. [PMID: 18296746 DOI: 10.1093/bioinformatics/btn071] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 01/18/2023]
Abstract
MOTIVATION Most genome-wide association studies rely on single nucleotide polymorphism (SNP) analyses to identify causal loci. The increased stringency required for genome-wide analyses (with per-SNP significance threshold typically approximately 10^-7) means that many real signals will be missed. Thus it is still highly relevant to develop methods with improved power at low type I error. Haplotype-based methods provide a promising approach; however, they suffer from statistical problems such as abundance of rare haplotypes and ambiguity in defining haplotype block boundaries. RESULTS We have developed an ancestral haplotype clustering (AncesHC) association method which addresses many of these problems. It can be applied to biallelic or multiallelic markers typed in haploid, diploid or multiploid organisms, and also handles missing genotypes. Our model is free from the assumption of a rigid block structure but recognizes a block-like structure if it exists in the data. We employ a Hidden Markov Model (HMM) to cluster the haplotypes into groups of predicted common ancestral origin. We then test each cluster for association with disease by comparing the numbers of cases and controls with 0, 1 and 2 chromosomes in the cluster. We demonstrate the power of this approach by simulation of case-control status under a range of disease models for 1500 outcrossed mice originating from eight inbred lines. Our results suggest that AncesHC has substantially more power than single-SNP analyses to detect disease association, and is also more powerful than the cladistic haplotype clustering method CLADHC. AVAILABILITY The software can be downloaded from http://www.imperial.ac.uk/medicine/people/l.coin.
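The per-cluster testing step in this abstract (comparing cases and controls that carry 0, 1 or 2 chromosomes in a cluster) can be illustrated with a small sketch. This is not the AncesHC implementation: it assumes a plain Pearson chi-square test on the 2x3 table of counts, and the function name `cluster_association_test` and the example counts are hypothetical.

```python
import math


def cluster_association_test(case_counts, control_counts):
    """Pearson chi-square test on a 2x3 table of cluster-membership counts.

    case_counts / control_counts: numbers of individuals carrying
    0, 1 and 2 chromosomes in the cluster (illustrative inputs).
    Returns (chi-square statistic, p-value on 2 degrees of freedom).
    """
    table = [list(case_counts), list(control_counts)]
    if any(len(row) != 3 for row in table):
        raise ValueError("each group needs counts for 0, 1 and 2 chromosomes")

    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Assumption: no empty row or column, so the test has exactly 2 df.
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("empty row or column: the 2-df test does not apply")

    stat = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected

    # For 2 degrees of freedom the chi-square survival function is exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical counts: cases are enriched for chromosomes in the cluster.
stat, p = cluster_association_test((10, 20, 30), (30, 20, 10))
print(stat, p)
```

A genome-wide analysis would compare such p-values with a stringent threshold of the order quoted in the abstract rather than with 0.05.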
|
53
|
Hoggart CJ, Chadeau-Hyam M, Clark TG, Lampariello R, Whittaker JC, De Iorio M, Balding DJ. Sequence-level population simulations over large genomic regions. Genetics 2007; 177:1725-31. [PMID: 17947444 PMCID: PMC2147962 DOI: 10.1534/genetics.106.069088] [Citation(s) in RCA: 88] [Impact Index Per Article: 5.2] [Received: 12/01/2006] [Accepted: 08/30/2007] [Indexed: 11/18/2022]
Abstract
Simulation is an invaluable tool for investigating the effects of various population genetics modeling assumptions on resulting patterns of genetic diversity, and for assessing the performance of statistical techniques, for example those designed to detect and measure the genomic effects of selection. It is also used to investigate the effectiveness of various design options for genetic association studies. Backward-in-time simulation methods are computationally efficient and have become widely used since their introduction in the 1980s. The forward-in-time approach has substantial advantages in terms of accuracy and modeling flexibility, but at greater computational cost. We have developed flexible and efficient simulation software and a rescaling technique to aid computational efficiency that together allow the simulation of sequence-level data over large genomic regions in entire diploid populations under various scenarios for demography, mutation, selection, and recombination, the latter including hotspots and gene conversion. Our forward evolution of genomic regions (FREGENE) software is freely available from www.ebi.ac.uk/projects/BARGEN together with an ancillary program to generate phenotype labels, either binary or quantitative. In this article we discuss limitations of coalescent-based simulation, introduce the rescaling technique that makes large-scale forward-in-time simulation feasible, and demonstrate the utility of various features of FREGENE, many not previously available.
|
54
|
Ioannidis JPA, Boffetta P, Little J, O'Brien TR, Uitterlinden AG, Vineis P, Balding DJ, Chokkalingam A, Dolan SM, Flanders WD, Higgins JPT, McCarthy MI, McDermott DH, Page GP, Rebbeck TR, Seminara D, Khoury MJ. Assessment of cumulative evidence on genetic associations: interim guidelines. Int J Epidemiol 2007; 37:120-32. [PMID: 17898028 DOI: 10.1093/ije/dym159] [Citation(s) in RCA: 442] [Impact Index Per Article: 26.0] [Indexed: 12/17/2022]
Abstract
Established guidelines for causal inference in epidemiological studies may be inappropriate for genetic associations. A consensus process was used to develop guidance criteria for assessing cumulative epidemiologic evidence in genetic associations. A proposed semi-quantitative index assigns three levels for the amount of evidence, extent of replication, and protection from bias, and also generates a composite assessment of 'strong', 'moderate' or 'weak' epidemiological credibility. In addition, we discuss how additional input and guidance can be derived from biological data. Future empirical research and consensus development are needed to develop an integrated model for combining epidemiological and biological evidence in the rapidly evolving field of investigation of genetic factors.
|
55
|
Graffelman J, Balding DJ, Gonzalez-Neira A, Bertranpetit J. Variation in estimated recombination rates across human populations. Hum Genet 2007; 122:301-10. [PMID: 17609980 DOI: 10.1007/s00439-007-0391-6] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Received: 11/27/2006] [Accepted: 06/01/2007] [Indexed: 12/11/2022]
Abstract
Recently it has been reported that recombination hotspots appear to be highly variable between humans and chimpanzees, and there is evidence for between-person variability in hotspots, and evolutionary transience. To understand the nature of variation in human recombination rates, it is important to describe patterns of variability across populations. Direct measurement of recombination rates remains infeasible on a large scale, and population-genetic approaches can be imprecise, and are affected by demographic history. Reports to date have suggested broad similarity in recombination rates at large genomic scales and across human populations. Here, we examine recombination rate estimates at a finer population and genomic scale: 28 worldwide populations and 107 SNPs in a 1 Mb stretch of chromosome 22q. We employ analysis of variance of recombination rate estimates, corrected for differences in effective population size using genome-wide microsatellite mutation rate estimates. We find substantial variation in fine-scale rates between populations, but reduced variation within continental groups. All effects examined (SNP-pair, region, population and interactions) were highly significant. Adjustment for effective population size made little difference to the conclusions. Observed hotspots tended to be conserved across populations, albeit at varying intensities. This holds particularly for populations from the same region, and also to a considerable degree across geographical regions. However, some hotspots appear to be population-specific. Several results from studies on the population history of humans are in accordance with our analysis. Our results suggest that between-population variation in DNA sequences may underlie recombination rate variation.
|
56
|
Leschziner GD, Andrew T, Leach JP, Chadwick D, Coffey AJ, Balding DJ, Bentley DR, Pirmohamed M, Johnson MR. Common ABCB1 polymorphisms are not associated with multidrug resistance in epilepsy using a gene-wide tagging approach. Pharmacogenet Genomics 2007; 17:217-20. [PMID: 17460550 DOI: 10.1097/01.fpc.0000230408.23146.b1] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Indexed: 11/26/2022]
Abstract
P-glycoprotein, the product of the ABCB1 gene, is a proposed mechanism of pharmacoresistance in epilepsy. Previous attempts to correlate the ABCB1 C3435T SNP, or a three-SNP haplotype containing C3435T with epilepsy pharmacoresistance have produced discordant findings. We analysed these single nucleotide polymorphisms (SNPs), plus a more comprehensive set of tagging SNPs describing common variation in ABCB1 in a case-control study. No significant association of C3435T (P=0.55), the three-SNP haplotype (lowest P=0.14) or any gene-wide tagging SNP (lowest P=0.17) with multidrug resistance in epilepsy was identified. Meta-analysis of studies using the same definition of multidrug resistance (n=1064) also demonstrated no significant association of C3435T with multidrug resistance (P=0.31). These findings suggest that C3435T is unlikely to be a marker for epilepsy multidrug resistance. In addition, no evidence for a role of other common ABCB1 polymorphisms was found using a potentially more powerful gene-wide tagging approach.
|
57
|
Baksh MF, Balding DJ, Vyse TJ, Whittaker JC. Family-based association analysis with ordered categorical phenotypes, covariates and interactions. Genet Epidemiol 2007; 31:1-8. [PMID: 17096343 DOI: 10.1002/gepi.20183] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022]
Abstract
Genetic association analyses of family-based studies with ordered categorical phenotypes are often conducted using methods either for quantitative or for binary traits, which can lead to suboptimal analyses. Here we present an alternative likelihood-based method of analysis for single nucleotide polymorphism (SNP) genotypes and ordered categorical phenotypes in nuclear families of any size. Our approach, which extends our previous work for binary phenotypes, permits straightforward inclusion of covariate, gene-gene and gene-covariate interaction terms in the likelihood, incorporates a simple model for ascertainment and allows for family-specific effects in the hypothesis test. Additionally, our method produces interpretable parameter estimates and valid confidence intervals. We assess the proposed method using simulated data, and apply it to a polymorphism in the c-reactive protein (CRP) gene typed in families collected to investigate human systemic lupus erythematosus. By including sex interactions in the analysis, we show that the polymorphism is associated with anti-nuclear autoantibody (ANA) production in females, while there appears to be no effect in males.
|
58
|
Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, Boutin P, Vincent D, Belisle A, Hadjadj S, Balkau B, Heude B, Charpentier G, Hudson TJ, Montpetit A, Pshezhetsky AV, Prentki M, Posner BI, Balding DJ, Meyre D, Polychronakos C, Froguel P. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 2007; 445:881-5. [PMID: 17293876 DOI: 10.1038/nature05616] [Citation(s) in RCA: 2056] [Impact Index Per Article: 120.9] [Received: 11/11/2006] [Accepted: 01/23/2007] [Indexed: 11/08/2022]
Abstract
Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants, most of which were hitherto unknown. A systematic search for these variants was recently made possible by the development of high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms. We tested 392,935 single-nucleotide polymorphisms in a French case-control cohort. Markers with the most significant difference in genotype frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort. This identified four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2 gene. These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in insulin-producing beta-cells, and two linkage disequilibrium blocks that contain genes potentially involved in beta-cell development or function (IDE-KIF11-HHEX and EXT2-ALX4). These associations explain a substantial portion of disease risk and constitute proof of principle for the genome-wide approach to the elucidation of complex genetic traits.
|
59
|
Balding DJ, Gastwirth JL. Introduction. Int Stat Rev 2007. [DOI: 10.1111/j.1751-5823.2003.tb00206.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
|
60
|
Abstract
Although genetic association studies have been with us for many years, even for the simplest analyses there is little consensus on the most appropriate statistical procedures. Here I give an overview of statistical approaches to population association studies, including preliminary analyses (Hardy-Weinberg equilibrium testing, inference of phase and missing data, and SNP tagging), and single-SNP and multipoint tests for association. My goal is to outline the key methods with a brief discussion of problems (population structure and multiple testing), avenues for solutions and some ongoing developments.
|
61
|
Leschziner G, Zabaneh D, Pirmohamed M, Owen A, Rogers J, Coffey AJ, Balding DJ, Bentley DB, Johnson MR. Exon sequencing and high resolution haplotype analysis of ABC transporter genes implicated in drug resistance. Pharmacogenet Genomics 2006; 16:439-50. [PMID: 16708052 DOI: 10.1097/01.fpc.0000197467.21964.67] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.2] [Indexed: 11/26/2022]
Abstract
BACKGROUND The ATP-binding cassette (ABC) proteins are a superfamily of efflux pumps implicated as a mechanism for multidrug resistance in cytotoxic chemotherapy, immunosuppressive therapy, HIV and epilepsy. Genetic variation in P-glycoprotein, the product of the ABCB1 gene, is proposed to mediate de novo drug resistance, but associations between polymorphisms in ABCB1 and pharmacoresistance have produced conflicting results. Potential explanations for the inconsistency of results include inadequate characterization of gene structure, variation and linkage disequilibrium (LD) in ABCB1, as well as overlap in substrate specificity between ABCB1 and the various other drug transporters. METHODS AND RESULTS We undertook a fundamental analysis of gene structure, variation and LD in ABCB1 and four other drug transporter genes implicated in pharmacoresistance: ABCC1, ABCC2, ABCC5 and ABCB4. Manual annotation of the five genes revealed nine shorter alternative transcripts with new untranslated regions and one novel region of coding sequence, demonstrating that on-line annotations are incomplete. Sequencing of exons in 47 Caucasian individuals identified 75 novel single nucleotide polymorphisms (SNPs) previously undescribed in any public database, including 14 new coding sequence SNPs. Genotyping of 502 SNPs in 842 Caucasian individuals across the five genes revealed large blocks of high LD, and low haplotype diversity across all five genes that could be characterized by between 67 and 114 tagging SNPs, depending on the tagging criteria. CONCLUSION The study illustrates that publicly available data resources on genomic organization of genes and common variation can have important gaps and limitations, and establishes a comprehensive set of tagging SNPs for future association studies in pharmacoresistance.
|
62
|
Leschziner G, Jorgensen AL, Andrew T, Pirmohamed M, Williamson PR, Marson AG, Coffey AJ, Middleditch C, Rogers J, Bentley DR, Chadwick DW, Balding DJ, Johnson MR. Clinical factors and ABCB1 polymorphisms in prediction of antiepileptic drug response: a prospective cohort study. Lancet Neurol 2006; 5:668-76. [PMID: 16857572 DOI: 10.1016/s1474-4422(06)70500-2] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.5] [Indexed: 11/12/2022]
Abstract
BACKGROUND The ABCB1 3435C-->T single-nucleotide polymorphism (SNP) or a three-SNP haplotype containing 3435C-->T has been implicated in multidrug resistance in epilepsy in three retrospective case-control studies, but a further three have failed to replicate the association. We aimed to determine the effect of the ABCB1 gene on epilepsy drug response, using a unique large cohort of epilepsy patients with prospectively measured seizure and drug response outcomes. METHODS The ABCB1 3435C-->T polymorphism and three-SNP haplotype, plus a comprehensive set of tag SNPs across ABCB1 and adjacent ABCB4, were genotyped in a cohort of 503 epilepsy patients with prospectively measured seizure and drug response outcomes. Clinical, demographic, and genetic data were analysed. Treatment outcome was measured in terms of time to 12-month remission, time to first seizure, and time to drug withdrawal due to inadequate seizure control or side-effects. Randomly selected genome-wide HapMap SNPs (n=129) were genotyped in all patients for genomic control. FINDINGS Number of seizures before treatment was the dominant feature predicting seizure outcome after starting antiepileptic drug therapy, measured by both time to first seizure (hazard ratio 1.34, 95% CI 1.21-1.49, p<0.0001) and time to 12-month remission (0.83, 0.73-0.94, p=0.003). There was no association of the ABCB1 3435C-->T polymorphism, the three-SNP haplotype, or any gene-wide tag SNP with time to first seizure after starting drug therapy, time to 12-month remission, or time to drug withdrawal due to unacceptable side-effects or to lack of seizure control. INTERPRETATION We found no evidence that ABCB1 common variation influences either seizure or drug withdrawal outcomes after initiation of antiepileptic drug therapy.
|
63
|
Mayor LR, Balding DJ. Discrimination of half-siblings when maternal genotypes are known. Forensic Sci Int 2006; 159:141-7. [PMID: 16153794 DOI: 10.1016/j.forsciint.2005.07.007] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Received: 07/08/2005] [Accepted: 07/18/2005] [Indexed: 11/23/2022]
Abstract
Given the DNA profiles of two individuals and one parent (say the mother) of each, we present likelihood ratios (LRs) comparing the hypothesis that they have the same father with the hypothesis of unrelated fathers. If the individuals have the same mother, the problem is to distinguish full- from half-siblings, otherwise we are comparing a half-sibling relationship with unrelated. We simulate STR profiles at up to 60 loci, based on allele proportions observed at 15 loci in three populations, and use them to approximate misclassification rates both for binary classification (e.g. "half-sib" versus "unrelated"), and when a third "cannot say" category is included. We find that reliable inferences in the absence of the mothers' profiles require many more STR loci than the 10-25 loci that are currently routinely available. However, profiling the two mothers conveys more discriminatory power than profiling the same number of additional loci in the individuals themselves. Our likelihood ratio formulas include a theta (or Fst) adjustment to allow for the individuals concerned to have recent shared ancestry (coancestry), relative to the population from which the allele frequency database is drawn. We illustrate that using an appropriate value of theta can reduce the average misclassification rate.
|
64
|
Baksh MF, Balding DJ, Vyse TJ, Whittaker JC. A Likelihood Ratio Approach to Family-based Association Studies with Covariates. Ann Hum Genet 2006; 70:131-9. [PMID: 16441262 DOI: 10.1111/j.1529-8817.2005.00189.x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
We introduce a procedure for association based analysis of nuclear families that allows for dichotomous and more general measurements of phenotype and inclusion of covariate information. Standard generalized linear models are used to relate phenotype and its predictors. Our test procedure, based on the likelihood ratio, unifies the estimation of all parameters through the likelihood itself and yields maximum likelihood estimates of the genetic relative risk and interaction parameters. Our method has advantages in modelling the covariate and gene-covariate interaction terms over recently proposed conditional score tests that include covariate information via a two-stage modelling approach. We apply our method in a study of human systemic lupus erythematosus and the C-reactive protein that includes sex as a covariate.
|
65
|
Waldron ERB, Whittaker JC, Balding DJ. Fine mapping of disease genes via haplotype clustering. Genet Epidemiol 2006; 30:170-9. [PMID: 16385468 DOI: 10.1002/gepi.20134] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Indexed: 11/10/2022]
Abstract
We propose an algorithm for analysing SNP-based population association studies, which is a development of that introduced by Molitor et al. [2003: Am J Hum Genet 73:1368-1384]. It uses clustering of haplotypes to overcome the major limitations of many current haplotype-based approaches. We define a between-haplotype score that is simple, yet appears to capture much of the information about evolutionary relatedness of the haplotypes in the vicinity of a (unobserved) putative causal locus. Haplotype clusters can then be defined via a putative ancestral haplotype and a cut-off distance. The number of an individual's two haplotypes that lie within the cluster predicts the individual's genotype at the causal locus. This predicted genotype can then be investigated for association with the phenotype of interest. We implement our approach within a Markov-chain Monte Carlo algorithm that, in effect, searches over locations and ancestral haplotypes to identify large, case-rich clusters. The algorithm successfully fine-maps a causal mutation in a test analysis using real data, and achieves almost 98% accuracy in predicting the genotype at the causal locus. A simulation study indicates that the new algorithm is substantially superior to alternative approaches, and it also allows us to identify situations in which multi-point approaches can substantially improve over single-SNP analyses. Our algorithm runs quickly and there is scope for extension to a wide range of disease models and genomic scales.
|
66
|
Setakis E, Stirnadel H, Balding DJ. Logistic regression protects against population structure in genetic association studies. Genome Res 2005; 16:290-6. [PMID: 16354752 PMCID: PMC1361725 DOI: 10.1101/gr.4346306] [Citation(s) in RCA: 87] [Impact Index Per Article: 4.6] [Indexed: 11/24/2022]
Abstract
We conduct an extensive simulation study to compare the merits of several methods for using null (unlinked) markers to protect against false positives due to cryptic substructure in population-based genetic association studies. The more sophisticated "structured association" methods perform well but are computationally demanding and rely on estimating the correct number of subpopulations. The simple and fast "genomic control" approach can lose power in certain scenarios. We find that procedures based on logistic regression that are flexible, computationally fast, and easy to implement also provide good protection against the effects of cryptic substructure, even though they do not explicitly model the population structure.
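For orientation, the "genomic control" idea mentioned in this abstract can be sketched in a few lines. This is a minimal illustration, not the procedure evaluated in the paper and not the logistic regression approach it recommends: it assumes 1-df chi-square association statistics computed at null (unlinked) markers, and the function names and example values are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_DF1_MEDIAN = 0.4549364


def genomic_control_lambda(null_marker_stats):
    """Estimate the inflation factor from null-marker test statistics.

    Assumption: each statistic is a 1-df chi-square association test
    at a marker unlinked to the trait. The factor is floored at 1 so
    that statistics are never inflated by the correction.
    """
    lam = statistics.median(null_marker_stats) / CHI2_DF1_MEDIAN
    return max(lam, 1.0)


def adjust_statistic(candidate_stat, lam):
    """Deflate a candidate marker's 1-df statistic by the inflation factor."""
    return candidate_stat / lam


# Hypothetical null-marker statistics showing inflation from substructure.
null_stats = [0.3, 0.7, 0.9, 1.1, 2.4]
lam = genomic_control_lambda(null_stats)
print(lam, adjust_statistic(10.0, lam))
```

Dividing every candidate statistic by a single factor is what makes the approach fast, and is also why it can lose power in some scenarios, as the abstract notes.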
|
67
|
Ayres KL, Balding DJ. Paternity index calculations when some individuals share common ancestry. Forensic Sci Int 2005; 151:101-3. [PMID: 15935949 DOI: 10.1016/j.forsciint.2004.10.007] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 08/02/2004] [Revised: 10/14/2004] [Accepted: 10/15/2004] [Indexed: 11/20/2022]
|
68
|
Abstract
The identification of signatures of natural selection in genomic surveys has become an area of intense research, stimulated by the increasing ease with which genetic markers can be typed. Loci identified as subject to selection may be functionally important, and hence (weak) candidates for involvement in disease causation. They can also be useful in determining the adaptive differentiation of populations, and exploring hypotheses about speciation. Adaptive differentiation has traditionally been identified from differences in allele frequencies among different populations, summarised by an estimate of FST. Low outliers relative to an appropriate neutral population-genetics model indicate loci subject to balancing selection, whereas high outliers suggest adaptive (directional) selection. However, the problem of identifying statistically significant departures from neutrality is complicated by confounding effects on the distribution of FST estimates, and current methods have not yet been tested in large-scale simulation experiments. Here, we simulate data from a structured population at many unlinked, diallelic loci that are predominantly neutral but with some loci subject to adaptive or balancing selection. We develop a hierarchical-Bayesian method, implemented via Markov chain Monte Carlo (MCMC), and assess its performance in distinguishing the loci simulated under selection from the neutral loci. We also compare this performance with that of a frequentist method, based on moment-based estimates of FST. We find that both methods can identify loci subject to adaptive selection when the selection coefficient is at least five times the migration rate. Neither method could reliably distinguish loci under balancing selection in our simulations, even when the selection coefficient is twenty times the migration rate.
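The quantity at the centre of this abstract can be made concrete with a short sketch. It assumes the simplest textbook form of FST for a single diallelic locus, namely the variance of subpopulation allele frequencies divided by p(1 - p) at the mean frequency, with equally weighted subpopulations and frequencies treated as known. It is neither the hierarchical-Bayesian method nor the moment-based estimator compared in the paper, and the function name is hypothetical.

```python
def fst_from_frequencies(subpop_freqs):
    """FST at one diallelic locus from subpopulation allele frequencies.

    Assumptions (illustrative only): equal subpopulation weights and
    frequencies known without sampling error.
    """
    k = len(subpop_freqs)
    if k < 2:
        raise ValueError("need at least two subpopulations")
    p_bar = sum(subpop_freqs) / k
    if p_bar <= 0.0 or p_bar >= 1.0:
        raise ValueError("locus is not polymorphic across subpopulations")
    # Variance of allele frequency among subpopulations.
    var_p = sum((p - p_bar) ** 2 for p in subpop_freqs) / k
    return var_p / (p_bar * (1.0 - p_bar))


# Hypothetical loci: similar frequencies give low FST, divergent ones high FST.
print(fst_from_frequencies([0.48, 0.50, 0.52]))
print(fst_from_frequencies([0.10, 0.50, 0.90]))
```

In the terms used above, a locus whose value is unusually high relative to the neutral expectation would be a candidate for adaptive selection, and one that is unusually low a candidate for balancing selection.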
|
69
|
Mayor LR, Fleming KP, Müller A, Balding DJ, Sternberg MJE. Clustering of protein domains in the human genome. J Mol Biol 2004; 340:991-1004. [PMID: 15236962 DOI: 10.1016/j.jmb.2004.05.036] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Received: 11/07/2003] [Revised: 03/30/2004] [Accepted: 05/17/2004] [Indexed: 11/30/2022]
Abstract
We present a systematic study of the clustering of genes within the human genome based on homology inferred from both sequence and structural similarity. The 3D-Genomics automated proteome annotation pipeline () was utilised to infer homology for each protein domain in the genome, for the 26 superfamilies most highly represented in the Structural Classification Of Proteins (SCOP) database. This approach enabled us to identify homologues that could not be detected by sequence-based methods alone. For each superfamily, we investigated the distribution, both within and among chromosomes, of genes encoding at least one domain within the superfamily. The results indicate a diversity of clustering behaviours: some superfamilies showed no evidence of any clustering, and others displayed significant clustering either within or among chromosomes, or both. Removal of tandem repeats reduced the levels of clustering observed, but some superfamilies still displayed highly significant clustering. Thus, our study suggests that either the process of gene duplication, or the evolution of the resulting clusters, differs between structural superfamilies.
|
70
|
Morris AP, Whittaker JC, Balding DJ. Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data. Am J Hum Genet 2004; 74:945-53. [PMID: 15077198 PMCID: PMC1181987 DOI: 10.1086/420773] [Citation(s) in RCA: 57] [Impact Index Per Article: 2.9] [Received: 01/05/2004] [Accepted: 02/12/2004] [Indexed: 11/03/2022]
Abstract
We present the results of a simulation study that indicate that true haplotypes at multiple, tightly linked loci often provide little extra information for linkage-disequilibrium fine mapping, compared with the information provided by corresponding genotypes, provided that an appropriate statistical analysis method is used. In contrast, a two-stage approach to analyzing genotype data, in which haplotypes are inferred and then analyzed as if they were true haplotypes, can lead to a substantial loss of information. The study uses our COLDMAP software for fine mapping, which implements a Markov chain-Monte Carlo algorithm that is based on the shattered coalescent model of genetic heterogeneity at a disease locus. We applied COLDMAP to 100 replicate data sets simulated under each of 18 disease models. Each data set consists of haplotype pairs (diplotypes) for 20 SNPs typed at equal 50-kb intervals in a 950-kb candidate region that includes a single disease locus located at random. The data sets were analyzed in three formats: (1) as true haplotypes; (2) as haplotypes inferred from genotypes using an expectation-maximization algorithm; and (3) as unphased genotypes. On average, true haplotypes gave a 6% gain in efficiency compared with the unphased genotypes, whereas inferring haplotypes from genotypes led to a 20% loss of efficiency, where efficiency is defined in terms of root mean integrated square error of the location of the disease locus. Furthermore, treating inferred haplotypes as if they were true haplotypes leads to considerable overconfidence in estimates, with nominal 50% credibility intervals achieving, on average, only 19% coverage. We conclude that (1) given appropriate statistical analyses, the costs of directly measuring haplotypes will rarely be justified by a gain in the efficiency of fine mapping and that (2) a two-stage approach of inferring haplotypes followed by a haplotype-based analysis can be very inefficient for fine mapping, compared with an analysis based directly on the genotypes.
|
71
|
Morris AP, Whittaker JC, Xu CF, Hosking LK, Balding DJ. Multipoint linkage-disequilibrium mapping narrows location interval and identifies mutation heterogeneity. Proc Natl Acad Sci U S A 2003; 100:13442-6. [PMID: 14597696 PMCID: PMC263833 DOI: 10.1073/pnas.2235031100] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.2] [Indexed: 12/29/2022] Open
Abstract
Single-nucleotide polymorphism (SNP) genotypes were recently examined in an 890-kb region flanking the human gene CYP2D6. Single-marker and haplotype-based analyses identified, with genomewide significance (P < 10⁻⁷), a 403-kb interval displaying strong linkage disequilibrium (LD) with predicted poor-metabolizer phenotype. However, the width of this interval makes the location of causal variants difficult: for example, the interval contains seven known or predicted genes in addition to CYP2D6. We have developed the Bayesian fine-mapping software coldmap, which, applied to these genotype data, yields a 95% location interval covering only 185 kb and establishes genomewide significance for a causal locus within the region. Strikingly, our interval correctly excludes four SNPs, which individually display association with genomewide significance, including the SNP showing strongest LD (P < 10⁻³⁴). In addition, coldmap distinguishes homozygous cases for the major CYP2D6 mutation from those bearing minor mutations. We further investigate a selection of SNP subsets and find that previously reported methods lead to a 38% savings in SNPs at the cost of an increase of <20% in the width of the location interval.
|
72
|
Abstract
We review Wright's original definitions of the genetic correlation coefficients F(ST), F(IT), and F(IS), pointing out ambiguities and the difficulties that these have generated. We also briefly survey some subsequent approaches to defining and estimating the coefficients. We then propose a general framework in which the coefficients are defined, their properties established, and likelihood-based inference implemented. Likelihood methods of inference are proposed both for bi-allelic and multi-allelic loci, within a hierarchical model which allows sharing of information both across subpopulations and across loci, but without assuming constancy in either case. This framework can be used, for example, to detect environment-related diversifying selection.
|
73
|
Phillips MS, Lawrence R, Sachidanandam R, Morris AP, Balding DJ, Donaldson MA, Studebaker JF, Ankener WM, Alfisi SV, Kuo FS, Camisa AL, Pazorov V, Scott KE, Carey BJ, Faith J, Katari G, Bhatti HA, Cyr JM, Derohannessian V, Elosua C, Forman AM, Grecco NM, Hock CR, Kuebler JM, Lathrop JA, Mockler MA, Nachtman EP, Restine SL, Varde SA, Hozza MJ, Gelfand CA, Broxholme J, Abecasis GR, Boyce-Jacino MT, Cardon LR. Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat Genet 2003; 33:382-7. [PMID: 12590262 DOI: 10.1038/ng1100] [Citation(s) in RCA: 217] [Impact Index Per Article: 10.3] [Received: 09/17/2002] [Accepted: 01/16/2003] [Indexed: 01/30/2023]
Abstract
Recent studies of human populations suggest that the genome consists of chromosome segments that are ancestrally conserved ('haplotype blocks'; refs. 1-3) and have discrete boundaries defined by recombination hot spots. Using publicly available genetic markers, we have constructed a first-generation haplotype map of chromosome 19. As expected for this marker density, approximately one-third of the chromosome is encompassed within haplotype blocks. Evolutionary modeling of the data indicates that recombination hot spots are not required to explain most of the observed blocks, providing that marker ascertainment and the observed marker spacing are considered. In contrast, several long blocks are inconsistent with our evolutionary models, and different mechanisms could explain their origins.
|
74
|
Beaumont MA, Zhang W, Balding DJ. Approximate Bayesian computation in population genetics. Genetics 2002; 162:2025-2035. [PMID: 12524368 DOI: 10.0000/pmid12524368] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 05/25/2023] Open
Abstract
We propose a new method for approximate Bayesian statistical inference on the basis of summary statistics. The method is suited to complex problems that arise in population genetics, extending ideas developed in this setting by earlier authors. Properties of the posterior distribution of a parameter, such as its mean or density curve, are approximated without explicit likelihood calculations. This is achieved by fitting a local-linear regression of simulated parameter values on simulated summary statistics, and then substituting the observed summary statistics into the regression equation. The method combines many of the advantages of Bayesian statistical inference with the computational efficiency of methods based on summary statistics. A key advantage of the method is that the nuisance parameters are automatically integrated out in the simulation step, so that the large numbers of nuisance parameters that arise in population genetics problems can be handled without difficulty. Simulation results indicate computational and statistical efficiency that compares favorably with those of alternative methods previously proposed in the literature. We also compare the relative efficiency of inferences obtained using methods based on summary statistics with those obtained directly from the data using MCMC.
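The regression step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`abc_local_linear`, `demo`), the toy model (a normal mean with a uniform prior), the acceptance fraction, and the choice of an Epanechnikov kernel for the local weights are all assumptions made here for illustration. The sketch simulates parameters and summary statistics, keeps the simulations closest to the observed summary, fits a weighted linear regression of parameter on summary, and shifts each accepted parameter to the observed summary value.

```python
import numpy as np

def abc_local_linear(theta_sim, s_sim, s_obs, accept_frac=0.1):
    """Regression-adjusted ABC sketch (hypothetical helper, not a library API).

    theta_sim: simulated parameter values, shape (n,)
    s_sim:     simulated summary statistics, shape (n,) or (n, k)
    s_obs:     observed summary statistics, scalar or shape (k,)
    Returns the adjusted parameter values and their kernel weights.
    """
    theta_sim = np.asarray(theta_sim, dtype=float)
    s_sim = np.asarray(s_sim, dtype=float).reshape(len(theta_sim), -1)
    s_obs = np.asarray(s_obs, dtype=float).reshape(-1)

    # Put each summary statistic on a comparable scale before measuring distance.
    scale = s_sim.std(axis=0)
    scale[scale == 0] = 1.0
    diff = (s_sim - s_obs) / scale
    dist = np.sqrt((diff ** 2).sum(axis=1))

    # Keep the closest fraction of simulations (assumed tolerance rule).
    delta = np.quantile(dist, accept_frac)
    keep = dist <= delta

    # Epanechnikov weights: largest for simulations nearest the observed summary.
    w = 1.0 - (dist[keep] / delta) ** 2

    # Weighted least squares of theta on (summary - observed summary).
    X = np.column_stack([np.ones(keep.sum()), diff[keep]])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], theta_sim[keep] * sw, rcond=None)

    # Substitute the observed summary: remove the fitted linear trend.
    adjusted = theta_sim[keep] - diff[keep] @ coef[1:]
    return adjusted, w

def demo(seed=0):
    """Toy problem (assumed): infer the mean of N(mu, 1) from 50 observations."""
    rng = np.random.default_rng(seed)
    n_obs, n_sim = 50, 5000
    observed = rng.normal(1.5, 1.0, size=n_obs)
    s_obs = observed.mean()

    theta = rng.uniform(-5.0, 5.0, size=n_sim)               # draws from the prior
    data = rng.normal(theta[:, None], 1.0, size=(n_sim, n_obs))
    s_sim = data.mean(axis=1)                                # summary = sample mean

    adjusted, w = abc_local_linear(theta, s_sim, s_obs)
    return s_obs, np.average(adjusted, weights=w)

if __name__ == "__main__":
    s_obs, post_mean = demo()
    print(f"observed summary: {s_obs:.3f}, ABC posterior mean: {post_mean:.3f}")
```

In this toy setting the prior is flat near the data, so the approximate posterior mean should land close to the observed sample mean. No likelihood is evaluated at any point; only simulation from the model is needed.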
|