151
|
Abstract
Coalescent theory deals with the dynamics of how sampled genetic material has spread through a population from a single ancestor over many generations and is ubiquitous in contemporary molecular population genetics. Inherent in most applications is a continuous-time approximation that is derived under the assumption that sample size is small relative to the actual population size. In effect, this precludes multiple and simultaneous coalescent events that take place in the history of large samples. If sequences do not recombine, the number of sequences ancestral to a large sample is reduced sufficiently after relatively few generations such that use of the continuous-time approximation is justified. However, in tracing the history of large chromosomal segments, a large recombination rate per generation will consistently maintain a large number of ancestors. This can create a major disparity between discrete-time and continuous-time models and we analyze its importance, illustrated with model parameters typical of the human genome. The presence of gene conversion exacerbates the disparity and could seriously undermine applications of coalescent theory to complete genomes. However, we show that multiple and simultaneous coalescent events influence global quantities, such as total number of ancestors, but have negligible effect on local quantities, such as linkage disequilibrium. Reassuringly, most applications of the coalescent model with recombination (including association mapping) focus on local quantities.
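The disparity described above can be illustrated with a small calculation. The sketch below assumes a haploid Wright-Fisher population of N individuals in which each of n sampled lineages picks a parent uniformly at random; the function names are hypothetical and are not taken from the paper. It compares the exact expected number of distinct parents one generation back with a first-order approximation that only counts single pairwise mergers, which is the regime the continuous-time coalescent relies on.

```python
from math import comb


def expected_parents_discrete(n: int, N: int) -> float:
    """Exact expected number of distinct parents of n lineages when each
    lineage picks one of N parents uniformly at random (one generation)."""
    return N * (1.0 - (1.0 - 1.0 / N) ** n)


def expected_parents_single_merger(n: int, N: int) -> float:
    """First-order approximation: lineages are lost only through single
    pairwise mergers, at rate C(n, 2) / N per generation."""
    return n - comb(n, 2) / N


if __name__ == "__main__":
    N = 10_000  # assumed population size, for illustration only
    for n in (10, 100, 1_000, 5_000):
        exact = expected_parents_discrete(n, N)
        approx = expected_parents_single_merger(n, N)
        print(n, round(exact, 2), round(approx, 2))
```

When n is small relative to N the two quantities nearly coincide; as n approaches N they diverge, which is consistent with the abstract's point that multiple and simultaneous mergers matter for large samples.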
|
152
|
Abstract
We consider inference for demographic models and parameters based upon postprocessing the output of an MCMC method that generates samples of genealogical trees (from the posterior distribution for a specific prior distribution of the genealogy). This approach has the advantage of taking account of the uncertainty in the inference for the tree when making inferences about the demographic model and can be computationally efficient in terms of reanalyzing data under a wide variety of models. We consider a (simulation-consistent) estimate of the likelihood for variable population size models, which uses importance sampling, and propose two new approximate likelihoods, one for migration models and one for continuous spatial models.
Affiliation(s)
- Loukia Meligkotsidou
- Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, United Kingdom.
|
153
|
Woerner AE, Cox MP, Hammer MF. Recombination-filtered genomic datasets by information maximization. Bioinformatics 2007; 23:1851-3. [PMID: 17519249 DOI: 10.1093/bioinformatics/btm253] [Citation(s) in RCA: 154] [Impact Index Per Article: 9.1] [Indexed: 11/13/2022]
Abstract
With the increasing amount of DNA sequence data available from natural populations, new computational methods are needed to efficiently process raw sequences into formats that are applicable to a variety of analytical methods. One highly successful approach to inferring aspects of demographic history is grounded in coalescent theory. Many of these methods restrict themselves to perfectly tree-like genealogies (i.e. regions with no observed recombination), because theoretical difficulties prevent ready statistical evaluation of recombining regions. However, determining which recombination-filtered dataset to analyze from a larger recombination-rich genomic region is a non-trivial problem. Current applications primarily aim to quantify recombination rates (rather than produce optimal recombination-filtered blocks), require significant manual intervention, and are impractical for multiple genomic datasets in high-throughput, automated research environments. Here, we present a fast, simple and automatable command-line program that extracts optimal recombination-filtered blocks (no four-gamete violations) from recombination-rich genomic re-sequence data. Availability: http://hammerlab.biosci.arizona.edu/software.html.
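The four-gamete criterion mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes phased biallelic haplotypes given as equal-length strings, and it takes "optimal" to mean the contiguous block with the most sites, which is a simplifying assumption (the paper's title refers to information maximization). The function names are hypothetical.

```python
def four_gamete_violation(haplotypes, i, j):
    """True if sites i and j show all four allele combinations,
    which under infinite sites implies at least one recombination."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4


def longest_compatible_block(haplotypes):
    """Return (start, end), inclusive, of the longest run of contiguous
    sites in which no pair of sites violates the four-gamete test."""
    n_sites = len(haplotypes[0])
    best = (0, 0)
    start = 0
    for end in range(n_sites):
        # Shrink the window from the left until the new site is
        # compatible with every site that remains in the window.
        while any(four_gamete_violation(haplotypes, k, end)
                  for k in range(start, end)):
            start += 1
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


if __name__ == "__main__":
    haps = ["000", "011", "100", "111"]  # toy data, illustration only
    print(longest_compatible_block(haps))
```

The sliding window works because any sub-block of a compatible block is itself compatible, so the left boundary never needs to move backwards.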
Affiliation(s)
- August E Woerner
- Arizona Research Laboratories-Biotechnology, University of Arizona, Tucson, AZ 85721, USA
|
154
|
Calabrese P. A population genetics model with recombination hotspots that are heterogeneous across the population. Proc Natl Acad Sci U S A 2007; 104:4748-52. [PMID: 17360595 PMCID: PMC1838671 DOI: 10.1073/pnas.0610195104] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 11/16/2006] [Indexed: 11/18/2022]
Abstract
Both sperm typing and linkage disequilibrium patterns from large population genetic data sets have demonstrated that recombination hotspots are responsible for much of the recombination activity in the human genome. Sperm typing has also revealed that some hotspots are heterogeneous in the population, and linkage disequilibrium patterns from the chimpanzee have implied that hotspots change on a timescale no longer than the separation time between these species. We propose a population genetics model, inspired by the double-strand break model, which features recombination hotspots that are heterogeneous across the population and whose population frequency changes with time. We have derived a diffusion approximation and written a coalescent simulation program. This model has implications for the "hotspot paradox."
Affiliation(s)
- Peter Calabrese
- Molecular and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089-2910, USA.
|
155
|
Griswold CK, Logsdon B, Gomulkiewicz R. Neutral evolution of multiple quantitative characters: a genealogical approach. Genetics 2007; 176:455-66. [PMID: 17339224 PMCID: PMC1893077 DOI: 10.1534/genetics.106.069658] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
Abstract
The G matrix measures the components of phenotypic variation that are genetically heritable. The structure of G, that is, its principal components and their associated variances, determines, in part, the direction and speed of multivariate trait evolution. In this article we present a framework and results that give the structure of G under the assumption of neutrality. We suggest that a neutral expectation of the structure of G is important because it gives a null expectation for the structure of G from which the unique consequences of selection can be determined. We demonstrate how the processes of mutation, recombination, and drift shape the structure of G. Furthermore, we demonstrate how shared common ancestry between segregating alleles shapes the structure of G. Our results show that shared common ancestry, which manifests itself in the form of a gene genealogy, causes the structure of G to be nonuniform in that the variances associated with the principal components of G decline at an approximately exponential rate. Furthermore we show that the extent of the nonuniformity in the structure of G is enhanced with declines in mutation rates, recombination rates, and numbers of loci and is dependent on the pattern and modality of mutation.
Affiliation(s)
- Cortland K Griswold
- School of Biological Sciences, Washington State University, Pullman, Washington 99164, USA
|
156
|
Abstract
We describe a model-based method for using multilocus sequence data to infer the clonal relationships of bacteria and the chromosomal position of homologous recombination events that disrupt a clonal pattern of inheritance. The key assumption of our model is that recombination events introduce a constant rate of substitutions to a contiguous region of sequence. The method is applicable both to multilocus sequence typing (MLST) data from a few loci and to alignments of multiple bacterial genomes. It can be used to decide whether a subset of isolates share common ancestry, to estimate the age of the common ancestor, and hence to address a variety of epidemiological and ecological questions that hinge on the pattern of bacterial spread. It should also be useful in associating particular genetic events with the changes in phenotype that they cause. We show that the model outperforms existing methods of subdividing recombinogenic bacteria using MLST data and provide examples from Salmonella and Bacillus. The software used in this article, ClonalFrame, is available from http://bacteria.stats.ox.ac.uk/.
Affiliation(s)
- Xavier Didelot
- Department of Statistics, University of Oxford, Oxford OX1 3SY, United Kingdom
|
157
|
Marjoram P, Tavaré S. Modern computational approaches for analysing molecular genetic variation data. Nat Rev Genet 2006; 7:759-70. [PMID: 16983372 DOI: 10.1038/nrg1961] [Citation(s) in RCA: 158] [Impact Index Per Article: 8.8] [Indexed: 11/09/2022]
Abstract
An explosive growth is occurring in the quantity, quality and complexity of molecular variation data that are being collected. Historically, such data have been analysed by using model-based methods. Models are useful for sharpening intuition, for explanation and for prediction: they add to our understanding of how the data were formed, and they can provide quantitative answers to questions of interest. We outline some of these model-based approaches, including the coalescent, and discuss the applicability of the computational methods that are necessary given the highly complex nature of current and future data sets.
Affiliation(s)
- Paul Marjoram
- University of Southern California, Keck School of Medicine, Preventive Medicine, 1540 Alcazar Street, CHP-220, Los Angeles, California 90089-99011, USA
|
158
|
Mailund T, Besenbacher S, Schierup MH. Whole genome association mapping by incompatibilities and local perfect phylogenies. BMC Bioinformatics 2006; 7:454. [PMID: 17042942 PMCID: PMC1624851 DOI: 10.1186/1471-2105-7-454] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Received: 07/01/2006] [Accepted: 10/16/2006] [Indexed: 11/21/2022]
Abstract
Background: With current technology, vast amounts of data can be cheaply and efficiently produced in association studies. To prevent data analysis from becoming the bottleneck of such studies, fast and efficient analysis methods that scale to such data set sizes must be developed.
Results: We present a fast method for accurate localisation of disease-causing variants in high-density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale for this is that the perfect phylogeny near a disease-affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatible markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method are investigated on 1) simulated genotype data under different models of disease determination, 2) artificial data sets created from the HapMap resource, and 3) data sets used for testing other methods, in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease-causing mutation and a constant recombination rate. However, in more complex scenarios of mutation heterogeneity and more complex haplotype structure, such as that found in the HapMap data, our method outperforms SMA as well as other fast data-mining approaches such as HapMiner and Haplotype Pattern Mining (HPM), despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variant in an empirical data set (the ΔF508 mutation for cystic fibrosis) and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this dataset the highest association score is about 60 kb from the CYP2D6 gene.
Conclusion: Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome-wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
Affiliation(s)
- Thomas Mailund
- Department of Statistics, University of Oxford, UK
- Bioinformatics Research Center, University of Aarhus, Denmark
|
159
|
Hellenthal G, Stephens M. Insights into recombination from population genetic variation. Curr Opin Genet Dev 2006; 16:565-72. [PMID: 17049225 DOI: 10.1016/j.gde.2006.10.001] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Received: 07/19/2006] [Accepted: 10/04/2006] [Indexed: 11/20/2022]
Abstract
Patterns of genetic variation in natural populations are shaped by, and hence carry valuable information about, the underlying recombination process. In the past five years, the increasing availability of large-scale population genetic data on dense sets of markers, coupled with advances in statistical methods for extracting information from these data, have led to several important advances in our understanding of the recombination process in humans. These advances include the identification of large numbers of 'hotspots', where recombination appears to take place considerably more frequently than in the surrounding sequence, and the identification of DNA sequence motifs that are associated with the locations of these hotspots.
Affiliation(s)
- Garrett Hellenthal
- Department of Statistics, University of Washington, Seattle, WA 98195, USA
|
160
|
Li J, Zhang MQ, Zhang X. A new method for detecting human recombination hotspots and its applications to the HapMap ENCODE data. Am J Hum Genet 2006; 79:628-39. [PMID: 16960799 PMCID: PMC1592557 DOI: 10.1086/508066] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Received: 05/09/2006] [Accepted: 07/25/2006] [Indexed: 11/03/2022]
Abstract
Computational detection of recombination hotspots from population polymorphism data is important both for understanding the nature of recombination and for applications such as association studies. We propose a new method for this task based on a multiple-hotspot model and an (approximate) log-likelihood ratio test. A truncated, weighted pairwise log-likelihood is introduced and applied to the calculation of the log-likelihood ratio, and a forward-selection procedure is adopted to search for the optimal hotspot predictions. The method shows a relatively high power with a low false-positive rate in detecting multiple hotspots in simulation data and has a performance comparable to the best results of leading computational methods in experimental data for which recombination hotspots have been characterized by sperm-typing experiments. The method can be applied to both phased and unphased data directly, with a very fast computational speed. We applied the method to the ten 500-kb regions of the HapMap ENCODE data and found 172 hotspots among the three populations, with average hotspot width of 2.4 kb. By comparisons with the simulation data, we found some evidence that hotspots are not all identical across populations. The correlations between detected hotspots and several genomic characteristics were examined. In particular, we observed that DNaseI-hypersensitive sites are enriched in hotspots, suggesting the existence of human beta hotspots similar to those found in yeast.
Affiliation(s)
- Jun Li
- Bioinformatics Division, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China
|
161
|
Abstract
Determining the evolutionary relationships between fossil hominid groups such as Neanderthals and modern humans has been a question of enduring interest in human evolutionary genetics. Here we present a new method for addressing whether archaic human groups contributed to the modern gene pool (called ancient admixture), using the patterns of variation in contemporary human populations. Our method improves on previous work by explicitly accounting for recent population history before performing the analyses. Using sequence data from the Environmental Genome Project, we find strong evidence for ancient admixture in both a European and a West African population (p ≈ 10⁻⁷), with contributions to the modern gene pool of at least 5%. While Neanderthals form an obvious archaic source population candidate in Europe, there is not yet a clear source population candidate in West Africa.
Determining the evolutionary relationships between modern humans and fossil hominine groups such as Neanderthals has been a question of enduring interest in human evolutionary genetics. In this paper, Plagnol and Wall present a new method for addressing whether archaic human groups contributed to the modern gene pool. Using sequence data from the Environmental Genome Project, they find strong evidence for ancient admixture in both a European and a West African population, with contributions to the modern gene pool of at least 5%. While Neanderthals form an obvious archaic source population candidate in Europe, there is not yet a clear source population candidate in West Africa. The authors' results have direct implications for the competing models of modern human origins. In particular, their estimates of non-negligible contributions of archaic populations to the modern gene pool are inconsistent with strict forms of the Recent African Origin model, which posits that modern humans evolved in a single location in Africa and from there spread and replaced all other existing hominines.
Affiliation(s)
- Vincent Plagnol
- Department of Molecular and Computational Biology, University of Southern California, Los Angeles, California, USA.
|
162
|
Padhukasahasram B, Wall JD, Marjoram P, Nordborg M. Estimating recombination rates from single-nucleotide polymorphisms using summary statistics. Genetics 2006; 174:1517-28. [PMID: 16980396 PMCID: PMC1667054 DOI: 10.1534/genetics.106.060723] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022]
Abstract
We describe a novel method for jointly estimating crossing-over and gene-conversion rates from population genetic data using summary statistics. The performance of our method was tested on simulated data sets and compared with the composite-likelihood method of R. R. Hudson. For several realistic parameter values, the new method performed similarly to the composite-likelihood approach for estimating crossing-over rates and better when estimating gene-conversion rates. We used our method to analyze a human data set recently genotyped by Perlegen Sciences.
Affiliation(s)
- Badri Padhukasahasram
- Molecular and Computational Biology and Biostatistics Division, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California 90089, USA.
|
163
|
Wiuf C. Consistency of estimators of population scaled parameters using composite likelihood. J Math Biol 2006; 53:821-41. [PMID: 16960689 DOI: 10.1007/s00285-006-0031-0] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Received: 10/04/2005] [Revised: 07/17/2006] [Indexed: 11/28/2022]
Abstract
Composite likelihood methods have become very popular for the analysis of large-scale genomic data sets because of the computational intractability of the basic coalescent process and its generalizations: It is virtually impossible to calculate the likelihood of an observed data set spanning a large chromosomal region without using approximate or heuristic methods. Composite likelihood methods are approximate methods and, in the present article, assume the likelihood is written as a product of likelihoods, one for each of a number of smaller regions that together make up the whole region from which data is collected. A very general framework for neutral coalescent models is presented and discussed. The framework comprises many of the most popular coalescent models that are currently used for analysis of genetic data. Assume data is collected from a series of consecutive regions of equal size. Then it is shown that the observed data forms a stationary, ergodic process. General conditions are given under which the maximum composite estimator of the parameters describing the model (e.g. mutation rates, demographic parameters and the recombination rate) is a consistent estimator as the number of regions tends to infinity.
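The product form described in the abstract can be written out as a short worked equation. The notation is an assumption introduced here for illustration, not taken from the article: K is the number of consecutive regions, D_k the data from region k, L_k the likelihood for that region alone, and θ the vector of model parameters.

```latex
% Composite likelihood as a product over K regions (assumed notation)
\mathrm{CL}(\theta) = \prod_{k=1}^{K} L_k(\theta; D_k),
\qquad
\hat{\theta}_K = \arg\max_{\theta} \sum_{k=1}^{K} \log L_k(\theta; D_k).
```

In this notation, the consistency result stated in the abstract says that, under the given conditions, the estimator converges to the true parameter value as K tends to infinity.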
Affiliation(s)
- Carsten Wiuf
- Bioinformatics Research Center, University of Aarhus, Høegh-Guldbergsgade 10, Building 1090, 8000 Aarhus C, Denmark.
|
164
|
Song YS, Lyngsø R, Hein J. Counting all possible ancestral configurations of sample sequences in population genetics. IEEE/ACM Trans Comput Biol Bioinform 2006; 3:239-51. [PMID: 17048462 DOI: 10.1109/tcbb.2006.31] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 05/12/2023]
Abstract
Given a set D of input sequences, a genealogy for D can be constructed backward in time using such evolutionary events as mutation, coalescent, and recombination. An ancestral configuration (AC) can be regarded as the multiset of all sequences present at a particular point in time in a possible genealogy for D. The complexity of computing the likelihood of observing D depends heavily on the total number of distinct ACs of D and, therefore, it is of interest to estimate that number. For D consisting of binary sequences of finite length, we consider the problem of enumerating exactly all distinct ACs. We assume that the root sequence type is known and that the mutation process is governed by the infinite-sites model. When there is no recombination, we construct a general method of obtaining closed-form formulas for the total number of ACs. The enumeration problem becomes much more complicated when recombination is involved. In that case, we devise a method of enumeration based on counting contingency tables and construct a dynamic programming algorithm for the approach. Last, we describe a method of counting the number of ACs that can appear in genealogies with less than or equal to a given number R of recombinations. Of particular interest is the case in which R is close to the minimum number of recombinations for D.
Affiliation(s)
- Yun S Song
- Department of Computer Science, University of California at Davis, 2063 Kemper Hall, One Shields Avenue, Davis, CA 95616, USA.
|
165
|
Morrell PL, Toleno DM, Lundy KE, Clegg MT. Estimating the contribution of mutation, recombination and gene conversion in the generation of haplotypic diversity. Genetics 2006; 173:1705-23. [PMID: 16624913 PMCID: PMC1526701 DOI: 10.1534/genetics.105.054502] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.9] [Received: 12/16/2005] [Accepted: 04/11/2006] [Indexed: 11/18/2022]
Abstract
Recombination occurs through both homologous crossing over and homologous gene conversion during meiosis. The contribution of recombination relative to mutation is expected to be dramatically reduced in inbreeding organisms. We report coalescent-based estimates of the recombination parameter (rho) relative to estimates of the mutation parameter (theta) for 18 genes from the highly self-fertilizing grass, wild barley, Hordeum vulgare ssp. spontaneum. Estimates of rho/theta are much greater than expected, with a mean rho/theta approximately 1.5, similar to estimates from outcrossing species. We also estimate rho with and without the contribution of gene conversion. Genotyping errors can mimic the effect of gene conversion, upwardly biasing estimates of the role of conversion. Thus we report a novel method for identifying genotyping errors in nucleotide sequence data sets. We show that there is evidence for gene conversion in many large nucleotide sequence data sets including our data that have been purged of all detectable sequencing errors and in data sets from Drosophila melanogaster, D. simulans, and Zea mays. In total, 13 of 27 loci show evidence of gene conversion. For these loci, gene conversion is estimated to contribute an average of twice as much as crossing over to total recombination.
Affiliation(s)
- Peter L Morrell
- Department of Ecology and Evolutionary Biology, University of California, Irvine, California 92697, USA
|
166
|
Griffiths RC, Lessard S. Ewens' sampling formula and related formulae: combinatorial proofs, extensions to variable population size and applications to ages of alleles. Theor Popul Biol 2005; 68:167-77. [PMID: 15913688 DOI: 10.1016/j.tpb.2005.02.004] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Received: 09/30/2004] [Revised: 01/18/2005] [Accepted: 02/10/2005] [Indexed: 11/18/2022]
Abstract
Ewens' sampling formula, the probability distribution of a configuration of alleles in a sample of genes under the infinitely-many-alleles model of mutation, is proved by a direct combinatorial argument. The distribution is extended to a model where the population size may vary back in time. The distribution of age-ordered frequencies in the population is also derived in the model, extending the GEM distribution of age-ordered frequencies in a model with a constant-sized population. The genealogy of a rare allele is studied using a combinatorial approach. A connection is explored between the distribution of age-ordered frequencies and ladder indices and heights in a sequence of random variables. In a sample of n genes the connection is with ladder heights and indices in a sequence of draws from an urn containing balls labelled 1,2,...,n; and in the population the connection is with ladder heights and indices in a sequence of independent uniform random variables.
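For readers unfamiliar with the formula, the standard constant-population form of Ewens' sampling formula can be evaluated directly. The sketch below is an illustration only and does not reproduce the article's combinatorial proof or its variable-population extension. It assumes the usual parameterisation: `a[j-1]` is the number of allelic types seen exactly j times in the sample and `theta` is the scaled mutation rate; the function name is hypothetical.

```python
from math import factorial


def ewens_probability(a, theta):
    """Probability of the allelic configuration a under Ewens' sampling
    formula: n! / theta_(n) * prod_j theta^a_j / (j^a_j * a_j!),
    where theta_(n) = theta * (theta + 1) * ... * (theta + n - 1)."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    rising = 1.0
    for i in range(n):
        rising *= theta + i  # rising factorial theta_(n)
    prob = factorial(n) / rising
    for j, a_j in enumerate(a, start=1):
        prob *= theta ** a_j / (j ** a_j * factorial(a_j))
    return prob


if __name__ == "__main__":
    # Sample of two genes: either two distinct alleles or one shared allele.
    print(ewens_probability([2, 0], 1.0), ewens_probability([0, 1], 1.0))
```

As a sanity check, for a sample of two genes the probability that both carry the same allele reduces to 1 / (1 + theta).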
Affiliation(s)
- Robert C Griffiths
- Department of Statistics, University of Oxford, 1 South Parks Rd, Oxford OX1 3TG, UK.
|
167
|
Wiuf C, Brameier M, Hagberg O, Stumpf MPH. A likelihood approach to analysis of network data. Proc Natl Acad Sci U S A 2006; 103:7566-70. [PMID: 16682633 PMCID: PMC1472487 DOI: 10.1073/pnas.0600061103] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.3] [Indexed: 11/18/2022]
Abstract
Biological, sociological, and technological network data are often analyzed by using simple summary statistics, such as the observed degree distribution, and nonparametric bootstrap procedures to provide an adequate null distribution for testing hypotheses about the network. In this article we present a full-likelihood approach that allows us to estimate parameters for general models of network growth that can be expressed in terms of recursion relations. To handle larger networks we have developed an importance sampling scheme that allows us to approximate the likelihood and draw inference about the network and how it has been generated, estimate the parameters in the model, and perform parametric bootstrap analysis of network data. We illustrate the power of this approach by estimating growth parameters for the Caenorhabditis elegans protein interaction network.
Affiliation(s)
- Carsten Wiuf
- Bioinformatics Research Center, University of Aarhus, Høegh-Guldbergsgade 10, Building 1090, 8000 Aarhus C, Denmark.
|
168
|
Lawrence R, Evans DM, Morris AP, Ke X, Hunt S, Paolucci M, Ragoussis J, Deloukas P, Bentley D, Cardon LR. Genetically indistinguishable SNPs and their influence on inferring the location of disease-associated variants. Genome Res 2005; 15:1503-10. [PMID: 16251460 PMCID: PMC1310638 DOI: 10.1101/gr.4217605] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Indexed: 01/25/2023]
Abstract
As part of a recent high-density linkage disequilibrium (LD) study of chromosome 20, we obtained genotypes for approximately 30,000 SNPs at a density of 1 SNP/2 kb on four different population samples (47 CEPH founders; 91 UK unrelateds [unrelated white individuals of western European ancestry]; 97 African Americans; 42 East Asians). We observed that approximately 50% of SNPs had at least one genetically indistinguishable partner; i.e., for every individual considered, their genotype at the first locus was identical to their genotype at the second locus, or in LD terms, the SNPs were in "perfect" LD (r² = 1.0). These "genetically indistinguishable SNPs" (giSNPs) formed into clusters of varying size. The larger the cluster, the greater the tendency to be located within genes and to overlap with giSNP clusters in other population samples. As might be expected for this map density, many giSNPs were located close to one another, thus reflecting local regions of undetected recombination or haplotype blocks. However, approximately 1/3 of giSNP clusters had intermingled, non-indistinguishable SNPs with incomplete LD (D′ and r² < 1), sometimes spanning hundreds of kilobases, comprising up to 70 indistinguishable markers and overlapping multiple haplotype blocks. These long-range, nonconsecutive giSNPs have implications for disease gene localization by allelic association as evidence for association at one locus will be indistinguishable from that at another locus, even though both loci may be situated far apart. We describe the distribution of giSNPs on this map of chromosome 20 and illustrate the potential impact they can have on association mapping.
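The "perfect LD" condition (r² = 1.0) can be made concrete with a small calculation. The sketch below assumes phased haplotypes coded as 0/1 alleles, which is a simplification: the study itself compares unphased genotypes across individuals. The function name is hypothetical, and r² is computed in the standard way as D² divided by the product of the four allele frequencies.

```python
def r_squared(site_a, site_b):
    """Squared correlation between two biallelic sites, given 0/1 alleles
    for the same haplotypes in the same order (assumed phased data)."""
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(1 for x, y in zip(site_a, site_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic site")
    return d * d / denom


if __name__ == "__main__":
    # Two sites carrying identical alleles on every haplotype.
    print(r_squared([0, 1, 1, 0], [0, 1, 1, 0]))
```

Two sites with r² = 1 carry exactly the same information about a sample, which is why association evidence at one cannot be separated from evidence at the other.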
Affiliation(s)
- Robert Lawrence
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
|
169
|
Aylor DL, Price EW, Carbone I. SNAP: Combine and Map modules for multilocus population genetic analysis. Bioinformatics 2006; 22:1399-401. [PMID: 16601003 DOI: 10.1093/bioinformatics/btl136] [Citation(s) in RCA: 101] [Impact Index Per Article: 5.6] [Indexed: 11/14/2022] Open
Abstract
We have added two software tools to our Suite of Nucleotide Analysis Programs (SNAP) for working with DNA sequences sampled from populations. SNAP Map collapses DNA sequence data into unique haplotypes, extracts variable sites and manipulates output into multiple formats for input into existing software packages for evolutionary analyses. Map includes novel features such as recoding insertions or deletions, including or excluding variable sites that violate an infinite-sites model and the option of collapsing sequences with corresponding phenotypic information, important in testing for significant haplotype-phenotype associations. SNAP Combine merges multiple DNA sequence alignments into a single multiple alignment file. The resulting file can be the union or intersection of the input files. SNAP Combine currently reads from and writes to several sequence alignment file formats including both sequential and interleaved formats. Combine also keeps track of the start and end positions of each separate alignment file allowing the user to exclude variable sites or taxa, important in creating input files for multilocus analyses.
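The collapsing step described above can be sketched in a few lines of Python. This is only an illustration of the general idea (find variable columns, then count distinct haplotypes over those columns) and not SNAP Map's actual implementation. The function names and the toy alignment are hypothetical, and the sketch assumes equal-length, already aligned sequences without gaps.

```python
def variable_sites(sequences):
    """Return the column indices at which the aligned sequences differ."""
    length = len(sequences[0])
    return [i for i in range(length) if len({seq[i] for seq in sequences}) > 1]


def collapse_haplotypes(sequences):
    """Reduce sequences to their variable sites and count each unique haplotype.

    Returns (sites, counts), where counts maps a haplotype string
    (characters at the variable sites only) to its number of occurrences.
    """
    sites = variable_sites(sequences)
    counts = {}
    for seq in sequences:
        haplotype = "".join(seq[i] for i in sites)
        counts[haplotype] = counts.get(haplotype, 0) + 1
    return sites, counts


# Toy alignment of four sequences (hypothetical data).
toy_alignment = ["ACGT", "ACGA", "ACGT", "TCGA"]
sites, haplotype_counts = collapse_haplotypes(toy_alignment)
```

Here columns 0 and 3 are variable, and the four sequences collapse to three haplotypes, one of which is observed twice.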
Affiliation(s)
- David L Aylor
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA
|
170
|
Bruen TC, Philippe H, Bryant D. A simple and robust statistical test for detecting the presence of recombination. Genetics 2006; 172:2665-81. [PMID: 16489234 PMCID: PMC1456386 DOI: 10.1534/genetics.105.048975] [Citation(s) in RCA: 954] [Impact Index Per Article: 53.0] [Received: 07/30/2005] [Accepted: 02/03/2006] [Indexed: 11/18/2022] Open
Abstract
Recombination is a powerful evolutionary force that merges historically distinct genotypes. But the extent of recombination within many organisms is unknown, and even determining its presence within a set of homologous sequences is a difficult question. Here we develop a new statistic, phi(w), that can be used to test for recombination. We show through simulation that our test can discriminate effectively between the presence and absence of recombination, even in diverse situations such as exponential growth (star-like topologies) and patterns of substitution rate correlation. A number of other tests, Max χ², NSS, a coalescent-based likelihood permutation test (from LDHat), and correlation of linkage disequilibrium (both r² and |D'|) with distance, all tend to underestimate the presence of recombination under strong population growth. Moreover, both Max χ² and NSS falsely infer the presence of recombination under a simple model of mutation rate correlation. Results on empirical data show that our test can be used to detect recombination between closely as well as distantly related samples, regardless of the suspected rate of recombination. The results suggest that phi(w) is one of the best approaches to distinguish recurrent mutation from recombination in a wide variety of circumstances.
Affiliation(s)
- Trevor C Bruen
- McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, Canada.
|
171
|
Song YS. A concise necessary and sufficient condition for the existence of a galled-tree. IEEE/ACM Trans Comput Biol Bioinform 2006; 3:186-91. [PMID: 17048404 DOI: 10.1109/tcbb.2006.15] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 05/12/2023]
Abstract
Galled-trees are a special class of graphical representation of evolutionary history that has proven amenable to efficient, polynomial-time algorithms. The goal of this paper is to construct a concise necessary and sufficient condition for the existence of a galled-tree for M, a set of binary sequences that purportedly have evolved in the presence of recombination. Both root-known and root-unknown cases are considered here.
Affiliation(s)
- Yun S Song
- Department of Computer Science, University of California at Davis, Davis, CA 95616, USA.
|
172
|
Abstract
We have developed a Bayesian version of our likelihood-based Markov chain Monte Carlo genealogy sampler LAMARC and compared the two versions for estimation of theta = 4N(e)mu, exponential growth rate, and recombination rate. We used simulated DNA data to assess accuracy of means and support or credibility intervals. In all cases the two methods had very similar results. Some parameter combinations led to overly narrow support or credibility intervals, excluding the truth more often than the desired percentage, for both methods. However, the Bayesian approach rejected the generative parameter values significantly less often than the likelihood approach, both in cases where the level of rejection was normal and in cases where it was too high.
Affiliation(s)
- Mary K Kuhner
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA.
|
173
|
Carvajal-Rodríguez A, Crandall KA, Posada D. Recombination estimation under complex evolutionary models with the coalescent composite-likelihood method. Mol Biol Evol 2006; 23:817-27. [PMID: 16452117 PMCID: PMC1949848 DOI: 10.1093/molbev/msj102] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.8] [Indexed: 12/18/2022] Open
Abstract
The composite-likelihood estimator (CLE) of the population recombination rate considers only sites with exactly two alleles under a finite-sites mutation model (McVean, G. A. T., P. Awadalla, and P. Fearnhead. 2002. A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160:1231-1241). While in such a model the identity of alleles is not considered, the CLE has been shown to be robust to minor misspecification of the underlying mutational model. However, there are many situations where the putative mutation and demographic history can be quite complex. One good example is rapidly evolving pathogens, like HIV-1. First we evaluated the performance of the CLE and the likelihood permutation test (LPT) under more complex, realistic models, including a general time reversible (GTR) substitution model, rate heterogeneity among sites (Gamma), positive selection, population growth, population structure, and noncontemporaneous sampling. Second, we relaxed some of the assumptions of the CLE allowing for a four-allele, GTR + Gamma model in an attempt to use the data more efficiently. Through simulations and the analysis of real data, we concluded that the CLE is robust to severe misspecifications of the substitution model, but underestimates the recombination rate in the presence of exponential growth, population mixture, selection, or noncontemporaneous sampling. In such cases, the use of more complex models slightly increases performance in some occasions, especially in the case of the LPT. Thus, our results provide for a more robust application of the estimation of recombination rates.
|
174
|
De Iorio M, de Silva E, Stumpf MP. Recombination hotspots as a point process. Philos Trans R Soc Lond B Biol Sci 2006; 360:1597-603. [PMID: 16096109 PMCID: PMC1569526 DOI: 10.1098/rstb.2005.1690] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] Open
Abstract
The variation of the recombination rate along chromosomal DNA is one of the important determinants of the patterns of linkage disequilibrium. A number of inferential methods have been developed which estimate the recombination rate and its variation from population genetic data. The majority of these methods are based on modelling the genealogical process underlying a sample of DNA sequences and thus explicitly include a model of the demographic process. Here we propose a different inferential procedure based on a previously introduced framework where recombination is modelled as a point process along a DNA sequence. The approach infers regions containing putative hotspots based on the inferred minimum number of recombination events; it thus depends only indirectly on the underlying population demography. A Poisson point process model with local rates is then used to infer patterns of recombination rate estimation in a fully Bayesian framework. We illustrate this new approach by applying it to several population genetic datasets, including a region with an experimentally confirmed recombination hotspot.
Affiliation(s)
- Maria De Iorio
- Department of Epidemiology and Public Health, Faculty of Medicine, Imperial College London, St Mary's Campus, Norfolk Place, London W2 1PG, UK
- Eric de Silva
- Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College London, London SW7 2AZ, UK
- Michael P.H Stumpf
- Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College London, London SW7 2AZ, UK
- Author for correspondence ()
|
175
|
Felsenstein J. Accuracy of Coalescent Likelihood Estimates: Do We Need More Sites, More Sequences, or More Loci? Mol Biol Evol 2005; 23:691-700. [PMID: 16364968 DOI: 10.1093/molbev/msj079] [Citation(s) in RCA: 205] [Impact Index Per Article: 10.8] [Indexed: 11/13/2022] Open
Abstract
A computer simulation study has been made of the accuracy of estimates of Theta = 4N(e)mu from a sample from a single isolated population of finite size. The accuracies turn out to be well predicted by a formula developed by Fu and Li, who used optimistic assumptions. Their formulas are restated in terms of accuracy, defined here as the reciprocal of the squared coefficient of variation. This should be proportional to sample size when the entities sampled provide independent information. Using these formulas for accuracy, the sampling strategy for estimation of Theta can be investigated. Two models for cost have been used, a cost-per-base model and a cost-per-read model. The former would lead us to prefer to have a very large number of loci, each one base long. The latter, which is more realistic, causes us to prefer to have one read per locus and an optimum sample size which declines as costs of sampling organisms increase. For realistic values, the optimum sample size is 8 or fewer individuals. This is quite close to the results obtained by Pluzhnikov and Donnelly for a cost-per-base model, evaluating other estimators of Theta. It can be understood by considering that the resources spent collecting larger samples prevent us from considering more loci. An examination of the efficiency of Watterson's estimator of Theta was also made, and it was found to be reasonably efficient when the number of mutants per generation in the sequence in the whole population is less than 2.5.
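For readers unfamiliar with the quantities mentioned, a minimal Python sketch follows of Watterson's estimator and of accuracy as defined in the abstract (the reciprocal of the squared coefficient of variation). The function names, the per-locus scaling, and the example numbers are assumptions for illustration. The estimator is written in its standard textbook form, S divided by the harmonic sum a_n, and is not code from the paper.

```python
import statistics


def watterson_theta(num_segregating_sites, sample_size):
    """Watterson's estimator: S / a_n, with a_n = sum_{i=1}^{n-1} 1/i."""
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    a_n = sum(1.0 / i for i in range(1, sample_size))
    return num_segregating_sites / a_n


def accuracy(estimates):
    """Reciprocal of the squared coefficient of variation: mean^2 / variance.

    Uses the sample variance (n - 1 denominator); that choice is an assumption.
    """
    mean = statistics.mean(estimates)
    variance = statistics.variance(estimates)
    return mean ** 2 / variance


# Hypothetical example: 10 segregating sites observed in a sample of 5 sequences.
theta_hat = watterson_theta(10, 5)

# Hypothetical replicate estimates, e.g. from repeated simulations.
replicate_accuracy = accuracy([4.0, 5.0, 6.0])
```

With these toy inputs a_5 = 25/12, so the estimate is 4.8, and the three replicate values have mean 5 and sample variance 1, giving an accuracy of 25.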
Affiliation(s)
- Joseph Felsenstein
- Department of Genome Sciences and Department of Biology, University of Washington, Seattle, USA.
|
176
|
Clarke GM, Cardon LR. Disentangling linkage disequilibrium and linkage from dense single-nucleotide polymorphism trio data. Genetics 2005; 171:2085-95. [PMID: 16118185 PMCID: PMC1456135 DOI: 10.1534/genetics.105.047431] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 06/29/2005] [Accepted: 08/02/2005] [Indexed: 11/18/2022] Open
Abstract
Parent-offspring trios are widely collected for disease gene-mapping studies and are being extensively genotyped as part of the International HapMap Project. With dense maps of markers on trios, the effects of LD and linkage can be separated, allowing estimation of recombination rates in a model-free setting. Here we define a model-free multipoint method on the basis of dense sequence polymorphism data from parent-offspring trios to estimate intermarker recombination rates. We use simulations to show that this method has up to 92% power to detect recombination hotspots of intensity 25 times background over a region of size 10 kb typed at density 1 marker per 2.5 kb and almost 100% power to detect large hotspots of intensity >125 times background over regions of size 10 kb typed with just 1 marker per 5 kb (alpha = 0.05). We found strong agreement at megabase scales between estimates from our method applied to HapMap trio data and estimates from the genetic map. At finer scales, using Centre d'Etude du Polymorphisme Humain (CEPH) pedigree data across a 10-Mb region of chromosome 20, a comparison of population recombination rate estimates obtained from our method with estimates obtained using a coalescent-based approximate-likelihood method implemented in PHASE 2.0 shows detection of the same coldspots and most hotspots: the Spearman rank correlation between the estimates from our method and those from PHASE is 0.58 (p < 2.2 × 10^-16).
Affiliation(s)
- Geraldine M Clarke
- Wellcome Trust Centre for Human Genetics, Oxford University, Roosevelt Drive, Oxford OX3 7BN, United Kingdom.
|
177
|
Abstract
Correlation of gene histories in the human genome determines the patterns of genetic variation (haplotype structure) and is crucial to understanding genetic factors in common diseases. We derive closed analytical expressions for the correlation of gene histories in established demographic models for genetic evolution and show how to extend the analysis to more realistic (but more complicated) models of demographic structure. We identify two contributions to the correlation of gene histories in divergent populations: linkage disequilibrium, and differences in the demographic history of individuals in the sample. These two factors contribute to correlations at different length scales: the former at small, and the latter at large scales. We show that recent mixing events in divergent populations limit the range of correlations and compare our findings to empirical results on the correlation of gene histories in the human genome.
Affiliation(s)
- A Eriksson
- Department of Physical Resource Theory, Chalmers and Göteborg University, Sweden
|
178
|
Clark TG, De Iorio M, Griffiths RC, Farrall M. Finding associations in dense genetic maps: a genetic algorithm approach. Hum Hered 2005; 60:97-108. [PMID: 16220001 DOI: 10.1159/000088845] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Received: 05/06/2005] [Accepted: 07/26/2005] [Indexed: 11/19/2022] Open
Abstract
Large-scale association studies hold promise for discovering the genetic basis of common human disease. These studies will consist of a large number of individuals, as well as a large number of genetic markers, such as single nucleotide polymorphisms (SNPs). The potential size of the data and the resulting model space require the development of efficient methodology to unravel associations between phenotypes and SNPs in dense genetic maps. Our approach uses a genetic algorithm (GA) to construct logic trees consisting of Boolean expressions involving strings or blocks of SNPs. These blocks or nodes of the logic trees consist of SNPs in high linkage disequilibrium (LD), that is, SNPs that are highly correlated with each other due to evolutionary processes. At each generation of our GA, a population of logic tree models is modified using selection, cross-over and mutation moves. Logic trees are selected for the next generation using a fitness function based on the marginal likelihood in a Bayesian regression framework. Mutation and cross-over moves use LD measures to propose changes to the trees, and facilitate the movement through the model space. We demonstrate our method and the flexibility of logic tree structure with variable nodal lengths on simulated data from a coalescent model, as well as data from a candidate gene study of quantitative genetic variation.
Affiliation(s)
- Taane G Clark
- Department of Epidemiology and Public Health, Imperial College, St. Mary's Campus, Norfolk Place, London W2 1PG, UK.
|
179
|
Zhu L, Bustamante CD. A composite-likelihood approach for detecting directional selection from DNA sequence data. Genetics 2005; 170:1411-21. [PMID: 15879513 PMCID: PMC1451173 DOI: 10.1534/genetics.104.035097] [Citation(s) in RCA: 77] [Impact Index Per Article: 4.1] [Received: 08/17/2004] [Accepted: 03/30/2005] [Indexed: 11/18/2022] Open
Abstract
We present a novel composite-likelihood-ratio test (CLRT) for detecting genes and genomic regions that are subject to recurrent natural selection (either positive or negative). The method uses the likelihood functions of Hartl et al. (1994) for inference in a Wright-Fisher genic selection model and corrects for nonindependence among sites by application of coalescent simulations with recombination. Here, we (1) characterize the distribution of the CLRT statistic (Lambda) as a function of the population recombination rate (R=4Ner); (2) explore the effects of bias in estimation of R on the size (type I error) of the CLRT; (3) explore the robustness of the model to population growth, bottlenecks, and migration; (4) explore the power of the CLRT under varying levels of mutation, selection, and recombination; (5) explore the discriminatory power of the test in distinguishing negative selection from population growth; and (6) evaluate the performance of maximum composite-likelihood estimation (MCLE) of the selection coefficient. We find that the test has excellent power to detect weak negative selection and moderate power to detect positive selection. Moreover, the test is quite robust to bias in the estimate of local recombination rate, but not to certain demographic scenarios such as population growth or a recent bottleneck. Last, we demonstrate that the MCLE of the selection parameter has little bias for weak negative selection and has downward bias for positively selected mutations.
Affiliation(s)
- Carlos D. Bustamante
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853
|
180
|
Gasbarra D, Sillanpää MJ, Arjas E. Backward simulation of ancestors of sampled individuals. Theor Popul Biol 2005; 67:75-83. [PMID: 15713321 DOI: 10.1016/j.tpb.2004.08.003] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Received: 06/27/2003] [Indexed: 11/27/2022]
Abstract
If the population is large and the sampling mechanism is random, the coalescent is commonly used to model the haplotypes in the sample. Ordered genotypes can then be formed by random matching of the derived haplotypes. However, this approach is not realistic when (1) there is departure from random mating (e.g., dominant individuals in breeding populations or monogamy in humans), or (2) the population is small and/or the individuals in the sample are ascertained by applying some particular non-random sampling scheme, as is usually the case when considering the statistical modeling and analysis of pedigree data. For such situations, we present here a data generation method where an ancestral graph with non-overlapping generations is first generated backwards in time, using ideas from coalescent theory. Alleles are randomly assigned to the founders, and subsequently the gene flow over the entire genome is simulated forwards in time by dropping alleles down the graph according to a recombination model without interference. The parameters controlling the mating behavior of generated individuals in the graph (degree of monogamy) can be tuned in order to match a particular demographic situation, without restriction to simple random mating. The performance of the approach is illustrated with a simulation example. The software (written in C) is freely available for research purposes at http://www.rni.helsinki.fi/~dag/.
Affiliation(s)
- Dario Gasbarra
- Department of Mathematics and Statistics, Rolf Nevanlinna Institute, University of Helsinki, P.O. Box 68, FIN-00014 Helsinki, Finland
|
181
|
Smith NGC, Fearnhead P. A comparison of three estimators of the population-scaled recombination rate: accuracy and robustness. Genetics 2005; 171:2051-62. [PMID: 15956675 PMCID: PMC1456127 DOI: 10.1534/genetics.104.036293] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Indexed: 02/05/2023] Open
Abstract
We have performed simulations to assess the performance of three population genetics approximate-likelihood methods in estimating the population-scaled recombination rate from sequence data. We measured performance in two ways: accuracy when the sequence data were simulated according to the (simplistic) standard model underlying the methods and robustness to violations of many different aspects of the standard model. Although we found some differences between the methods, performance tended to be similar for all three methods. Despite the fact that the methods are not robust to violations of the underlying model, our simulations indicate that patterns of relative recombination rates should be inferred reasonably well even if the standard model does not hold. In addition, we assess various techniques for improving the performance of approximate-likelihood methods. In particular we find that the composite-likelihood method of Hudson (2001) can be improved by including log-likelihood contributions only for pairs of sites that are separated by some prespecified distance.
Affiliation(s)
- Nick G C Smith
- Department of Mathematics and Statistics, Lancaster University, Lancaster LA1 4YF, United Kingdom
|
182
|
Jensen JD, Kim Y, DuMont VB, Aquadro CF, Bustamante CD. Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics 2005; 170:1401-10. [PMID: 15911584 PMCID: PMC1451184 DOI: 10.1534/genetics.104.038224] [Citation(s) in RCA: 185] [Impact Index Per Article: 9.7] [Indexed: 11/18/2022] Open
Abstract
In 2002 Kim and Stephan proposed a promising composite-likelihood method for localizing and estimating the fitness advantage of a recently fixed beneficial mutation. Here, we demonstrate that their composite-likelihood-ratio (CLR) test comparing selective and neutral hypotheses is not robust to undetected population structure or a recent bottleneck, with some parameter combinations resulting in a false positive rate of nearly 90%. We also propose a goodness-of-fit test for discriminating rejections due to directional selection (true positive) from those due to population and demographic forces (false positives) and demonstrate that the new method has high sensitivity to differentiate the two classes of rejections.
Affiliation(s)
- Jeffrey D. Jensen
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853
- Yuseob Kim
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853
- Vanessa Bauer DuMont
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853
- Charles F. Aquadro
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853
- Corresponding author: Department of Molecular Biology and Genetics, 235 Biotechnology Bldg., Cornell University, Ithaca, NY 14850. E-mail:
- Carlos D. Bustamante
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853
|
183
|
Anderson EC. An efficient Monte Carlo method for estimating Ne from temporally spaced samples using a coalescent-based likelihood. Genetics 2005; 170:955-67. [PMID: 15834143 PMCID: PMC1450415 DOI: 10.1534/genetics.104.038349] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.7] [Indexed: 11/18/2022] Open
Abstract
This article presents an efficient importance-sampling method for computing the likelihood of the effective size of a population under the coalescent model of Berthier et al. Previous computational approaches, using Markov chain Monte Carlo, required many minutes to several hours to analyze small data sets. The approach presented here is orders of magnitude faster and can provide an approximation to the likelihood curve, even for large data sets, in a matter of seconds. Additionally, confidence intervals on the estimated likelihood curve provide a useful estimate of the Monte Carlo error. Simulations show the importance sampling to be stable across a wide range of scenarios and show that the N(e) estimator itself performs well. Further simulations show that the 95% confidence intervals around the N(e) estimate are accurate. User-friendly software implementing the algorithm for Mac, Windows, and Unix/Linux is available for download. Applications of this computational framework to other problems are discussed.
Affiliation(s)
- Eric C Anderson
- Southwest Fisheries Science Center, National Marine Fisheries Service, Santa Cruz, California 95060, USA.
|
184
|
Abstract
Haplotypes have played a major role in the study of highly-penetrant single-gene disorders, and recent evidence that the human genome has hot-spots and cold-spots for recombination has suggested that haplotype-based methods may play a key role in the study of common complex traits. This report reviews the motivation for using haplotypes for the study of the genetic basis of human traits, ranging from biologic function, to statistical power advantages of haplotypes, to linkage disequilibrium fine-mapping. Recent developments of regression models for haplotype analyses are reviewed, offering a synthesis of current methods, as well as their limitations and areas that require further research. Regression models provide significant advantages: non-genetic covariates can be controlled for, the effects of the haplotypes can be modeled, step-wise selection can be used to screen for a subset of markers that explain most of the association, haplotype x environment interactions can be evaluated, and regression diagnostics are well developed. Despite these strengths, the current regression methods tend to lack the sophisticated population genetic perspectives offered by coalescent and other similar approaches. Future work that links regression methods with population genetic models may prove beneficial.
Affiliation(s)
- Daniel J Schaid
- Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, Minnesota 55905, USA.
|
185
|
Takebayashi N, Newbigin E, Uyenoyama MK. Maximum-likelihood estimation of rates of recombination within mating-type regions. Genetics 2005; 167:2097-109. [PMID: 15342543 PMCID: PMC1471000 DOI: 10.1534/genetics.103.021535] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022] Open
Abstract
Features common to many mating-type regions include recombination suppression over large genomic tracts and cosegregation of genes of various functions, not necessarily related to reproduction. Model systems for homomorphic self-incompatibility (SI) in flowering plants share these characteristics. We introduce a method for the exact computation of the joint probability of numbers of neutral mutations segregating at the determinant of mating type and at a linked marker locus. The underlying Markov model incorporates strong balancing selection into a two-locus coalescent. We apply the method to obtain a maximum-likelihood estimate of the rate of recombination between a marker locus, 48A, and S-RNase, the determinant of SI specificity in pistils of Nicotiana alata. Even though the sampled haplotypes show complete allelic linkage disequilibrium and recombinants have never been detected, a highly significant deficiency of synonymous substitutions at 48A compared to S-RNase suggests a history of recombination. Our maximum-likelihood estimate indicates a rate of recombination of perhaps 3 orders of magnitude greater than the rate of synonymous mutation. This approach may facilitate the construction of genetic maps of regions tightly linked to targets of strong balancing selection.
Affiliation(s)
- Naoki Takebayashi
- Department of Biology, Duke University, Durham, North Carolina 27708-0338, USA
|
186
|
Fearnhead P, Harding RM, Schneider JA, Myers S, Donnelly P. Application of coalescent methods to reveal fine-scale rate variation and recombination hotspots. Genetics 2005; 167:2067-81. [PMID: 15342541 PMCID: PMC1470991 DOI: 10.1534/genetics.103.021584] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.8] [Indexed: 11/18/2022] Open
Abstract
There has been considerable recent interest in understanding the way in which recombination rates vary over small physical distances, and the extent of recombination hotspots, in various genomes. Here we adapt, apply, and assess the power of recently developed coalescent-based approaches to estimating recombination rates from sequence polymorphism data. We apply full-likelihood estimation to study rate variation in and around a well-characterized recombination hotspot in humans, in the beta-globin gene cluster, and show that it provides similar estimates, consistent with those from sperm studies, from two populations deliberately chosen to have different demographic and selectional histories. We also demonstrate how approximate-likelihood methods can be used to detect local recombination hotspots from genomic-scale SNP data. In a simulation study based on 80 100-kb regions, these methods detect 43 out of 60 hotspots (ranging from 1 to 2 kb in size), with only two false positives out of 2000 subregions that were tested for the presence of a hotspot. Our study suggests that new computational tools for sophisticated analysis of population diversity data are valuable for hotspot detection and fine-scale mapping of local recombination rates.
Affiliation(s)
- Paul Fearnhead
- Department of Mathematics and Statistics, Lancaster University, Lancaster LA1 4YF, United Kingdom
|
187
|
Abstract
We introduce a new method for jointly estimating crossing-over and gene conversion rates using sequence polymorphism data. The method calculates probabilities for subsets of the data consisting of three segregating sites and then forms a composite likelihood by multiplying together the probabilities of many subsets. Simulations show that this new method performs better than previously proposed methods for estimating gene conversion rates, but that all methods require large amounts of data to provide reliable estimates. While existing methods can easily estimate an "average" gene conversion rate over many loci, they cannot reliably estimate gene conversion rates for a single region of the genome.
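The composite-likelihood construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `triplet_prob` (the probability of the data at three segregating sites given the two rates) is a hypothetical placeholder that the caller must supply, and using every triple of sites rather than a chosen subset is an assumption made here for simplicity.

```python
import math
from itertools import combinations


def composite_log_likelihood(sites, triplet_prob, crossover_rate, conversion_rate):
    """Sum of log-probabilities over all three-site subsets.

    Multiplying the subset probabilities is done on the log scale to
    avoid numerical underflow. `triplet_prob` is a hypothetical callable
    taking (triple, crossover_rate, conversion_rate).
    """
    total = 0.0
    for triple in combinations(sites, 3):
        total += math.log(triplet_prob(triple, crossover_rate, conversion_rate))
    return total


def grid_maximize(sites, triplet_prob, crossover_grid, conversion_grid):
    """Return the (crossover, conversion) pair on the grid with the highest
    composite log-likelihood."""
    candidates = [(r, g) for r in crossover_grid for g in conversion_grid]
    return max(
        candidates,
        key=lambda pair: composite_log_likelihood(sites, triplet_prob, *pair),
    )
```

Because the three-site subsets overlap, the product is not a true likelihood, so the usual likelihood-based confidence intervals would not apply directly to an estimate obtained this way.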
Affiliation(s)
- Jeffrey D Wall
- Program in Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089, USA.
|
188
|
Lemey P, Pybus OG, Rambaut A, Drummond AJ, Robertson DL, Roques P, Worobey M, Vandamme AM. The molecular population genetics of HIV-1 group O. Genetics 2005; 167:1059-68. [PMID: 15280223 PMCID: PMC1470933 DOI: 10.1534/genetics.104.026666] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.3] [Indexed: 11/18/2022]
Abstract
HIV-1 group O originated through cross-species transmission of SIV from chimpanzees to humans and has established a relatively low prevalence in Central Africa. Here, we infer the population genetics and epidemic history of HIV-1 group O from viral gene sequence data and evaluate the effect of variable evolutionary rates and recombination on our estimates. First, model selection tools were used to specify suitable evolutionary and coalescent models for HIV group O. Second, divergence times and population genetic parameters were estimated in a Bayesian framework using Markov chain Monte Carlo sampling, under both strict and relaxed molecular clock methods. Our results date the origin of the group O radiation to around 1920 (1890-1940), a time frame similar to that estimated for HIV-1 group M. However, group O infections, which remain almost wholly restricted to Cameroon, show a slower rate of exponential growth during the twentieth century, explaining their lower current prevalence. To explore the effect of recombination, the Bayesian framework is extended to incorporate multiple unlinked loci. Although recombination can bias estimates of the time to the most recent common ancestor, this effect does not appear to be important for HIV-1 group O. In addition, we show that evolutionary rate estimates for different HIV genes accurately reflect differential selective constraints along the HIV genome.
Affiliation(s)
- Philippe Lemey
- Rega Institute for Medical Research, KULeuven, B-3000 Leuven, Belgium.
|
189
|
Coop G, Griffiths RC. Ancestral inference on gene trees under selection. Theor Popul Biol 2005; 66:219-32. [PMID: 15465123 DOI: 10.1016/j.tpb.2004.06.006] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.3] [Received: 01/26/2004] [Indexed: 10/26/2022]
Abstract
The extent to which natural selection shapes diversity within populations is a key question for population genetics. Thus, there is considerable interest in quantifying the strength of selection. A full likelihood approach for inference about selection at a single site within an otherwise neutral fully linked sequence of sites is described here. A coalescent model of evolution is used to model the ancestry of a sample of DNA sequences which have the selected site segregating. The mutation model, for the selected and neutral sites, is the infinitely many-sites model where there is no back or parallel mutation at sites. A unique perfect phylogeny, a gene tree, can be constructed from the configuration of mutations on the sample sequences under this model of mutation. The approach is general and can be used for any bi-allelic selection scheme. Selection is incorporated through modelling the frequency of the selected and neutral allelic classes stochastically back in time, then using a subdivided population model considering the population frequencies through time as variable population sizes. An importance sampling algorithm is then used to explore over coalescent tree space consistent with the data. The method is applied to a simulated data set and the gene tree presented in Verrelli et al. (2002).
Affiliation(s)
- Graham Coop
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK.
|
190
|
Innan H, Zhang K, Marjoram P, Tavaré S, Rosenberg NA. Statistical tests of the coalescent model based on the haplotype frequency distribution and the number of segregating sites. Genetics 2005; 169:1763-77. [PMID: 15654103 PMCID: PMC1449527 DOI: 10.1534/genetics.104.032219] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.1] [Indexed: 11/18/2022]
Abstract
Several tests of neutral evolution employ the observed number of segregating sites and properties of the haplotype frequency distribution as summary statistics and use simulations to obtain rejection probabilities. Here we develop a "haplotype configuration test" of neutrality (HCT) based on the full haplotype frequency distribution. To enable exact computation of rejection probabilities for small samples, we derive a recursion under the standard coalescent model for the joint distribution of the haplotype frequencies and the number of segregating sites. For larger samples, we consider simulation-based approaches. The utility of the HCT is demonstrated in simulations of alternative models and in application to data from Drosophila melanogaster.
Affiliation(s)
- Hideki Innan
- Human Genetics Center, School of Public Health, University of Texas Health Science Center, Houston, 77030, USA
|
191
|
Carbone I, Liu YC, Hillman BI, Milgroom MG. Recombination and migration of Cryphonectria hypovirus 1 as inferred from gene genealogies and the coalescent. Genetics 2005; 166:1611-29. [PMID: 15126384 PMCID: PMC1470819 DOI: 10.1534/genetics.166.4.1611] [Citation(s) in RCA: 79] [Impact Index Per Article: 4.2] [Indexed: 11/18/2022]
Abstract
Genealogy-based methods were used to estimate migration of the fungal virus Cryphonectria hypovirus 1 between vegetative compatibility types of the host fungus, Cryphonectria parasitica, as a means of estimating horizontal transmission within two host populations. Vegetative incompatibility is a self/non-self recognition system that inhibits virus transmission under laboratory conditions but its effect on transmission in nature has not been clearly demonstrated. Recombination within and among different loci in the virus genome restricted the genealogical analyses to haplotypes with common mutation and recombinational histories. The existence of recombination necessitated that we also use genealogical approaches that can take advantage of both the mutation and recombinational histories of the sample. Virus migration between populations was significantly restricted. In contrast, estimates of migration between vegetative compatibility types were relatively high within populations despite previous evidence that transmission in the laboratory was restricted. The discordance between laboratory estimates and migration estimates from natural populations highlights the challenges in estimating pathogen transmission rates. Genealogical analyses inferred migration patterns throughout the entire coalescent history of one viral region in natural populations and not just recent patterns of migration or laboratory transmission. This application of genealogical analyses provides markedly stronger inferences on overall transmission rates than laboratory estimates do.
Affiliation(s)
- Ignazio Carbone
- Department of Plant Pathology, North Carolina State University, Raleigh, North Carolina 27695, USA
|
192
|
Improved Recombination Lower Bounds for Haplotype Data. Lecture Notes in Computer Science 2005. [DOI: 10.1007/11415770_43] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 01/24/2023]
|
193
|
Beckmann L, Thomas DC, Fischer C, Chang-Claude J. Haplotype Sharing Analysis Using Mantel Statistics. Hum Hered 2005; 59:67-78. [PMID: 15838176 DOI: 10.1159/000085221] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.6] [Received: 07/22/2004] [Accepted: 11/15/2004] [Indexed: 12/29/2022]
Abstract
OBJECTIVE: The potential value of haplotypes has attracted widespread interest in the mapping of complex traits. Haplotype sharing methods take the linkage disequilibrium information between multiple markers into account, and may have good power to detect predisposing genes. We present a new approach based on Mantel statistics for space-time clustering, which is developed in order to improve the power of haplotype sharing analysis for gene mapping in complex disease. METHODS: The new statistic correlates genetic similarity and phenotypic similarity across pairs of haplotypes for case-only and case-control studies. The genetic similarity is measured as the shared length between haplotypes around a putative disease locus. The phenotypic similarity is measured as the mean-corrected cross-product based on the respective phenotypes. We analyzed two tests for statistical significance with respect to type I error: (1) assuming asymptotic normality, and (2) using a Monte Carlo permutation procedure. The results were compared to the chi-square test for association based on 3-marker haplotypes. RESULTS: The results of the type I error rates for the Mantel statistics using the permutational procedure yielded pointwise valid tests. The approach based on the assumption of asymptotic normality was seriously liberal. CONCLUSION: Power comparisons showed that the Mantel statistics were better than or equal to the chi-square test for all simulated disease models.
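A Mantel-type statistic of the kind described in the abstract, with a Monte Carlo permutation test, might be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: haplotypes are taken to be equal-length strings of marker alleles, shared length is counted in markers rather than physical or genetic distance, and all function names are hypothetical.

```python
import random


def shared_length(h1, h2, locus):
    """Number of contiguous markers around `locus` at which two haplotypes agree.

    Assumption: length is measured in marker counts. Returns 0 if the
    haplotypes differ at the locus itself.
    """
    if h1[locus] != h2[locus]:
        return 0
    left = locus
    while left > 0 and h1[left - 1] == h2[left - 1]:
        left -= 1
    right = locus
    while right < len(h1) - 1 and h1[right + 1] == h2[right + 1]:
        right += 1
    return right - left + 1


def mantel_statistic(haplotypes, phenotypes, locus):
    """Sum over haplotype pairs of genetic similarity times phenotypic similarity.

    Phenotypic similarity is the mean-corrected cross-product.
    """
    mean = sum(phenotypes) / len(phenotypes)
    total = 0.0
    for i in range(len(haplotypes)):
        for j in range(i + 1, len(haplotypes)):
            genetic = shared_length(haplotypes[i], haplotypes[j], locus)
            phenotypic = (phenotypes[i] - mean) * (phenotypes[j] - mean)
            total += genetic * phenotypic
    return total


def permutation_p_value(haplotypes, phenotypes, locus, n_perm=999, seed=0):
    """One-sided Monte Carlo p-value obtained by shuffling phenotype labels."""
    rng = random.Random(seed)
    observed = mantel_statistic(haplotypes, phenotypes, locus)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mantel_statistic(haplotypes, shuffled, locus) >= observed:
            hits += 1
    # Counting the observed arrangement keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

The permutation route avoids relying on asymptotic normality, which the abstract reports to be seriously liberal for this statistic.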
Affiliation(s)
- L Beckmann
- German Cancer Research Center DKFZ, DE-69120 Heidelberg, Germany
|
194
|
Abstract
We outline a general coalescent framework for using genotype data in linkage disequilibrium-based mapping studies. Our approach unifies two main goals of gene mapping that have generally been treated separately in the past: detecting association (i.e., significance testing) and estimating the location of the causative variation. To tackle the problem, we separate the inference into two stages. First, we use Markov chain Monte Carlo to sample from the posterior distribution of coalescent genealogies of all the sampled chromosomes without regard to phenotype. Then, averaging across genealogies, we estimate the likelihood of the phenotype data under various models for mutation and penetrance at an unobserved disease locus. The essential signal that these models look for is that in the presence of disease susceptibility variants in a region, there is nonrandom clustering of the chromosomes on the tree according to phenotype. The extent of nonrandom clustering is captured by the likelihood and can be used to construct significance tests or Bayesian posterior distributions for location. A novelty of our framework is that it can naturally accommodate quantitative data. We describe applications of the method to simulated data and to data from a Mendelian locus (CFTR, responsible for cystic fibrosis) and from a proposed complex trait locus (calpain-10, implicated in type 2 diabetes).
Affiliation(s)
- Sebastian Zöllner
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA.
|
195
|
Linder CR, Rieseberg LH. Reconstructing patterns of reticulate evolution in plants. American Journal of Botany 2004; 91:1700-8. [PMID: 21652318 DOI: 10.3732/ajb.91.10.1700] [Citation(s) in RCA: 222] [Impact Index Per Article: 11.1] [Indexed: 05/03/2023]
Abstract
Until recently, rigorously reconstructing the many hybrid speciation events in plants has not been practical because of the limited number of molecular markers available for plant phylogenetic reconstruction and the lack of good, biologically based methods for inferring reticulation (network) events. This situation should change rapidly with the development of multiple nuclear markers for phylogenetic reconstruction and new methods for reconstructing reticulate evolution. These developments will necessitate a much greater incorporation of population genetics into phylogenetic reconstruction than has been common. Population genetic events such as gene duplication coupled with lineage sorting and meiotic and sexual recombination have always had the potential to affect phylogenetic inference. For tree reconstruction, these problems are usually minimized by using uniparental markers and nuclear markers that undergo rapid concerted evolution. Because reconstruction of reticulate speciation events will require nuclear markers that lack these characteristics, effects of population genetics on phylogenetic inference will need to be addressed directly. Current models and methods that allow hybrid speciation to be detected and reconstructed are discussed, with a focus on how lineage sorting and meiotic and sexual recombination affect network reconstruction. Approaches that would allow inference of phylogenetic networks in their presence are suggested.
Affiliation(s)
- C Randal Linder
- Section of Integrative Biology and the Center for Computational Biology and Bioinformatics, University of Texas-Austin, 1 University Station-A6700, Austin, Texas 78712 USA
|
196
|
Graham J, McNeney B, Seillier-Moiseiwitsch F. Stepwise detection of recombination breakpoints in sequence alignments. Bioinformatics 2004; 21:589-95. [PMID: 15388518 DOI: 10.1093/bioinformatics/bti040] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION: We propose a stepwise approach to identify recombination breakpoints in a sequence alignment. The approach can be applied to any recombination detection method that uses a permutation test and provides estimates of breakpoints. RESULTS: We illustrate the approach by analyses of a simulated dataset and alignments of real data from HIV-1 and human chromosome 7. The presented simulation results compare the statistical properties of one-step and two-step procedures. More breakpoints are found with a two-step procedure than with a single application of a given method, particularly for higher recombination rates. At higher recombination rates, the additional breakpoints were located at the cost of only a slight increase in the number of falsely declared breakpoints. However, a large proportion of breakpoints still go undetected. AVAILABILITY: A makefile and C source code for phylogenetic profiling and the maximum chi-square method, tested with the gcc compiler on Linux and Windows XP, are available at http://stat-db.stat.sfu.ca/stepwise/ CONTACT: jgraham@stat.sfu.ca.
Affiliation(s)
- Jinko Graham
- Department of Statistics and Actuarial Science, Simon Fraser University Burnaby, Canada V5A 1S6.
|
197
|
Price EW, Carbone I. SNAP: workbench management tool for evolutionary population genetic analysis. Bioinformatics 2004; 21:402-4. [PMID: 15353448 DOI: 10.1093/bioinformatics/bti003] [Citation(s) in RCA: 114] [Impact Index Per Article: 5.7] [Indexed: 11/14/2022]
Abstract
The reconstruction of population processes from DNA sequence variation requires the coordinated implementation of several coalescent-based methods, each bound by specific assumptions and limitations. In practice, the application of these coalescent-based methods for parameter estimation is difficult because they make strict assumptions that must be verified a priori and their parameter-rich nature makes the estimation of all model parameters very complex and computationally intensive. A further complication is their distribution as console applications that require the user to navigate through console menus or specify complex command-line arguments. To facilitate the implementation of these coalescent-based tools we developed SNAP Workbench, a Java program that manages and coordinates a series of programs. The workbench enhances population parameter estimation by ensuring that the assumptions and program limitations of each method are met and by providing a step-by-step methodology for examining population processes that integrates both summary-statistic methods and coalescent-based population genetic models. AVAILABILITY: SNAP Workbench is freely available at http://snap.cifr.ncsu.edu. The workbench and tools can be downloaded for Mac, Windows and Unix operating systems. Each package includes installation instructions, program documentation and a sample dataset. SUPPLEMENTARY INFORMATION: A description of system requirements and installation instructions can be found at http://snap.cifr.ncsu.edu.
Affiliation(s)
- Eric W Price
- Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC 27695, USA
|
198
|
Morris AP, Whittaker JC, Balding DJ. Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data. Am J Hum Genet 2004; 74:945-53. [PMID: 15077198 PMCID: PMC1181987 DOI: 10.1086/420773] [Citation(s) in RCA: 57] [Impact Index Per Article: 2.9] [Received: 01/05/2004] [Accepted: 02/12/2004] [Indexed: 11/03/2022]
Abstract
We present the results of a simulation study that indicate that true haplotypes at multiple, tightly linked loci often provide little extra information for linkage-disequilibrium fine mapping, compared with the information provided by corresponding genotypes, provided that an appropriate statistical analysis method is used. In contrast, a two-stage approach to analyzing genotype data, in which haplotypes are inferred and then analyzed as if they were true haplotypes, can lead to a substantial loss of information. The study uses our COLDMAP software for fine mapping, which implements a Markov chain Monte Carlo algorithm that is based on the shattered coalescent model of genetic heterogeneity at a disease locus. We applied COLDMAP to 100 replicate data sets simulated under each of 18 disease models. Each data set consists of haplotype pairs (diplotypes) for 20 SNPs typed at equal 50-kb intervals in a 950-kb candidate region that includes a single disease locus located at random. The data sets were analyzed in three formats: (1) as true haplotypes; (2) as haplotypes inferred from genotypes using an expectation-maximization algorithm; and (3) as unphased genotypes. On average, true haplotypes gave a 6% gain in efficiency compared with the unphased genotypes, whereas inferring haplotypes from genotypes led to a 20% loss of efficiency, where efficiency is defined in terms of root mean integrated square error of the location of the disease locus. Furthermore, treating inferred haplotypes as if they were true haplotypes leads to considerable overconfidence in estimates, with nominal 50% credibility intervals achieving, on average, only 19% coverage. We conclude that (1) given appropriate statistical analyses, the costs of directly measuring haplotypes will rarely be justified by a gain in the efficiency of fine mapping and that (2) a two-stage approach of inferring haplotypes followed by a haplotype-based analysis can be very inefficient for fine mapping, compared with an analysis based directly on the genotypes.
Affiliation(s)
- A P Morris
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, United Kingdom.
|
199
|
Bafna V, Bansal V. The number of recombination events in a sample history: conflict graph and lower bounds. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2004; 1:78-90. [PMID: 17048383 DOI: 10.1109/tcbb.2004.23] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 05/12/2023]
Abstract
We consider the following problem: Given a set of binary sequences, determine lower bounds on the minimum number of recombinations required to explain the history of the sample, under the infinite-sites model of mutation. The problem has implications for finding recombination hotspots and for the Ancestral Recombination Graph reconstruction problem. Hudson and Kaplan gave a lower bound based on the four-gamete test. In practice, their bound Rm often greatly underestimates the minimum number of recombinations. The problem was recently revisited by Myers and Griffiths, who introduced two new lower bounds Rh and Rs which are provably better, and also yield good bounds in practice. However, the worst-case complexities of their procedures for computing Rh and Rs are exponential and super-exponential, respectively. In this paper, we show that the number of nontrivial connected components, Rc, in the conflict graph for a given set of sequences, computable in time O(nm^2), is also a lower bound on the minimum number of recombination events. We show that in many cases, Rc is a better bound than Rh. The conflict graph was used by Gusfield et al. to obtain a polynomial time algorithm for the galled tree problem, which is a special case of the Ancestral Recombination Graph (ARG) reconstruction problem. Our results also offer some insight into the structural properties of this graph and are of interest for the general Ancestral Recombination Graph reconstruction problem.
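The conflict-graph bound described in the abstract can be illustrated with a short sketch. It assumes the ancestral state is unknown, so that two sites conflict when they fail the four-gamete test (all of 00, 01, 10, and 11 are present); the function names are hypothetical, and this is not the authors' code. Sites are vertices, conflicting pairs are edges, and the bound is the number of connected components that contain at least one edge.

```python
from itertools import combinations


def sites_conflict(col_a, col_b):
    """Four-gamete test: True if all four allele pairs occur at the two sites."""
    return len(set(zip(col_a, col_b))) == 4


def conflict_graph_bound(haplotypes):
    """Count nontrivial connected components of the conflict graph.

    `haplotypes` is a list of equal-length binary strings (n sequences,
    m sites). Checking every pair of sites costs O(n) per pair, giving
    O(n * m^2) overall.
    """
    n_sites = len(haplotypes[0])
    columns = [tuple(h[j] for h in haplotypes) for j in range(n_sites)]

    # Union-find over sites, with path halving.
    parent = list(range(n_sites))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    has_edge = [False] * n_sites
    for i, j in combinations(range(n_sites), 2):
        if sites_conflict(columns[i], columns[j]):
            has_edge[i] = has_edge[j] = True
            parent[find(i)] = find(j)

    # A component is nontrivial if it contains at least one conflicting pair.
    return len({find(i) for i in range(n_sites) if has_edge[i]})
```

For example, the four haplotypes `00`, `01`, `10`, `11` contain all four gametes at their two sites, so the conflict graph has a single edge and the function returns 1.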
Affiliation(s)
- Vineet Bafna
- Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093-0114, USA.
|
200
|
Carbone I, Liu YC, Hillman BI, Milgroom MG. Recombination and Migration of Cryphonectria hypovirus 1 as Inferred From Gene Genealogies and the Coalescent. Genetics 2004. [DOI: 10.1093/genetics/166.4.1611] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
Genealogy-based methods were used to estimate migration of the fungal virus Cryphonectria hypovirus 1 between vegetative compatibility types of the host fungus, Cryphonectria parasitica, as a means of estimating horizontal transmission within two host populations. Vegetative incompatibility is a self/non-self recognition system that inhibits virus transmission under laboratory conditions but its effect on transmission in nature has not been clearly demonstrated. Recombination within and among different loci in the virus genome restricted the genealogical analyses to haplotypes with common mutation and recombinational histories. The existence of recombination necessitated that we also use genealogical approaches that can take advantage of both the mutation and recombinational histories of the sample. Virus migration between populations was significantly restricted. In contrast, estimates of migration between vegetative compatibility types were relatively high within populations despite previous evidence that transmission in the laboratory was restricted. The discordance between laboratory estimates and migration estimates from natural populations highlights the challenges in estimating pathogen transmission rates. Genealogical analyses inferred migration patterns throughout the entire coalescent history of one viral region in natural populations and not just recent patterns of migration or laboratory transmission. This application of genealogical analyses provides markedly stronger inferences on overall transmission rates than laboratory estimates do.
Affiliation(s)
- Ignazio Carbone
- Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, North Carolina 27695
- Yir-Chung Liu
- Department of Plant Pathology, Cornell University, Ithaca, New York 14853
- Bradley I Hillman
- Department of Plant Pathology, Rutgers University, New Brunswick, New Jersey 08901
- Michael G Milgroom
- Department of Plant Pathology, Cornell University, Ithaca, New York 14853
|