1
Williams TA, Cox CJ, Foster PG, Szöllősi GJ, Embley TM. Phylogenomics provides robust support for a two-domains tree of life. Nat Ecol Evol 2020; 4:138-147. [PMID: 31819234 PMCID: PMC6942926 DOI: 10.1038/s41559-019-1040-x] [Citation(s) in RCA: 125] [Impact Index Per Article: 31.3] [Received: 07/23/2019] [Accepted: 10/15/2019] [Indexed: 11/09/2022]
Abstract
Hypotheses about the origin of eukaryotic cells are classically framed within the context of a universal 'tree of life' based on conserved core genes. Vigorous ongoing debate about eukaryote origins is based on assertions that the topology of the tree of life depends on the taxa included and the choice and quality of genomic data analysed. Here we have reanalysed the evidence underpinning those claims and apply more data to the question by using supertree and coalescent methods to interrogate >3,000 gene families in archaea and eukaryotes. We find that eukaryotes consistently originate from within the archaea in a two-domains tree when due consideration is given to the fit between model and data. Our analyses support a close relationship between eukaryotes and Asgard archaea and identify the Heimdallarchaeota as the current best candidate for the closest archaeal relatives of the eukaryotic nuclear lineage.
Affiliation(s)
- Tom A Williams
  - School of Biological Sciences, University of Bristol, Bristol, UK
- Cymon J Cox
  - Centro de Ciências do Mar, Universidade do Algarve, Faro, Portugal
- Peter G Foster
  - Department of Life Sciences, Natural History Museum, London, UK
- Gergely J Szöllősi
  - MTA-ELTE "Lendület" Evolutionary Genomics Research Group, Budapest, Hungary
  - Department of Biological Physics, Eötvös Loránd University, Budapest, Hungary
  - Evolutionary Systems Research Group, Centre for Ecological Research, Hungarian Academy of Sciences, Tihany, Hungary
- T Martin Embley
  - Institute for Cell and Molecular Biosciences, University of Newcastle, Newcastle upon Tyne, UK
2
Weber CC, Whelan S. Physicochemical Amino Acid Properties Better Describe Substitution Rates in Large Populations. Mol Biol Evol 2019; 36:679-690. [PMID: 30668757 DOI: 10.1093/molbev/msz003] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 12/12/2022]
Abstract
Substitutions between chemically distant amino acids are known to occur less frequently than those between more similar amino acids. This knowledge, however, is not reflected in most codon substitution models, which treat all nonsynonymous changes as if they were equivalent in terms of impact on the protein. A variety of methods for integrating chemical distances into models have been proposed, with a common approach being to divide substitutions into radical or conservative categories. Nevertheless, it remains unclear whether the resulting models describe sequence evolution better than their simpler counterparts. We propose a parametric codon model that distinguishes between radical and conservative substitutions, allowing us to assess if radical substitutions are preferentially removed by selection. Applying our new model to a range of phylogenomic data, we find differentiating between radical and conservative substitutions provides significantly better fit for large populations, but see no equivalent improvement for smaller populations. Comparing codon and amino acid models using these same data shows that alignments from large populations tend to select phylogenetic models containing information about amino acid exchangeabilities, whereas the structure of the genetic code is more important for smaller populations. Our results suggest selection against radical substitutions is, on average, more pronounced in large populations than smaller ones. The reduced observable effect of selection in smaller populations may be due to stronger genetic drift making it more challenging to detect preferences. Our results imply an important connection between the life history of a phylogenetic group and the model that best describes its evolution.
Affiliation(s)
- Claudia C Weber
  - Center for Computational Genetics and Genomics, Department of Biology, Temple University, Philadelphia, PA
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Simon Whelan
  - Evolutionary Biology Center, Uppsala University, Uppsala, Sweden
3
Beaulieu JM, O’Meara BC, Zaretzki R, Landerer C, Chai J, Gilchrist MA. Population Genetics Based Phylogenetics Under Stabilizing Selection for an Optimal Amino Acid Sequence: A Nested Modeling Approach. Mol Biol Evol 2019; 36:834-851. [PMID: 30521036 PMCID: PMC6445302 DOI: 10.1093/molbev/msy222] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/14/2022]
Abstract
We present a new phylogenetic approach, selection on amino acids and codons (SelAC), whose substitution rates are based on a nested model linking protein expression to population genetics. Unlike simpler codon models that assume a single substitution matrix for all sites, our model more realistically represents the evolution of protein-coding DNA under the assumption of consistent, stabilizing selection using a cost-benefit approach. This cost-benefit approach allows us to generate a set of 20 optimal amino acid-specific matrix families using just a handful of parameters and naturally links the strength of stabilizing selection to protein synthesis levels, which we can estimate. Using a yeast data set of 100 orthologs for 6 taxa, we find SelAC fits the data much better than popular models by 10^4-10^5 Akaike information criterion units adjusted for small sample bias. Our results also indicated that nested, mechanistic models better predict observed data patterns, highlighting the improvement in biological realism in amino acid sequence evolution that our model provides. Additional parameters estimated by SelAC indicate that a large amount of nonphylogenetic, but biologically meaningful, information can be inferred from existing data. For example, SelAC prediction of gene-specific protein synthesis rates correlates well with both empirical (r = 0.33-0.48) and other theoretical predictions (r = 0.45-0.64) for multiple yeast species. SelAC also provides estimates of the optimal amino acid at each site. Finally, because SelAC is a nested approach based on clearly stated biological assumptions, future modifications, such as including shifts in the optimal amino acid sequence within or across lineages, are possible.
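The model comparison in this abstract is reported in Akaike information criterion units adjusted for small sample bias (commonly written AICc). As a minimal sketch of how such scores are generally computed, the Python snippet below evaluates AIC and AICc from a log-likelihood. The function names and the example numbers are illustrative assumptions; they are not taken from SelAC or from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def aicc(log_likelihood, n_params, n_obs):
    """AIC with the standard small-sample correction term.

    n_obs is the sample size; what counts as one observation
    (for example, one alignment site) is an assumption of the analyst.
    """
    denominator = n_obs - n_params - 1
    if denominator <= 0:
        raise ValueError("AICc requires n_obs > n_params + 1")
    correction = (2.0 * n_params * (n_params + 1)) / denominator
    return aic(log_likelihood, n_params) + correction


# Hypothetical example: two models fitted to the same 20 observations.
simple_model = aicc(log_likelihood=-100.0, n_params=3, n_obs=20)
rich_model = aicc(log_likelihood=-95.0, n_params=8, n_obs=20)
delta = rich_model - simple_model  # positive means the simpler model is preferred
```

A difference of 10^4 or more units, as reported in the abstract, is far larger than the differences usually treated as decisive in this kind of comparison.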
Affiliation(s)
- Jeremy M Beaulieu
  - Department of Biological Sciences, University of Arkansas, Fayetteville, AR
  - Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN
  - National Institute for Mathematical and Biological Synthesis, Knoxville, TN
- Brian C O’Meara
  - Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN
  - National Institute for Mathematical and Biological Synthesis, Knoxville, TN
- Cedric Landerer
  - Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN
  - National Institute for Mathematical and Biological Synthesis, Knoxville, TN
- Juanjuan Chai
  - National Institute for Mathematical and Biological Synthesis, Knoxville, TN
  - Suite 1039, White Plains, NY
- Michael A Gilchrist
  - Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN
  - National Institute for Mathematical and Biological Synthesis, Knoxville, TN
4
Goldstein RA, Pollock DD. The tangled bank of amino acids. Protein Sci 2016; 25:1354-62. [PMID: 27028523 PMCID: PMC4918418 DOI: 10.1002/pro.2930] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 01/18/2016] [Revised: 03/24/2016] [Accepted: 03/24/2016] [Indexed: 12/01/2022]
Abstract
The use of amino acid substitution matrices to model protein evolution has yielded important insights into both the evolutionary process and the properties of specific protein families. In order to make these models tractable, standard substitution matrices represent the average results of the evolutionary process rather than the underlying molecular biophysics and population genetics, treating proteins as a set of independently evolving sites rather than as an integrated biomolecular entity. With advances in computing and the increasing availability of sequence data, we now have an opportunity to move beyond current substitution matrices to more interpretable mechanistic models with greater fidelity to the evolutionary process of mutation and selection and the holistic nature of the selective constraints. As part of this endeavour, we consider how epistatic interactions induce spatial and temporal rate heterogeneity, and demonstrate how these generally ignored factors can reconcile standard substitution rate matrices and the underlying biology, allowing us to better understand the meaning of these substitution rates. Using computational simulations of protein evolution, we can demonstrate the importance of both spatial and temporal heterogeneity in modelling protein evolution.
Affiliation(s)
- Richard A Goldstein
  - Division of Infection and Immunity, University College London, London, WC1E 6BT, UK
- David D Pollock
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado, 80045
5
Kück P, Wägele JW. Plesiomorphic character states cause systematic errors in molecular phylogenetic analyses: a simulation study. Cladistics 2015; 32:461-478. [DOI: 10.1111/cla.12132] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Accepted: 06/15/2015] [Indexed: 01/17/2023]
Affiliation(s)
- Patrick Kück
  - The Natural History Museum, Cromwell Road, SW7 5BD London, UK
- J. Wolfgang Wägele
  - Zoologisches Forschungsmuseum Alexander Koenig, Adenauerallee 160, 53113 Bonn, Germany
6
Money D, Whelan S. GeLL: a generalized likelihood library for phylogenetic models. Bioinformatics 2015; 31:2391-3. [PMID: 25725494 DOI: 10.1093/bioinformatics/btv126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2014] [Accepted: 02/23/2015] [Indexed: 11/13/2022]
Abstract
Phylogenetic models are an important tool in molecular evolution allowing us to study the pattern and rate of sequence change. The recent influx of new sequence data in the biosciences means that to address evolutionary questions, we need a means for rapid and easy model development and implementation. Here we present GeLL, a Java library that lets users use text to quickly and efficiently define novel forms of discrete data and create new substitution models that describe how those data change on a phylogeny. GeLL allows users to define general substitution models and data structures in a way that is not possible in other existing libraries, including mixture models and non-reversible models. Classes are provided for calculating likelihoods, optimizing model parameters and branch lengths, ancestral reconstruction and sequence simulation.
Availability and implementation: http://phylo.bio.ku.edu/GeLL under a GPL v3 license.
Affiliation(s)
- Daniel Money
  - Department of Plant and Animal Sciences, Faculty of Agriculture, Dalhousie University, Truro, B2N 5E3, Canada
  - Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS, 66045, USA
- Simon Whelan
  - Department of Evolutionary Biology, Uppsala University, Uppsala, 75236, Sweden
7
Md Mukarram Hossain AS, Blackburne BP, Shah A, Whelan S. Evidence of Statistical Inconsistency of Phylogenetic Methods in the Presence of Multiple Sequence Alignment Uncertainty. Genome Biol Evol 2015; 7:2102-16. [PMID: 26139831 PMCID: PMC4558847 DOI: 10.1093/gbe/evv127] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 12/11/2022]
Abstract
Evolutionary studies usually use a two-step process to investigate sequence data. Step one estimates a multiple sequence alignment (MSA) and step two applies phylogenetic methods to ask evolutionary questions of that MSA. Modern phylogenetic methods infer evolutionary parameters using maximum likelihood or Bayesian inference, mediated by a probabilistic substitution model that describes sequence change over a tree. The statistical properties of these methods mean that more data directly translates to an increased confidence in downstream results, providing the substitution model is adequate and the MSA is correct. Many studies have investigated the robustness of phylogenetic methods in the presence of substitution model misspecification, but few have examined the statistical properties of those methods when the MSA is unknown. This simulation study examines the statistical properties of the complete two-step process when inferring sequence divergence and the phylogenetic tree topology. Both nucleotide and amino acid analyses are negatively affected by the alignment step, both through inaccurate guide tree estimates and through overfitting to that guide tree. For many alignment tools these effects become more pronounced when additional sequences are added to the analysis. Nucleotide sequences are particularly susceptible, with MSA errors leading to statistical support for long-branch attraction artifacts, which are usually associated with gross substitution model misspecification. Amino acid MSAs are more robust, but do tend to arbitrarily resolve multifurcations in favor of the guide tree. No inference strategies produce consistently accurate estimates of divergence between sequences, although amino acid MSAs are again more accurate than their nucleotide counterparts. We conclude with some practical suggestions about how to limit the effect of MSA uncertainty on evolutionary inference.
Affiliation(s)
- A S Md Mukarram Hossain
  - Faculty of Life Sciences, University of Manchester, United Kingdom
  - Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Sweden
- Abhijeet Shah
  - Faculty of Life Sciences, University of Manchester, United Kingdom
- Simon Whelan
  - Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Sweden
8
Abstract
Numerous computational methods exist to assess the mode and strength of natural selection in protein-coding sequences, yet how distinct methods relate to one another remains largely unknown. Here, we elucidate the relationship between two widely used phylogenetic modeling frameworks: dN/dS models and mutation-selection (MutSel) models. We derive a mathematical relationship between dN/dS and scaled selection coefficients, the focal parameters of MutSel models, and use this relationship to gain deeper insight into the behaviors, limitations, and applicabilities of these two modeling frameworks. We prove that, if all synonymous changes are neutral, standard MutSel models correspond to dN/dS ≤ 1. However, if synonymous codons differ in fitness, dN/dS can take on arbitrarily high values even if all selection is purifying. Thus, the MutSel modeling framework cannot necessarily accommodate positive, diversifying selection, while dN/dS cannot distinguish between purifying selection on synonymous codons and positive selection on amino acids. We further propose a new benchmarking strategy of dN/dS inferences against MutSel simulations and demonstrate that the widely used Goldman-Yang-style dN/dS models yield substantially biased dN/dS estimates on realistic sequence data. In contrast, the less frequently used Muse-Gaut-style models display much less bias. Strikingly, the least-biased and most precise dN/dS estimates are never found in the models with the best fit to the data, measured through both AIC and BIC scores. Thus, selecting models based on goodness-of-fit criteria can yield poor parameter estimates if the models considered do not precisely correspond to the underlying mechanism that generated the data. In conclusion, establishing mathematical links among modeling frameworks represents a novel, powerful strategy to pinpoint previously unrecognized model limitations and strengths.
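The bound dN/dS ≤ 1 under neutral synonymous change can be illustrated with a deliberately simplified toy case. The Python sketch below assumes a single site with only two alleles, symmetric mutation, and the usual weak-mutation fixation factor S / (1 - exp(-S)) for a scaled selection coefficient S. It is not the authors' derivation, and the function names are hypothetical. Under these assumptions the equilibrium substitution rate relative to neutrality works out to S / sinh(S), which never exceeds 1.

```python
import math


def relative_fixation_rate(s):
    """Fixation rate of a mutation with scaled selection coefficient s,
    relative to a neutral mutation (limit is 1 as s approaches 0)."""
    if abs(s) < 1e-12:
        return 1.0
    return s / (1.0 - math.exp(-s))


def two_allele_rate_ratio(s):
    """Equilibrium substitution rate relative to neutrality for a toy
    two-allele site, where allele B has scaled fitness advantage s over A.

    Assumes symmetric mutation, so stationary frequencies are
    proportional to exp(scaled fitness).
    """
    pi_a = 1.0 / (1.0 + math.exp(s))
    pi_b = 1.0 - pi_a
    # A -> B substitutions see +s; B -> A substitutions see -s.
    return pi_a * relative_fixation_rate(s) + pi_b * relative_fixation_rate(-s)


# The ratio equals 1 only when both alleles are equally fit, and it falls
# towards 0 as the fitness difference grows in either direction.
ratios = {s: two_allele_rate_ratio(s) for s in (0.0, 0.5, 2.0, -3.0)}
```

The symmetry in the sign of S reflects detailed balance: at equilibrium, flux towards the fitter allele matches flux away from it, so strong preferences at a site lower the overall rate in this toy setting.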
Affiliation(s)
- Stephanie J Spielman
  - Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute of Cellular and Molecular Biology, The University of Texas at Austin
- Claus O Wilke
  - Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute of Cellular and Molecular Biology, The University of Texas at Austin
9
Brown JM. Detection of implausible phylogenetic inferences using posterior predictive assessment of model fit. Syst Biol 2014; 63:334-48. [PMID: 24415681 DOI: 10.1093/sysbio/syu002] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Indexed: 11/12/2022]
Abstract
Systematic phylogenetic error caused by the simplifying assumptions made in models of molecular evolution may be impossible to avoid entirely when attempting to model evolution across massive, diverse data sets. However, not all deficiencies of inference models result in unreliable phylogenetic estimates. The field of phylogenetics lacks a direct method to identify cases where model specification adversely affects inferences. Posterior predictive simulation is a flexible and intuitive approach for assessing goodness-of-fit of the assumed model and priors in a Bayesian phylogenetic analysis. Here, I propose new test statistics for use in posterior predictive assessment of model fit. These test statistics compare phylogenetic inferences from posterior predictive data sets to inferences from the original data. A simulation study demonstrates the utility of these new statistics. The new tests reject the plausibility of inferred tree lengths or topologies more often when data/model combinations produce biased inferences. I also apply this approach to exemplar empirical data sets, highlighting the value of the novel assessments.
Affiliation(s)
- Jeremy M Brown
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
10
Mallatt J, Chittenden KD. The GC content of LSU rRNA evolves across topological and functional regions of the ribosome in all three domains of life. Mol Phylogenet Evol 2014; 72:17-30. [PMID: 24394731 DOI: 10.1016/j.ympev.2013.12.007] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/01/2013] [Revised: 11/28/2013] [Accepted: 12/24/2013] [Indexed: 12/21/2022]
Abstract
Large-subunit rRNA is the ribozyme that catalyzes protein synthesis by translation, and many of its features vary along a deep-to-superficial gradient. By measuring the G+C proportions in this rRNA in all three domains of life (60 bacteria, 379 eukaryote, and 23 archaean sequences), we tested whether the proportion of GC nucleotides varies along this in-out gradient. The rRNA regions used were several zones identified by Bokov and Steinberg (2009) as being arranged from deep to superficial within the LSU. To the Bokov-Steinberg zones, we added the most superficial zone of all, the divergent domains (expansion segments), which are greatly enlarged in eukaryotes. Regression lines constructed from the hundreds of species of organisms revealed the expected in-out gradient, showing that species with high %GC (or high %AT) in their rRNA distribute more of these abundant nucleotides into the peripheral zones. This could be explained by the evolutionary rates of replacement of all nucleotides (A, C, G, T), because these latter rates are fastest at the periphery and slowest near the conserved core. As an overall explanation, we propose that when extrinsic factors (whole-genome nucleotide composition, or environmental temperature) demand the percentage of GC in the rRNA of a species be high or low, then the deep-lying zones are buffered against GC variation because they are the slowest to evolve. The deep, conserved zones are also the most involved in translation, hinting that stabilizing selection there prevents a high GC variability that would diminish LSU rRNA's core functions. We found only a few domain-specific trends in rRNA-GC distribution, which relate to many Archaea living at high temperatures or to the highly complex genes and adaptations of Eukaryota. Use of rRNA sequences in molecular phylogenetic studies, for reconstructing the relationships of organisms across the tree of life, requires accurate models of how rRNA evolves. The demonstration that GC distributes in regular patterns across rRNA regions can improve these tree-reconstruction models in the future and should yield phylogenies of greater accuracy.
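The quantity measured throughout this abstract, the G+C proportion of an rRNA region, is straightforward to compute. A minimal Python sketch follows, assuming plain nucleotide strings as input; the function name `gc_fraction`, the handling of ambiguous characters, and the example zones are illustrative assumptions and do not reproduce the authors' pipeline.

```python
def gc_fraction(sequence):
    """Return the proportion of G and C among unambiguous nucleotides.

    Accepts DNA (T) or RNA (U) letters in either case. Gap characters
    and ambiguity codes such as N are ignored rather than counted.
    """
    unambiguous = [base for base in sequence.upper() if base in "ACGTU"]
    if not unambiguous:
        raise ValueError("sequence contains no unambiguous nucleotides")
    gc_count = sum(1 for base in unambiguous if base in "GC")
    return gc_count / len(unambiguous)


# Hypothetical example: compare a "core" zone with a "peripheral" zone.
zones = {
    "core": "GGCCAAUU",
    "periphery": "GGGCCCGA",
}
gc_by_zone = {name: gc_fraction(seq) for name, seq in zones.items()}
```

Computing this value per zone and per species gives the data points to which regression lines of the kind described above could be fitted.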
Affiliation(s)
- Jon Mallatt
  - School of Biological Sciences, Washington State University, Pullman, WA 99164-4236, United States
- Kevin D Chittenden
  - School of Biological Sciences, Washington State University, Pullman, WA 99164-4236, United States
11
Roquet C, Thuiller W, Lavergne S. Building megaphylogenies for macroecology: taking up the challenge. Ecography 2013; 36:13-26. [PMID: 24790290 PMCID: PMC4001083 DOI: 10.1111/j.1600-0587.2012.07773.x] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Indexed: 05/05/2023]
Abstract
The last decades have seen an upsurge in ecological studies incorporating phylogenetic information with increasing species samples, motivated by the common conjecture that species with common ancestors should share some ecological characteristics due to niche conservatism. This has been carried out using various methods of increasing complexity and reliability: using only taxonomical classification; constructing supertrees that incorporate only topological information from previously published phylogenies; or building supermatrices of molecular data that are used to estimate phylogenies with evolutionarily meaningful branch lengths. Although the latter option is more informative than the others, it remains under-used in ecology because ecologists are generally unaware of or unfamiliar with modern molecular phylogenetic methods. However, a solid phylogenetic hypothesis is necessary to conduct reliable ecological analyses that integrate evolutionary aspects. Our aim here is to clarify the concepts and methodological issues associated with the reconstruction of dated megaphylogenies, and to show that it is now possible to obtain accurate and well-sampled megaphylogenies with informative branch lengths on large species samples. This is possible thanks to improved phylogenetic methods, vast amounts of molecular data available from databases such as GenBank, and consensus knowledge on deep phylogenetic relationships for an increasing number of groups of organisms. Finally, we include a detailed step-by-step workflow pipeline (Supplementary material), from data acquisition to phylogenetic inference, mainly based on the R environment (widely used by ecologists) and the use of free web-servers, that has been applied to the reconstruction of a species-level phylogeny of all breeding birds of Europe.
Affiliation(s)
- Cristina Roquet
  - Laboratoire d'Ecologie Alpine, UMR-CNRS 5553, Univ. Joseph Fourier, Grenoble 1, BP 53, FR-38041 Grenoble Cedex 9, France
- Wilfried Thuiller
  - Laboratoire d'Ecologie Alpine, UMR-CNRS 5553, Univ. Joseph Fourier, Grenoble 1, BP 53, FR-38041 Grenoble Cedex 9, France
- Sébastien Lavergne
  - Laboratoire d'Ecologie Alpine, UMR-CNRS 5553, Univ. Joseph Fourier, Grenoble 1, BP 53, FR-38041 Grenoble Cedex 9, France
12
Warnow T. Large-Scale Multiple Sequence Alignment and Phylogeny Estimation. Models and Algorithms for Genome Evolution 2013. [DOI: 10.1007/978-1-4471-5298-9_6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 12/30/2022]
13
Wu CH, Suchard MA, Drummond AJ. Bayesian selection of nucleotide substitution models and their site assignments. Mol Biol Evol 2012; 30:669-88. [PMID: 23233462 PMCID: PMC3563969 DOI: 10.1093/molbev/mss258] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Indexed: 12/21/2022]
Abstract
Probabilistic inference of a phylogenetic tree from molecular sequence data is predicated on a substitution model describing the relative rates of change between character states along the tree for each site in the multiple sequence alignment. Commonly, one assumes that the substitution model is homogeneous across sites within large partitions of the alignment, assigns these partitions a priori, and then fixes their underlying substitution model to the best-fitting model from a hierarchy of named models. Here, we introduce an automatic model selection and model averaging approach within a Bayesian framework that simultaneously estimates the number of partitions, the assignment of sites to partitions, the substitution model for each partition, and the uncertainty in these selections. This new approach is implemented as an add-on to the BEAST 2 software platform. We find that this approach dramatically improves the fit of the nucleotide substitution model compared with existing approaches, and we show, using a number of example data sets, that as many as nine partitions are required to explain the heterogeneity in nucleotide substitution process across sites in a single gene analysis. In some instances, this improved modeling of the substitution process can have a measurable effect on downstream inference, including the estimated phylogeny, relative divergence times, and effective population size histories.
Affiliation(s)
- Chieh-Hsi Wu
  - Department of Computer Science, University of Auckland, Auckland, New Zealand
14
Holland BR, Jarvis PD, Sumner JG. Low-Parameter Phylogenetic Inference Under the General Markov Model. Syst Biol 2012; 62:78-92. [DOI: 10.1093/sysbio/sys072] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Indexed: 11/14/2022]
Affiliation(s)
- Barbara R. Holland
  - School of Mathematics and Physics, University of Tasmania, Hobart 7001, Australia
- Peter D. Jarvis
  - School of Mathematics and Physics, University of Tasmania, Hobart 7001, Australia
- Jeremy G. Sumner
  - School of Mathematics and Physics, University of Tasmania, Hobart 7001, Australia
15
Sumner JG, Fernández-Sánchez J, Jarvis PD. Lie Markov models. J Theor Biol 2011; 298:16-31. [PMID: 22212913 DOI: 10.1016/j.jtbi.2011.12.017] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 07/29/2011] [Revised: 12/15/2011] [Accepted: 12/16/2011] [Indexed: 10/14/2022]
Abstract
Recent work has discussed the importance of multiplicative closure for the Markov models used in phylogenetics. For continuous-time Markov chains, a sufficient condition for multiplicative closure of a model class is ensured by demanding that the set of rate-matrices belonging to the model class form a Lie algebra. It is the case that some well-known Markov models do form Lie algebras and we refer to such models as "Lie Markov models". However it is also the case that some other well-known Markov models unequivocally do not form Lie algebras (GTR being the most conspicuous example). In this paper, we will discuss how to generate Lie Markov models by demanding that the models have certain symmetries under nucleotide permutations. We show that the Lie Markov models include, and hence provide a unifying concept for, "group-based" and "equivariant" models. For each of two and four character states, the full list of Lie Markov models with maximal symmetry is presented and shown to include interesting examples that are neither group-based nor equivariant. We also argue that our scheme is pleasing in the context of applied phylogenetics, as, for a given symmetry of nucleotide substitution, it provides a natural hierarchy of models with increasing number of parameters. We also note that our methods are applicable to any application of continuous-time Markov chains beyond the initial motivations we take from phylogenetics.
Affiliation(s)
- J G Sumner
  - School of Mathematics and Physics, University of Tasmania, Australia
16
Jordan G, Goldman N. The Effects of Alignment Error and Alignment Filtering on the Sitewise Detection of Positive Selection. Mol Biol Evol 2011; 29:1125-39. [DOI: 10.1093/molbev/msr272] [Citation(s) in RCA: 156] [Impact Index Per Article: 12.0] [Indexed: 12/22/2022]
17
Addressing inter-gene heterogeneity in maximum likelihood phylogenomic analysis: yeasts revisited. PLoS One 2011; 6:e22783. [PMID: 21850235 PMCID: PMC3151265 DOI: 10.1371/journal.pone.0022783] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Received: 02/09/2011] [Accepted: 07/05/2011] [Indexed: 11/19/2022]
Abstract
Phylogenomic approaches to the resolution of inter-species relationships have become well established in recent years. Often these involve concatenation of many orthologous genes found in the respective genomes followed by analysis using standard phylogenetic models. Genome-scale data promise increased resolution by minimising sampling error, yet are associated with well-known but often inappropriately addressed caveats arising through data heterogeneity and model violation. These can lead to the reconstruction of highly-supported but incorrect topologies. With the aim of obtaining a species tree for 18 species within the ascomycetous yeasts, we have investigated the use of appropriate evolutionary models to address inter-gene heterogeneities and the scalability and validity of supermatrix analysis as the phylogenetic problem becomes more difficult and the number of genes analysed approaches truly phylogenomic dimensions. We have extended a widely-known early phylogenomic study of yeasts by adding additional species to increase diversity and augmenting the number of genes under analysis. We have investigated sophisticated maximum likelihood analyses, considering not only a concatenated version of the data but also partitioned models where each gene constitutes a partition and parameters are free to vary between the different partitions (thereby accounting for variation in the evolutionary processes at different loci). We find considerable increases in likelihood using these complex models, arguing for the need for appropriate models when analyzing phylogenomic data. Using these methods, we were able to reconstruct a well-supported tree for 18 ascomycetous yeasts spanning about 250 million years of evolution.
18
Regier JC, Zwick A. Sources of signal in 62 protein-coding nuclear genes for higher-level phylogenetics of arthropods. PLoS One 2011; 6:e23408. [PMID: 21829732 PMCID: PMC3150433 DOI: 10.1371/journal.pone.0023408] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Received: 01/24/2011] [Accepted: 07/15/2011] [Indexed: 11/25/2022]
Abstract
BACKGROUND: This study aims to investigate the strength of various sources of phylogenetic information that led to recent seemingly robust conclusions about higher-level arthropod phylogeny and to assess the role of excluding or downweighting synonymous change for arriving at those conclusions.
METHODOLOGY/PRINCIPAL FINDINGS: The current study analyzes DNA sequences from 68 gene segments of 62 distinct protein-coding nuclear genes for 80 species. Gene segments analyzed individually support numerous nodes recovered in combined-gene analyses, but few of the higher-level nodes of greatest current interest. However, neither is there support for conflicting alternatives to these higher-level nodes. Gene segments with higher rates of nonsynonymous change tend to be more informative overall, but those with lower rates tend to provide stronger support for deeper nodes. Higher-level nodes with bootstrap values in the 80% - 99% range for the complete data matrix are markedly more sensitive to substantial drops in their bootstrap percentages after character subsampling than those with 100% bootstrap, suggesting that these nodes are likely not to have been strongly supported with many fewer data than in the full matrix. Data set partitioning of total data by (mostly) synonymous and (mostly) nonsynonymous change improves overall node support, but the result remains much inferior to analysis of (unpartitioned) nonsynonymous change alone. Clusters of genes with similar nonsynonymous rate properties (e.g., faster vs. slower) show some distinct patterns of node support but few conflicts. Synonymous change is shown to contribute little, if any, phylogenetic signal to the support of higher-level nodes, but it does contribute nonphylogenetic signal, probably through its underlying heterogeneous nucleotide composition. Analysis of seemingly conservative indels does not prove useful.
CONCLUSIONS: Generating a robust molecular higher-level phylogeny of Arthropoda is currently possible with large amounts of data and an exclusive reliance on nonsynonymous change.
Affiliation(s)
- Jerome C. Regier
- Institute for Bioscience and Biotechnology Research, University of Maryland, College Park, Maryland, United States of America
- Department of Entomology, University of Maryland, College Park, Maryland, United States of America
- Center for Biosystems Research, University of Maryland Biotechnology Institute, College Park, Maryland, United States of America
- Andreas Zwick
- Center for Biosystems Research, University of Maryland Biotechnology Institute, College Park, Maryland, United States of America
- Entomology, State Museum of Natural History, Stuttgart, Germany
19
Ané C. Detecting phylogenetic breakpoints and discordance from genome-wide alignments for species tree reconstruction. Genome Biol Evol 2011; 3:246-58. [PMID: 21362638 PMCID: PMC3070431 DOI: 10.1093/gbe/evr013] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022]
Abstract
With the easy acquisition of sequence data, it is now possible to obtain and align whole genomes across multiple related species or populations. In this work, I assess the performance of a statistical method to reconstruct the whole distribution of phylogenetic trees along the genome, estimate the proportion of the genome for which a given clade is true, and infer a concordance tree that summarizes the dominant vertical inheritance pattern. There are two main issues when dealing with whole-genome alignments, as opposed to multiple genes: the size of the data and the detection of recombination breakpoints. These breakpoints partition the genomic alignment into phylogenetically homogeneous loci, where sites within a given locus all share the same phylogenetic tree topology. To delimitate these loci, I describe here a method based on the minimum description length (MDL) principle, implemented with dynamic programming for computational efficiency. Simulations show that combining MDL partitioning with Bayesian concordance analysis provides an efficient and robust way to estimate both the vertical inheritance signal and the horizontal phylogenetic signal. The method performed well both in the presence of incomplete lineage sorting and in the presence of horizontal gene transfer. A high level of systematic bias was found here, highlighting the need for good individual tree building methods, which form the basis for more elaborate gene tree/species tree reconciliation methods.
Affiliation(s)
- Cécile Ané
- Departments of Statistics and Botany, University of Wisconsin-Madison, USA.
20
Wang HC, Susko E, Roger AJ. Fast statistical tests for detecting heterotachy in protein evolution. Mol Biol Evol 2011; 28:2305-15. [PMID: 21343603 DOI: 10.1093/molbev/msr050] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 01/20/2023]
Abstract
The w statistic introduced by Lockhart et al. (1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol Biol Evol. 15:1183-1188) is a simple and easily calculated statistic intended to detect heterotachy by comparing amino acid substitution patterns between two monophyletic groups of protein sequences. It is defined as the difference between the fraction of varied sites in both groups and the fraction of varied sites in each group. The w test has been used to distinguish a covarion process from equal rates and rates variation across sites processes. Using simulation we show that the w test is effective for small data sets and for data sets that have low substitution rates in the groups but can have difficulties when these conditions are not met. Using site entropy as a measure of variability of a sequence site, we modify the w statistic to a w' statistic by assigning as varied in one group those sites that are actually varied in both groups but have a large entropy difference. We show that the w' test has more power to detect two kinds of heterotachy processes (covarion and bivariate rate shifts) in large and variable data. We also show that a test of Pearson's correlation of the site entropies between two monophyletic groups can be used to detect heterotachy and has more power than the w' test. Furthermore, we demonstrate that there are settings where the correlation test as well as w and w' tests do not detect heterotachy signals in data simulated under a branch length mixture model. In such cases, it is sometimes possible to detect heterotachy through subselection of appropriate taxa. Finally, we discuss the abilities of the three statistical tests to detect a fourth mode of heterotachy: lineage-specific changes in proportion of variable sites.
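The correlation test mentioned in this abstract rests on a per-site entropy computed separately for each of the two monophyletic groups. A minimal sketch of that underlying quantity follows, assuming ungapped alignments of equal length given as lists of strings and Shannon entropy in nats. The function names are hypothetical and this is not the authors' implementation. The abstract does not state how the correlation is converted into a significance test, so the sketch stops at the correlation itself.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in nats) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined: zero variance in entropies")
    return sxy / math.sqrt(sxx * syy)


def entropy_correlation(group_a, group_b):
    """Correlate per-site entropies of two groups of aligned sequences.

    Assumption: both groups come from the same alignment, so column i in
    group_a corresponds to column i in group_b.
    """
    ent_a = [site_entropy(col) for col in zip(*group_a)]
    ent_b = [site_entropy(col) for col in zip(*group_b)]
    return pearson(ent_a, ent_b)
```

If sites that are variable in one group are also variable in the other, the correlation is high. Sites that are conserved in one group but variable in the other pull it down, which is the pattern heterotachy would be expected to produce.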
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada.
21
den Bakker HC, Bundrant BN, Fortes ED, Orsi RH, Wiedmann M. A population genetics-based and phylogenetic approach to understanding the evolution of virulence in the genus Listeria. Appl Environ Microbiol 2010; 76:6085-100. [PMID: 20656873 PMCID: PMC2937515 DOI: 10.1128/aem.00447-10] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.8] [Received: 02/18/2010] [Accepted: 07/12/2010] [Indexed: 11/20/2022]
Abstract
The genus Listeria includes (i) the opportunistic pathogens L. monocytogenes and L. ivanovii, (ii) the saprotrophs L. innocua, L. marthii, and L. welshimeri, and (iii) L. seeligeri, an apparent saprotroph that nevertheless typically contains the prfA virulence gene cluster. A novel 10-loci multilocus sequence typing scheme was developed and used to characterize 67 isolates representing six Listeria spp. (excluding L. grayi) in order to (i) provide an improved understanding of the phylogeny and evolution of the genus Listeria and (ii) use Listeria as a model to study the evolution of pathogenicity in opportunistic environmental pathogens. Phylogenetic analyses identified six well-supported Listeria species that group into two main subdivisions, with each subdivision containing strains with and without the prfA virulence gene cluster. Stochastic character mapping and phylogenetic analysis of hly, a gene in the prfA cluster, suggest that the common ancestor of the genus Listeria contained the prfA virulence gene cluster and that this cluster was lost at least five times during the evolution of Listeria, yielding multiple distinct saprotrophic clades. L. welshimeri, which appears to represent the most ancient clade that arose from an ancestor with a prfA cluster deletion, shows a considerably lower average sequence divergence than other Listeria species, suggesting a population bottleneck and a putatively different ecology than other saprotrophic Listeria species. Overall, our data suggest that, for some pathogens, loss of virulence genes may represent a selective advantage, possibly by facilitating adaptation to a specific ecological niche.
Affiliation(s)
- Henk C den Bakker
- Department of Food Science, Cornell University, Ithaca, New York 14853, USA.
22
Whelan S, Blackburne BP, Spencer M. Phylogenetic substitution models for detecting heterotachy during plastid evolution. Mol Biol Evol 2010; 28:449-58. [PMID: 20724379 DOI: 10.1093/molbev/msq215] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022]
Abstract
There is widespread evidence of lineage-specific rate variation, known as heterotachy, during protein evolution. Changes in the structural and functional constraints acting on a protein can lead to heterotachy, and it is plausible that such changes, known as covarion shifts, may affect many amino acids at once. Several previous attempts to model heterotachy have used covarion models, where the sequence undergoes covarion drift, whereby each site may switch independently among a set of discrete classes having different substitution rates. However, such independent switching may not capture biologically important events where the selective forces acting on a protein affect many sites at once. We describe a new class of models that allow the rates of substitution and switching to vary among branches of a phylogenetic tree. Such models are better able to handle covarion shifts. We apply these models to a set of genes occurring in nonphotosynthetic bacteria, cyanobacteria, and the plastids of green and red algae. We find that 4/5 genes show evidence of some form of rate switching and that 3/5 genes show evidence that the relative switching rate differs among taxonomic groups. We conclude that covarion shifts may be frequent during the deep evolution of plastid genes and that our methodology may provide a powerful new tool for investigating such shifts in other systems.
Affiliation(s)
- Simon Whelan
- Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom.
23
Stöver BC, Müller KF. TreeGraph 2: combining and visualizing evidence from different phylogenetic analyses. BMC Bioinformatics 2010; 11:7. [PMID: 20051126 PMCID: PMC2806359 DOI: 10.1186/1471-2105-11-7] [Citation(s) in RCA: 895] [Impact Index Per Article: 63.9] [Received: 07/17/2009] [Accepted: 01/05/2010] [Indexed: 01/22/2023]
Abstract
Background: Today it is common to apply multiple potentially conflicting data sources to a given phylogenetic problem. At the same time, several different inference techniques are routinely employed instead of relying on just one. In view of both trends it is becoming increasingly important to be able to efficiently compare different sets of statistical values supporting (or conflicting with) the nodes of a given tree topology, and merging this into a meaningful representation. A tree editor supporting this should also allow for flexible editing operations and be able to produce ready-to-publish figures.
Results: We developed TreeGraph 2, a GUI-based graphical editor for phylogenetic trees (available from http://treegraph.bioinfweb.info). It allows automatically combining information from different phylogenetic analyses of a given dataset (or from different subsets of the dataset), and helps to identify and graphically present incongruences. The program features versatile editing and formatting options, such as automatically setting line widths or colors according to the value of any of the unlimited number of variables that can be assigned to each node or branch. These node/branch data can be imported from spread sheets or other trees, be calculated from each other by specified mathematical expressions, filtered, copied from and to other internal variables, be kept invisible or set visible and then be freely formatted (individually or across the whole tree). Beyond typical editing operations such as tree rerooting and ladderizing or moving and collapsing of nodes, whole clades can be copied from other files and be inserted (along with all node/branch data and legends), but can also be manually added and, thus, whole trees can quickly be manually constructed de novo. TreeGraph 2 outputs various graphic formats such as SVG, PDF, or PNG, useful for tree figures in both publications and presentations.
Conclusion: TreeGraph 2 is a user-friendly, fully documented application to produce ready-to-publish trees. It can display any number of annotations in several ways, and permits easily importing and combining them. Additionally, a great number of editing- and formatting-operations is available.
Affiliation(s)
- Ben C Stöver
- Institute for Evolution and Biodiversity, University of Münster, Hüfferstrasse 1, 48149 Münster, Germany.
24
Wang HC, Susko E, Roger AJ. PROCOV: maximum likelihood estimation of protein phylogeny under covarion models and site-specific covarion pattern analysis. BMC Evol Biol 2009; 9:225. [PMID: 19737395 PMCID: PMC2758850 DOI: 10.1186/1471-2148-9-225] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 03/24/2009] [Accepted: 09/08/2009] [Indexed: 11/12/2022]
Abstract
Background: The covarion hypothesis of molecular evolution holds that selective pressures on a given amino acid or nucleotide site are dependent on the identity of other sites in the molecule that change throughout time, resulting in changes of evolutionary rates of sites along the branches of a phylogenetic tree. At the sequence level, covarion-like evolution at a site manifests as conservation of nucleotide or amino acid states among some homologs where the states are not conserved in other homologs (or groups of homologs). Covarion-like evolution has been shown to relate to changes in functions at sites in different clades, and, if ignored, can adversely affect the accuracy of phylogenetic inference.
Results: PROCOV (protein covarion analysis) is a software tool that implements a number of previously proposed covarion models of protein evolution for phylogenetic inference in a maximum likelihood framework. Several algorithmic and implementation improvements in this tool over previous versions make computationally expensive tree searches with covarion models more efficient and analyses of large phylogenomic data sets tractable. PROCOV can be used to identify covarion sites by comparing the site likelihoods under the covarion process to the corresponding site likelihoods under a rates-across-sites (RAS) process. Those sites with the greatest log-likelihood difference between a 'covarion' and an RAS process were found to be of functional or structural significance in a dataset of bacterial and eukaryotic elongation factors.
Conclusion: Covarion models implemented in PROCOV may be especially useful for phylogenetic estimation when ancient divergences between sequences have occurred and rates of evolution at sites are likely to have changed over the tree. It can also be used to study lineage-specific functional shifts in protein families that result in changes in the patterns of site variability among subtrees.
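The site-ranking step described in this abstract (comparing per-site log-likelihoods under a covarion process with those under an RAS process) can be illustrated with a short sketch. It assumes the two lists of per-site log-likelihoods have already been produced by some phylogenetics program. The function name and input format are hypothetical and do not reflect PROCOV's actual interface or output files.

```python
def rank_sites_by_covarion_support(lnl_covarion, lnl_ras, top=10):
    """Rank alignment sites by lnL(covarion) - lnL(RAS), largest first.

    Both arguments are per-site log-likelihoods for the same alignment,
    in the same site order. Returns (site_index, difference) pairs using
    0-based site indices.
    """
    if len(lnl_covarion) != len(lnl_ras):
        raise ValueError("per-site likelihood lists differ in length")
    diffs = [(cov - ras, i)
             for i, (cov, ras) in enumerate(zip(lnl_covarion, lnl_ras))]
    # Sort on the difference only; ties keep their original site order.
    diffs.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, d) for d, i in diffs[:top]]
```

Sites at the top of the returned list are those the covarion model explains much better than the RAS model. They are candidates for closer inspection, not confirmed covarion sites.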
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada.
25
Abstract
Heterotachy is a general term to describe positions in a sequence that evolve at different rates in different lineages. Kolaczkowski and Thornton (2004. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431:980-984.) recently described an intriguing heterotachy model that leads to topological bias for likelihood-based methods and parsimony methods. In this article, we show that heterotachy can generally be viewed as multivariate rates-across-sites variation, which can be described as randomly drawing rates (or branch lengths) from a multivariate distribution for each branch at each site. Motivated by this idea, we propose a pairwise alpha heterotachy adjustment model, which gives us much improved topological estimation in the settings by Kolaczkowski and Thornton (2004).
Affiliation(s)
- Jihua Wu
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada.
26
Whelan S. The genetic code can cause systematic bias in simple phylogenetic models. Philos Trans R Soc Lond B Biol Sci 2009; 363:4003-11. [PMID: 18852102 DOI: 10.1098/rstb.2008.0171] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
Phylogenetic analysis depends on inferential methodology estimating accurately the degree of divergence between sequences. Inaccurate estimates can lead to misleading evolutionary inferences, including incorrect tree topology estimates and poor dating of historical species divergence. Protein coding sequences are ubiquitous in phylogenetic inference, but many of the standard methods commonly used to describe their evolution do not explicitly account for the dependencies between sites in a codon induced by the genetic code. This study evaluates the performance of several standard methods on datasets simulated under a simple substitution model, describing codon evolution under a range of different types of selective pressures. This approach also offers insights into the relative performance of different phylogenetic methods when there are dependencies acting between the sites in the data. Methods based on statistical models performed well when there was no or limited purifying selection in the simulated sequences (low degree of dependency between sites in a codon), although more biologically realistic models tended to outperform simpler models. Phylogenetic methods exhibited greater variability in performance for sequences simulated under strong purifying selection (high degree of the dependencies between sites in a codon). Simple models substantially underestimate the degree of divergence between sequences, and underestimation was more pronounced on the internal branches of the tree. This underestimation resulted in some statistical methods performing poorly and exhibiting evidence for systematic bias in tree inference. Amino acid-based and nucleotide models that contained generic descriptions of spatial and temporal heterogeneity, such as mixture and temporal hidden Markov models, coped notably better, producing more accurate estimates of evolutionary divergence and the tree topology.
Affiliation(s)
- Simon Whelan
- Faculty of Life Sciences, University of Manchester, Michael Smith Building, Oxford Road, Manchester M13 9PT, UK.
27
Spencer M, Sangaralingam A. A phylogenetic mixture model for gene family loss in parasitic bacteria. Mol Biol Evol 2009; 26:1901-8. [PMID: 19435739 DOI: 10.1093/molbev/msp102] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022]
Abstract
Gene families are frequently gained and lost from prokaryotic genomes. It is widely believed that the rate of loss was accelerated for some but not all gene families in lineages that became parasites or endosymbionts. This leads to a form of heterotachy that may be responsible for the poor performance of phylogeny estimation based on gene content. We describe a mixture model that accounts for this heterotachy. We show that this model fits data on the distribution of gene families across bacteria from the COG database much better than previous models. However, it still favors an artifactual tree topology in which parasites form a clade over the more plausible 16S topology. In contrast to a previous model of genome dynamics, our model suggests that the ancestral bacterium had a small genome. We suggest that models of gene family gain and loss are likely to be more useful for understanding genome dynamics than for estimating phylogenetic trees.
Affiliation(s)
- Matthew Spencer
- School of Biological Sciences, University of Liverpool, Liverpool, UK.