1
Peyretaillade E, Akossi RF, Tournayre J, Delbac F, Wawrzyniak I. How to overcome constraints imposed by microsporidian genome features to ensure gene prediction? J Eukaryot Microbiol 2024; 71:e13038. [PMID: 38934348 DOI: 10.1111/jeu.13038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2024] [Revised: 06/03/2024] [Accepted: 06/10/2024] [Indexed: 06/28/2024]
Abstract
Since the advent of sequencing techniques and due to their continuous evolution, it has become easier and less expensive to obtain the complete genome sequence of any organism. Nevertheless, to elucidate all biological processes governing organism development, quality annotation is essential. In genome annotation, predicting gene structure is one of the most important and captivating challenges for computational biology. This aspect of annotation requires continual optimization, particularly for genomes as unusual as those of microsporidia. Indeed, this group of fungal-related parasites exhibits specific features (highly reduced gene sizes, sequences with high rate of evolution) linked to their evolution as intracellular parasites, requiring the implementation of specific annotation approaches to consider all these features. This review aimed to outline these characteristics and to assess the increasingly efficient approaches and tools that have enhanced the accuracy of gene prediction for microsporidia, both in terms of sensitivity and specificity. Subsequently, a final part will be dedicated to postgenomic approaches aimed at reinforcing the annotation data generated by prediction software. These approaches include the characterization of other understudied genes, such as those encoding regulatory noncoding RNAs or very small proteins, which also play crucial roles in the life cycle of these microorganisms.
Affiliation(s)
- Reginal F Akossi
- LMGE, CNRS, Université Clermont Auvergne, Clermont-Ferrand, France
- Jérémy Tournayre
- INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, Saint-Genès-Champanelle, France
- Frédéric Delbac
- LMGE, CNRS, Université Clermont Auvergne, Clermont-Ferrand, France
- Ivan Wawrzyniak
- LMGE, CNRS, Université Clermont Auvergne, Clermont-Ferrand, France
2
Williams TA, Schrempf D, Szöllősi GJ, Cox CJ, Foster PG, Embley TM. Inferring the deep past from molecular data. Genome Biol Evol 2021; 13:6192802. [PMID: 33772552 PMCID: PMC8175050 DOI: 10.1093/gbe/evab067] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Accepted: 03/22/2021] [Indexed: 12/17/2022]
Abstract
There is an expectation that analyses of molecular sequences might be able to distinguish between alternative hypotheses for ancient relationships, but the phylogenetic methods used and types of data analyzed are of critical importance in any attempt to recover historical signal. Here, we discuss some common issues that can influence the topology of trees obtained when using overly simple models to analyze molecular data that often display complicated patterns of sequence heterogeneity. To illustrate our discussion, we have used three examples of inferred relationships which have changed radically as models and methods of analysis have improved. In two of these examples, the sister-group relationship between thermophilic Thermus and mesophilic Deinococcus, and the position of long-branch Microsporidia among eukaryotes, we show that recovering what is now generally considered to be the correct tree is critically dependent on the fit between model and data. In the third example, the position of eukaryotes in the tree of life, the hypothesis that is currently supported by the best available methods is fundamentally different from the classical view of relationships between major cellular domains. Since heterogeneity appears to be pervasive and varied among all molecular sequence data, and even the best available models can still struggle to deal with some problems, the issues we discuss are generally relevant to phylogenetic analyses. It remains essential to maintain a critical attitude to all trees as hypotheses of relationship that may change with more data and better methods.
Affiliation(s)
- Tom A Williams
- School of Biological Sciences, University of Bristol, Bristol BS8 1TQ, United Kingdom
- Dominik Schrempf
- Dept. of Biological Physics, Eötvös Loránd University, 1117 Budapest, Hungary
- Gergely J Szöllősi
- Dept. of Biological Physics, Eötvös Loránd University, 1117 Budapest, Hungary; MTA-ELTE "Lendület" Evolutionary Genomics Research Group, 1117 Budapest, Hungary; Institute of Evolution, Centre for Ecological Research, 1121 Budapest, Hungary
- Cymon J Cox
- Centro de Ciências do Mar, Universidade do Algarve, Gambelas, 8005-319 Faro, Portugal
- Peter G Foster
- Department of Life Sciences, Natural History Museum, London SW7 5BD, United Kingdom
- T Martin Embley
- Biosciences Institute, Centre for Bacterial Cell Biology, Newcastle University, Newcastle upon Tyne NE2 4AX, United Kingdom
3
Goremykin V. A Novel Test for Absolute Fit of Evolutionary Models Provides a Means to Correctly Identify the Substitution Model and the Model Tree. Genome Biol Evol 2020; 11:2403-2419. [PMID: 31368483 PMCID: PMC6736042 DOI: 10.1093/gbe/evz167] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Accepted: 07/29/2019] [Indexed: 02/07/2023]
Abstract
A novel test is described that visualizes the absolute model-data fit of the substitution and tree components of an evolutionary model. The test utilizes statistics based on counts of character state matches and mismatches in alignments of observed and simulated sequences. This comparison is used to assess model-data fit. In simulations conducted to evaluate the performance of the test, the test estimator was able to identify both the correct tree topology and substitution model under conditions where the Goldman-Cox test, which tests the fit of a substitution model to sequence data and is also based on comparing simulated replicates with observed data, showed high error rates. The novel test was found to identify the correct tree topology within a wide range of DNA substitution model misspecifications, indicating the high discriminatory power of the test. Use of this test provides a practical approach for assessing absolute model-data fit when testing phylogenetic hypotheses.
Affiliation(s)
- Vadim Goremykin
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, Trentino, Italy
4
Leger MM, Eme L, Stairs CW, Roger AJ. Demystifying Eukaryote Lateral Gene Transfer (Response to Martin 2017 DOI: 10.1002/bies.201700115). Bioessays 2018; 40:e1700242. [DOI: 10.1002/bies.201700242] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Received: 12/13/2017] [Revised: 02/06/2018] [Indexed: 12/28/2022]
Affiliation(s)
- Michelle M. Leger
- Institute of Evolutionary Biology (CSIC-UPF); Pg. Marítim de la Barceloneta, Barcelona ES 08003 Spain
- Laura Eme
- Department of Cell and Molecular Biology; Science for Life Laboratory; Uppsala University; Box 596, Uppsala SE 751 25 Sweden
- Courtney W. Stairs
- Department of Cell and Molecular Biology; Science for Life Laboratory; Uppsala University; Box 596, Uppsala SE 751 25 Sweden
- Andrew J. Roger
- Centre for Comparative Genomics and Evolutionary Bioinformatics; Department of Biochemistry and Molecular Biology; Dalhousie University; P.O. Box 15000, Halifax CAN B3H 4R2 Nova Scotia Canada
5
6
Abstract
Molecular evolution can reveal the relationship between sets of homologous sequences and the patterns of change that occur during their evolution. An important aspect of these studies is the inference of a phylogenetic tree, which explicitly describes evolutionary relationships between homologous sequences. This chapter provides an introduction to evolutionary trees and how to infer them from sequence data using some commonly used inferential methodology. It focuses on statistical methods for inferring trees and how to assess the confidence one should have in any resulting tree, with a particular emphasis on the underlying assumptions of the methods and how they might affect the tree estimate. There is also some discussion of the underlying algorithms used to perform tree search and recommendations regarding the performance of different algorithms. Finally, there are a few practical guidelines, including how to combine multiple software packages to improve inference, and a comparison between Bayesian and Maximum likelihood phylogenetics.
Affiliation(s)
- Simon Whelan
- Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden.
- David A Morrison
- Department of Organism Biology, Uppsala University, Uppsala, Sweden
7
Abstract
The tree of life is one of the most important organizing principles in biology(1). Gene surveys suggest the existence of an enormous number of branches(2), but even an approximation of the full scale of the tree has remained elusive. Recent depictions of the tree of life have focused either on the nature of deep evolutionary relationships(3-5) or on the known, well-classified diversity of life with an emphasis on eukaryotes(6). These approaches overlook the dramatic change in our understanding of life's diversity resulting from genomic sampling of previously unexamined environments. New methods to generate genome sequences illuminate the identity of organisms and their metabolic capacities, placing them in community and ecosystem contexts(7,8). Here, we use new genomic data from over 1,000 uncultivated and little known organisms, together with published sequences, to infer a dramatically expanded version of the tree of life, with Bacteria, Archaea and Eukarya included. The depiction is both a global overview and a snapshot of the diversity within each major lineage. The results reveal the dominance of bacterial diversification and underline the importance of organisms lacking isolated representatives, with substantial evolution concentrated in a major radiation of such organisms. This tree highlights major lineages currently underrepresented in biogeochemical models and identifies radiations that are probably important for future evolutionary analyses.
8
9
Kück P, Wägele JW. Plesiomorphic character states cause systematic errors in molecular phylogenetic analyses: a simulation study. Cladistics 2015; 32:461-478. [DOI: 10.1111/cla.12132] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Accepted: 06/15/2015] [Indexed: 01/17/2023]
Affiliation(s)
- Patrick Kück
- The Natural History Museum Cromwell Road SW7 5BD London UK
- J. Wolfgang Wägele
- Zoologisches Forschungsmuseum Alexander Koenig Adenauerallee 160 53113 Bonn Germany
10
Affiliation(s)
- Edward Susko
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada B3H 4R2
11
Parks SL, Goldman N. Maximum likelihood inference of small trees in the presence of long branches. Syst Biol 2014; 63:798-811. [PMID: 24996414 PMCID: PMC6371681 DOI: 10.1093/sysbio/syu044] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 08/14/2013] [Accepted: 06/20/2014] [Indexed: 11/14/2022]
Abstract
The statistical basis of maximum likelihood (ML), its robustness, and the fact that it appears to suffer less from biases lead to it being one of the most popular methods for tree reconstruction. Despite its popularity, very few analytical solutions for ML exist, so biases suffered by ML are not well understood. One possible bias is long branch attraction (LBA), a regularly cited term generally used to describe a propensity for long branches to be joined together in estimated trees. Although initially mentioned in connection with inconsistency of parsimony, LBA has been claimed to affect all major phylogenetic reconstruction methods, including ML. Despite the widespread use of this term in the literature, exactly what LBA is and what may be causing it is poorly understood, even for simple evolutionary models and small model trees. Studies looking at LBA have focused on the effect of two long branches on tree reconstruction. However, to understand the effect of two long branches it is also important to understand the effect of just one long branch. If ML struggles to reconstruct one long branch, then this may have an impact on LBA. In this study, we look at the effect of one long branch on three-taxon tree reconstruction. We show that, counterintuitively, long branches are preferentially placed at the tips of the tree. This can be understood through the use of analytical solutions to the ML equation and distance matrix methods. We go on to look at the placement of two long branches on four-taxon trees, showing that there is no attraction between long branches, but that for extreme branch lengths long branches are joined together disproportionally often. These results illustrate that even small model trees are still interesting to help understand how ML phylogenetic reconstruction works, and that LBA is a complicated phenomenon that deserves further study.
Affiliation(s)
- Sarah L Parks
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, United Kingdom
- Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, United Kingdom
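The abstract above refers to extreme branch lengths and to distance matrix methods. As a rough illustration of why very long branches retain little recoverable signal, the Python sketch below uses the Jukes-Cantor (JC69) model. JC69 is an assumption chosen here for simplicity and is not named in the abstract, and the function names are hypothetical. Under JC69 the expected proportion of differing sites saturates towards 0.75 as the branch length grows, so the inverse distance correction becomes increasingly unstable.

```python
import math


def jc69_expected_p(d):
    """Expected proportion of differing sites for a JC69 distance d
    (substitutions per site). Illustrative assumption, not the paper's model."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """Inverse of jc69_expected_p: JC69-corrected distance from an
    observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # As d grows, p approaches 0.75 and carries less information about d.
    for d in (0.1, 0.5, 1.0, 2.0, 5.0):
        print(d, jc69_expected_p(d))
```

Because the curve flattens near 0.75, small sampling differences in the observed proportion translate into large differences in the estimated distance for long branches.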
12
Som A. Causes, consequences and solutions of phylogenetic incongruence. Brief Bioinform 2014; 16:536-48. [PMID: 24872401 DOI: 10.1093/bib/bbu015] [Citation(s) in RCA: 91] [Impact Index Per Article: 8.3] [Received: 01/24/2014] [Accepted: 04/05/2014] [Indexed: 11/14/2022]
Abstract
Phylogenetic analysis is used to recover the evolutionary history of species, genes or proteins. Understanding phylogenetic relationships between organisms is a prerequisite of almost any evolutionary study, as contemporary species all share a common history through their ancestry. Moreover, it is important because of its wide applications that include understanding genome organization, epidemiological investigations, predicting protein functions, and deciding the genes to be analyzed in comparative studies. Despite immense progress in recent years, phylogenetic reconstruction involves many challenges that create uncertainty with respect to the true evolutionary relationships of the species or genes analyzed. One of the most notable difficulties is the widespread occurrence of incongruence among methods and also among individual genes or different genomic regions. Presence of widespread incongruence inhibits successful revealing of evolutionary relationships and applications of phylogenetic analysis. In this article, I concisely review the effect of various factors that cause incongruence in molecular phylogenies, the advances in the field that resolved some factors, and explore unresolved factors that cause incongruence along with possible ways for tackling them.
13
Wu J, Hasegawa M, Zhong Y, Yonezawa T. Importance of synonymous substitutions under dense taxon sampling and appropriate modeling in reconstructing the mitogenomic tree of Eutheria. Genes Genet Syst 2014; 89:237-51. [DOI: 10.1266/ggs.89.237] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/23/2022]
Affiliation(s)
- Jiaqi Wu
- School of Life Sciences, Fudan University
- Masami Hasegawa
- The Institute of Statistical Mathematics
- School of Life Sciences, Fudan University
- Yang Zhong
- Institute of Biodiversity Science and Geobiology, Tibet University
- School of Life Sciences, Fudan University
14
Using multiple analytical methods to improve phylogenetic hypotheses in Minaria (Apocynaceae). Mol Phylogenet Evol 2012; 65:915-25. [PMID: 22982434 DOI: 10.1016/j.ympev.2012.08.019] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 09/08/2011] [Revised: 07/25/2012] [Accepted: 08/17/2012] [Indexed: 11/23/2022]
Abstract
Metastelmatinae is a neotropical subtribe of Asclepiadoideae (Apocynaceae), comprising 13 genera and around 260 species whose phylogenetic relationships are often unresolved or incongruent between plastid and nuclear datasets. The genus Minaria is one of the first lineages to emerge in the Metastelmatinae and is highly supported based on plastid markers. It comprises 21 species, most of which are endemic to small areas with open vegetation in the Espinhaço Range, Brazil. In the work presented here, we use plastid (rps16, trnH-psbA, trnS-trnG, and trnD-trnT) and nuclear (ITS and ETS) datasets to investigate the relationships within Minaria. We show that the three methods mostly used in phylogenetic studies, namely, maximum parsimony, maximum likelihood, and Bayesian Inference, have different performances and that a pluralistic analytical approach combining results from them can increase tree resolution and clade confidence, providing valuable phylogenetic information.
15
Müller M, Mentel M, van Hellemond JJ, Henze K, Woehle C, Gould SB, Yu RY, van der Giezen M, Tielens AGM, Martin WF. Biochemistry and evolution of anaerobic energy metabolism in eukaryotes. Microbiol Mol Biol Rev 2012; 76:444-95. [PMID: 22688819 PMCID: PMC3372258 DOI: 10.1128/mmbr.05024-11] [Citation(s) in RCA: 517] [Impact Index Per Article: 39.8] [Indexed: 11/20/2022]
Abstract
Major insights into the phylogenetic distribution, biochemistry, and evolutionary significance of organelles involved in ATP synthesis (energy metabolism) in eukaryotes that thrive in anaerobic environments for all or part of their life cycles have accrued in recent years. All known eukaryotic groups possess an organelle of mitochondrial origin, mapping the origin of mitochondria to the eukaryotic common ancestor, and genome sequence data are rapidly accumulating for eukaryotes that possess anaerobic mitochondria, hydrogenosomes, or mitosomes. Here we review the available biochemical data on the enzymes and pathways that eukaryotes use in anaerobic energy metabolism and summarize the metabolic end products that they generate in their anaerobic habitats, focusing on the biochemical roles that their mitochondria play in anaerobic ATP synthesis. We present metabolic maps of compartmentalized energy metabolism for 16 well-studied species. There are currently no enzymes of core anaerobic energy metabolism that are specific to any of the six eukaryotic supergroup lineages; genes present in one supergroup are also found in at least one other supergroup. The gene distribution across lineages thus reflects the presence of anaerobic energy metabolism in the eukaryote common ancestor and differential loss during the specialization of some lineages to oxic niches, just as oxphos capabilities have been differentially lost in specialization to anoxic niches and the parasitic life-style. Some facultative anaerobes have retained both aerobic and anaerobic pathways. Diversified eukaryotic lineages have retained the same enzymes of anaerobic ATP synthesis, in line with geochemical data indicating low environmental oxygen levels while eukaryotes arose and diversified.
Affiliation(s)
- Marek Mentel
- Department of Biochemistry, Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia
- Jaap J. van Hellemond
- Department of Medical Microbiology and Infectious Diseases, Erasmus University Medical Center, Rotterdam, Netherlands
- Katrin Henze
- Institute of Molecular Evolution, University of Düsseldorf, Düsseldorf, Germany
- Christian Woehle
- Institute of Molecular Evolution, University of Düsseldorf, Düsseldorf, Germany
- Sven B. Gould
- Institute of Molecular Evolution, University of Düsseldorf, Düsseldorf, Germany
- Re-Young Yu
- Institute of Molecular Evolution, University of Düsseldorf, Düsseldorf, Germany
- Mark van der Giezen
- Biosciences, College of Life and Environmental Sciences, University of Exeter, Exeter, United Kingdom
- Aloysius G. M. Tielens
- Department of Medical Microbiology and Infectious Diseases, Erasmus University Medical Center, Rotterdam, Netherlands
- William F. Martin
- Institute of Molecular Evolution, University of Düsseldorf, Düsseldorf, Germany
16
17
Williams D, Fournier GP, Lapierre P, Swithers KS, Green AG, Andam CP, Gogarten JP. A rooted net of life. Biol Direct 2011; 6:45. [PMID: 21936906 PMCID: PMC3189188 DOI: 10.1186/1745-6150-6-45] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 02/07/2011] [Accepted: 09/21/2011] [Indexed: 01/29/2023]
Abstract
Phylogenetic reconstruction using DNA and protein sequences has allowed the reconstruction of evolutionary histories encompassing all life. We present and discuss a means to incorporate much of this rich narrative into a single model that acknowledges the discrete evolutionary units that constitute the organism. Briefly, this Rooted Net of Life genome phylogeny is constructed around an initial, well resolved and rooted tree scaffold inferred from a supermatrix of combined ribosomal genes. Extant sampled ribosomes form the leaves of the tree scaffold. These leaves, but not necessarily the deeper parts of the scaffold, can be considered to represent a genome or pan-genome, and to be associated with members of other gene families within that sequenced (pan)genome. Unrooted phylogenies of gene families containing four or more members are reconstructed and superimposed over the scaffold. Initially, reticulations are formed where incongruities between topologies exist. Given sufficient evidence, edges may then be differentiated as those representing vertical lines of inheritance within lineages and those representing horizontal genetic transfers or endosymbioses between lineages. Reviewers: W. Ford Doolittle, Eric Bapteste and Robert Beiko.
Affiliation(s)
- David Williams
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3125, USA.
18
Wang HC, Susko E, Roger AJ. Fast statistical tests for detecting heterotachy in protein evolution. Mol Biol Evol 2011; 28:2305-15. [PMID: 21343603 DOI: 10.1093/molbev/msr050] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 01/20/2023]
Abstract
The w statistic introduced by Lockhart et al. (1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol Biol Evol. 15:1183-1188) is a simple and easily calculated statistic intended to detect heterotachy by comparing amino acid substitution patterns between two monophyletic groups of protein sequences. It is defined as the difference between the fraction of varied sites in both groups and the fraction of varied sites in each group. The w test has been used to distinguish a covarion process from equal rates and rates variation across sites processes. Using simulation we show that the w test is effective for small data sets and for data sets that have low substitution rates in the groups but can have difficulties when these conditions are not met. Using site entropy as a measure of variability of a sequence site, we modify the w statistic to a w' statistic by assigning as varied in one group those sites that are actually varied in both groups but have a large entropy difference. We show that the w' test has more power to detect two kinds of heterotachy processes (covarion and bivariate rate shifts) in large and variable data. We also show that a test of Pearson's correlation of the site entropies between two monophyletic groups can be used to detect heterotachy and has more power than the w' test. Furthermore, we demonstrate that there are settings where the correlation test as well as w and w' tests do not detect heterotachy signals in data simulated under a branch length mixture model. In such cases, it is sometimes possible to detect heterotachy through subselection of appropriate taxa. Finally, we discuss the abilities of the three statistical tests to detect a fourth mode of heterotachy: lineage-specific changes in proportion of variable sites.
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada.
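The entropy-correlation idea in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy alignments, and the choice of natural-log Shannon entropy are all hypothetical, and the significance assessment used in the paper is not reproduced. The sketch computes per-site entropies for two groups of aligned sequences and the Pearson correlation between the two entropy vectors.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (natural log) of the residues in one alignment column."""
    n = len(column)
    counts = Counter(column)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


def site_entropies(alignment):
    """Per-site entropies for a list of equal-length aligned sequences."""
    return [site_entropy(col) for col in zip(*alignment)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one vector is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy alignments: site 2 varies only in group A, site 4 only in group B.
group_a = ["MKLV", "MRLV", "MHLV"]
group_b = ["MKLI", "MKLV", "MKLA"]

r = pearson(site_entropies(group_a), site_entropies(group_b))
print(r)
```

In this toy case the variable site differs between the two groups, so the correlation is negative. A high positive correlation would instead indicate that the same sites are variable in both groups, which is what one expects when site rates are stable across the tree.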
19
Roure B, Philippe H. Site-specific time heterogeneity of the substitution process and its impact on phylogenetic inference. BMC Evol Biol 2011; 11:17. [PMID: 21235782 PMCID: PMC3034684 DOI: 10.1186/1471-2148-11-17] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Received: 09/15/2010] [Accepted: 01/14/2011] [Indexed: 11/13/2022]
Abstract
Background: Model violations constitute the major limitation in inferring accurate phylogenies. Characterizing properties of the data that are not being correctly handled by current models is therefore of prime importance. One of the properties of protein evolution is the variation of the relative rate of substitutions across sites and over time; the latter is the phenomenon called heterotachy. Its effect on phylogenetic inference has recently obtained considerable attention, which led to the development of new models of sequence evolution. However, thus far focus has been on the quantitative heterogeneity of the evolutionary process, thereby overlooking more qualitative variations. Results: We studied the importance of variation of the site-specific amino-acid substitution process over time and its possible impact on phylogenetic inference. We used the CAT model to define an infinite mixture of substitution processes characterized by equilibrium frequencies over the twenty amino acids, a useful proxy for qualitatively estimating the evolutionary process. Using two large datasets, we show that qualitative changes in site-specific substitution properties over time occurred significantly. To test whether this unaccounted qualitative variation can lead to an erroneous phylogenetic tree, we analyzed a concatenation of mitochondrial proteins in which Cnidaria and Porifera were erroneously grouped. The progressive removal of the sites with the most heterogeneous CAT profiles across clades led to the recovery of the monophyly of Eumetazoa (Cnidaria+Bilateria), suggesting that this heterogeneity can negatively influence phylogenetic inference. Conclusion: The time-heterogeneity of the amino-acid replacement process is therefore an important evolutionary aspect that should be incorporated in future models of sequence change.
Affiliation(s)
- Béatrice Roure
- Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Succursale Centre-Ville, Québec, Canada
20
Whelan S, Blackburne BP, Spencer M. Phylogenetic substitution models for detecting heterotachy during plastid evolution. Mol Biol Evol 2010; 28:449-58. [PMID: 20724379 DOI: 10.1093/molbev/msq215] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022]
Abstract
There is widespread evidence of lineage-specific rate variation, known as heterotachy, during protein evolution. Changes in the structural and functional constraints acting on a protein can lead to heterotachy, and it is plausible that such changes, known as covarion shifts, may affect many amino acids at once. Several previous attempts to model heterotachy have used covarion models, where the sequence undergoes covarion drift, whereby each site may switch independently among a set of discrete classes having different substitution rates. However, such independent switching may not capture biologically important events where the selective forces acting on a protein affect many sites at once. We describe a new class of models that allow the rates of substitution and switching to vary among branches of a phylogenetic tree. Such models are better able to handle covarion shifts. We apply these models to a set of genes occurring in nonphotosynthetic bacteria, cyanobacteria, and the plastids of green and red algae. We find that 4/5 genes show evidence of some form of rate switching and that 3/5 genes show evidence that the relative switching rate differs among taxonomic groups. We conclude that covarion shifts may be frequent during the deep evolution of plastid genes and that our methodology may provide a powerful new tool for investigating such shifts in other systems.
Affiliation(s)
- Simon Whelan
- Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom.
21
Evans NM, Holder MT, Barbeitos MS, Okamura B, Cartwright P. The phylogenetic position of Myxozoa: exploring conflicting signals in phylogenomic and ribosomal data sets. Mol Biol Evol 2010; 27:2733-46. [PMID: 20576761 DOI: 10.1093/molbev/msq159] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.1] [Indexed: 01/07/2023]
Abstract
Myxozoans are a diverse group of microscopic endoparasites that have been the focus of much controversy regarding their phylogenetic position. Two dramatically different hypotheses have been put forward regarding the placement of Myxozoa within Metazoa. One hypothesis, supported by ribosomal DNA (rDNA) data, place Myxozoa as a sister taxon to Bilateria. The alternative hypothesis, supported by phylogenomic data and morphology, place Myxozoa within Cnidaria. Here, we investigate these conflicting hypotheses and explore the effects of missing data, model choice, and inference methods, all of which can have an effect in placing highly divergent taxa. In addition, we identify subsets of the data that most influence the placement of Myxozoa and explore their effects by removing them from the data sets. Assembling the largest taxonomic sampling of myxozoans and cnidarians to date, with a comprehensive sampling of other metazoans for 18S and 28S nuclear rDNA sequences, we recover a well-supported placement of Myxozoa as an early diverging clade of Bilateria. By conducting parametric bootstrapping, we find that the bilaterian placement of Buddenbrockia could not alone be explained by long-branch attraction. After trimming a published phylogenomic data set, to circumvent problems of missing data, we recover the myxozoan Buddenbrockia plumatellae as a medusozoan cnidarian. In further explorations of these data sets, we find that removal of just a few identified sites under a maximum likelihood criterion employing the Whelan and Goldman amino acid substitution model changes the placement of Buddenbrockia from within Cnidaria to the alternative hypothesis at the base of Bilateria. Under a Bayesian criterion employing the CAT model, the cnidarian placement is more resilient to data removal, but under one test, a well-supported early diverging bilaterian position for Buddenbrockia is recovered. 
Our results confirm the existence of two relatively stable placements for myxozoans and demonstrate that conflicting signal exists not only between the two types of data but also within the phylogenomic data set. These analyses underscore the importance of careful model selection, taxon and data sampling, and in-depth data exploration when investigating the phylogenetic placement of highly divergent taxa.
Affiliation(s)
- Nathaniel M Evans
- Department of Ecology and Evolutionary Biology, University of Kansas, USA
22
|
Shavit Grievink L, Penny D, Hendy MD, Holland BR. Phylogenetic tree reconstruction accuracy and model fit when proportions of variable sites change across the tree. Syst Biol 2010; 59:288-97. [PMID: 20525636 PMCID: PMC2850392 DOI: 10.1093/sysbio/syq003] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/30/2022]
Abstract
Commonly used phylogenetic models assume a homogeneous process through time in all parts of the tree. However, it is known that these models can be too simplistic as they do not account for nonhomogeneous lineage-specific properties. In particular, it is now widely recognized that as constraints on sequences evolve, the proportion and positions of variable sites can vary between lineages, causing heterotachy. The extent to which this model misspecification affects tree reconstruction is still unknown. Here, we evaluate the effect of changes in the proportions and positions of variable sites on model fit and tree estimation. We consider 5 current models of nucleotide sequence evolution in a Bayesian Markov chain Monte Carlo framework as well as maximum parsimony (MP). We show that for a tree with 4 lineages where 2 nonsister taxa undergo a change in the proportion of variable sites, tree reconstruction under the best-fitting model, which is chosen using a relative test, often results in the wrong tree. In this case, we found that an absolute test of model fit is a better predictor of tree estimation accuracy. We also found further evidence that MP is not immune to heterotachy. In addition, we show that increased sampling of taxa that have undergone a change in proportion and positions of variable sites is critical for accurate tree reconstruction.
Affiliation(s)
- Liat Shavit Grievink
- Institut für Botanik III, Heinrich-Heine Universität, Universitätstrasse 1, Düsseldorf, Germany.
23
|
Schwartz RS, Mueller RL. Limited effects of among-lineage rate variation on the phylogenetic performance of molecular markers. Mol Phylogenet Evol 2010; 54:849-56. [PMID: 20045073 DOI: 10.1016/j.ympev.2009.12.025] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/24/2009] [Revised: 12/03/2009] [Accepted: 12/24/2009] [Indexed: 10/20/2022]
Abstract
Variation in substitution rates among evolutionary lineages (among-lineage rate variation or ALRV) has been reported to negatively affect the estimation of phylogenies. When the substitution processes underlying ALRV are modeled inadequately, non-sister taxa with similar substitution rates are estimated incorrectly as sister species due to long-branch attraction. Recent advances in modeling site-specific rate variation (heterotachy) have reduced the impacts of ALRV on phylogeny estimation in several empirical and simulated datasets. However, the addition of parameters to the substitution model reduces power to estimate each parameter correctly, which can also lead to incorrect phylogeny estimation. A potential solution to this problem is to identify the levels of ALRV that negatively impact phylogeny estimation such that molecular markers with non-deleterious levels of ALRV can be identified. To this end, we used analyses of empirical and simulated gene datasets to evaluate whether levels of ALRV identified in a mitochondrial genomic dataset for salamanders negatively impacted phylogeny estimation. We simulated data with and without ALRV, holding all other evolutionary parameters constant, and compared the phylogenetic performance of both simulated and empirical datasets. Overall, we found limited, positive effects of ALRV on phylogeny estimation in this dataset, the majority of which resulted from an increase in substitution rate on short branches. We conclude that ALRV does not always negatively impact phylogeny estimation. Therefore, ALRV can likely be disregarded as a criterion for marker selection in comparable phylogenetic studies.
Affiliation(s)
- Rachel S Schwartz
- Department of Biology, Colorado State University, Fort Collins, CO 80523-1878, USA.
24
|
Kolaczkowski B, Thornton JW. Long-branch attraction bias and inconsistency in Bayesian phylogenetics. PLoS One 2009; 4:e7891. [PMID: 20011052 PMCID: PMC2785476 DOI: 10.1371/journal.pone.0007891] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.3] [Received: 09/08/2009] [Accepted: 10/12/2009] [Indexed: 11/24/2022]
Abstract
Bayesian inference (BI) of phylogenetic relationships uses the same probabilistic models of evolution as its precursor maximum likelihood (ML), so BI has generally been assumed to share ML's desirable statistical properties, such as largely unbiased inference of topology given an accurate model and increasingly reliable inferences as the amount of data increases. Here we show that BI, unlike ML, is biased in favor of topologies that group long branches together, even when the true model and prior distributions of evolutionary parameters over a group of phylogenies are known. Using experimental simulation studies and numerical and mathematical analyses, we show that this bias becomes more severe as more data are analyzed, causing BI to infer an incorrect tree as the maximum a posteriori phylogeny with asymptotically high support as sequence length approaches infinity. BI's long branch attraction bias is relatively weak when the true model is simple but becomes pronounced when sequence sites evolve heterogeneously, even when this complexity is incorporated in the model. This bias--which is apparent under both controlled simulation conditions and in analyses of empirical sequence data--also makes BI less efficient and less robust to the use of an incorrect evolutionary model than ML. Surprisingly, BI's bias is caused by one of the method's stated advantages--that it incorporates uncertainty about branch lengths by integrating over a distribution of possible values instead of estimating them from the data, as ML does. Our findings suggest that trees inferred using BI should be interpreted with caution and that ML may be a more reliable framework for modern phylogenetic analysis.
Affiliation(s)
- Bryan Kolaczkowski
- Center for Ecology and Evolutionary Biology, University of Oregon, Eugene, Oregon, United States of America
- Joseph W. Thornton
- Center for Ecology and Evolutionary Biology, University of Oregon, Eugene, Oregon, United States of America
- Howard Hughes Medical Institute, University of Oregon, Eugene, Oregon, United States of America
25
|
Zhou Y, Brinkmann H, Rodrigue N, Lartillot N, Philippe H. A Dirichlet Process Covarion Mixture Model and Its Assessments Using Posterior Predictive Discrepancy Tests. Mol Biol Evol 2009; 27:371-84. [DOI: 10.1093/molbev/msp248] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Indexed: 11/13/2022]
26
|
Wang HC, Susko E, Roger AJ. PROCOV: maximum likelihood estimation of protein phylogeny under covarion models and site-specific covarion pattern analysis. BMC Evol Biol 2009; 9:225. [PMID: 19737395 PMCID: PMC2758850 DOI: 10.1186/1471-2148-9-225] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 03/24/2009] [Accepted: 09/08/2009] [Indexed: 11/12/2022]
Abstract
Background: The covarion hypothesis of molecular evolution holds that selective pressures on a given amino acid or nucleotide site are dependent on the identity of other sites in the molecule that change throughout time, resulting in changes of evolutionary rates of sites along the branches of a phylogenetic tree. At the sequence level, covarion-like evolution at a site manifests as conservation of nucleotide or amino acid states among some homologs where the states are not conserved in other homologs (or groups of homologs). Covarion-like evolution has been shown to relate to changes in functions at sites in different clades, and, if ignored, can adversely affect the accuracy of phylogenetic inference. Results: PROCOV (protein covarion analysis) is a software tool that implements a number of previously proposed covarion models of protein evolution for phylogenetic inference in a maximum likelihood framework. Several algorithmic and implementation improvements in this tool over previous versions make computationally expensive tree searches with covarion models more efficient and analyses of large phylogenomic data sets tractable. PROCOV can be used to identify covarion sites by comparing the site likelihoods under the covarion process to the corresponding site likelihoods under a rates-across-sites (RAS) process. Those sites with the greatest log-likelihood difference between a 'covarion' and an RAS process were found to be of functional or structural significance in a dataset of bacterial and eukaryotic elongation factors. Conclusion: Covarion models implemented in PROCOV may be especially useful for phylogenetic estimation when ancient divergences between sequences have occurred and rates of evolution at sites are likely to have changed over the tree. It can also be used to study lineage-specific functional shifts in protein families that result in changes in the patterns of site variability among subtrees.
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada.
27
|
Abstract
Heterotachy is a general term to describe positions in a sequence that evolve at different rates in different lineages. Kolaczkowski and Thornton (2004. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431:980-984.) recently described an intriguing heterotachy model that leads to topological bias for likelihood-based methods and parsimony methods. In this article, we show that heterotachy can generally be viewed as multivariate rates-across-sites variation, which can be described as randomly drawing rates (or branch lengths) from a multivariate distribution for each branch at each site. Motivated by this idea, we propose a pairwise alpha heterotachy adjustment model, which gives us much improved topological estimation in the settings by Kolaczkowski and Thornton (2004).
Affiliation(s)
- Jihua Wu
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada.
28
|
Blanga-Kanfi S, Miranda H, Penn O, Pupko T, DeBry RW, Huchon D. Rodent phylogeny revised: analysis of six nuclear genes from all major rodent clades. BMC Evol Biol 2009; 9:71. [PMID: 19341461 PMCID: PMC2674048 DOI: 10.1186/1471-2148-9-71] [Citation(s) in RCA: 185] [Impact Index Per Article: 11.6] [Received: 06/04/2008] [Accepted: 04/02/2009] [Indexed: 11/23/2022]
Abstract
BACKGROUND: Rodentia is the most diverse order of placental mammals, with extant rodent species representing about half of all placental diversity. In spite of many morphological and molecular studies, the family-level relationships among rodents and the location of the rodent root are still debated. Although various datasets have already been analyzed to solve rodent phylogeny at the family level, these are difficult to combine because they involve different taxa and genes. RESULTS: We present here the largest protein-coding dataset used to study rodent relationships. It comprises six nuclear genes, 41 rodent species, and eight outgroups. Our phylogenetic reconstructions strongly support the division of Rodentia into three clades: (1) a "squirrel-related clade", (2) a "mouse-related clade", and (3) Ctenohystrica. Almost all evolutionary relationships within these clades are also highly supported. The primary remaining uncertainty is the position of the root. The application of various models and techniques aimed at removing non-phylogenetic signal was unable to solve the basal rodent trifurcation. CONCLUSION: Sequencing and analyzing a large sequence dataset enabled us to resolve most of the evolutionary relationships among Rodentia. Our findings suggest that the uncertainty regarding the position of the rodent root reflects the rapid rodent radiation that occurred in the Paleocene rather than the presence of conflicting phylogenetic and non-phylogenetic signals in the dataset.
Affiliation(s)
- Shani Blanga-Kanfi
- Department of Zoology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel.
29
|
Clemente JC, Ikeo K, Valiente G, Gojobori T. Optimized ancestral state reconstruction using Sankoff parsimony. BMC Bioinformatics 2009; 10:51. [PMID: 19200389 PMCID: PMC2677398 DOI: 10.1186/1471-2105-10-51] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 11/27/2008] [Accepted: 02/07/2009] [Indexed: 11/24/2022]
Abstract
BACKGROUND: Parsimony methods are widely used in molecular evolution to estimate the most plausible phylogeny for a set of characters. Sankoff parsimony determines the minimum number of changes required in a given phylogeny when a cost is associated with transitions between character states. Although optimizations exist to reduce the computations in the number of taxa, the original algorithm takes time O(n^2) in the number of states, making it impractical for large values of n. RESULTS: In this study we introduce an optimization of Sankoff parsimony for the reconstruction of ancestral states when ultrametric or additive cost matrices are used. We analyzed its performance for randomly generated matrices, Jukes-Cantor and Kimura's two-parameter models of DNA evolution, and in the reconstruction of elongation factor-1alpha and ancestral metabolic states of a group of eukaryotes, showing that in all cases the execution time is significantly less than with the original implementation. CONCLUSION: The algorithms presented here provide a fast computation of Sankoff parsimony for a given phylogeny. Problems where the number of states is large, such as reconstruction of ancestral metabolism, are particularly well suited to this optimization. Since we are reducing the computations required to calculate the parsimony cost of a single tree, our method can be combined with optimizations in the number of taxa that aim at finding the most parsimonious tree.
Affiliation(s)
- José C Clemente
- Center for Information Biology and DNA Databank of Japan, National Institute of Genetics, Yata 1111, Mishima, Japan
- Kazuho Ikeo
- Center for Information Biology and DNA Databank of Japan, National Institute of Genetics, Yata 1111, Mishima, Japan
- Takashi Gojobori
- Center for Information Biology and DNA Databank of Japan, National Institute of Genetics, Yata 1111, Mishima, Japan
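For orientation, the classic Sankoff computation that the Clemente et al. abstract above takes as its starting point can be sketched as follows. This is a minimal illustration of the original, unoptimized algorithm on a fixed rooted tree, not the optimization the paper introduces. The function name `sankoff`, the dictionary-based tree encoding, and the example costs are all assumptions made for this sketch.

```python
from math import inf


def sankoff(tree, root, leaf_states, states, cost):
    """Minimum weighted parsimony cost of one character on a fixed rooted tree.

    tree        : dict mapping each internal node to a list of its children
    leaf_states : dict mapping each leaf to its observed state
    states      : iterable of all possible states
    cost        : cost[a][b] is the cost of changing a -> b along one branch
    Returns (minimum cost, dict of per-node cost vectors).
    """
    vectors = {}

    def visit(node):
        children = tree.get(node, [])
        if not children:
            # Leaf: the observed state costs nothing, every other state is impossible.
            vectors[node] = {s: (0 if s == leaf_states[node] else inf) for s in states}
            return
        for child in children:
            visit(child)
        vec = {}
        for s in states:
            total = 0
            for child in children:
                # For every parent state, scan every child state:
                # this double loop is the quadratic cost in the number of states.
                total += min(cost[s][t] + vectors[child][t] for t in states)
            vec[s] = total
        vectors[node] = vec

    visit(root)
    return min(vectors[root].values()), vectors


# Hypothetical example: four leaves, transitions cost 1, transversions cost 2.
STATES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
COST = {a: {b: 0 if a == b else (1 if (a, b) in TRANSITIONS else 2) for b in STATES}
        for a in STATES}
example_tree = {"root": ["x", "y"], "x": ["l1", "l2"], "y": ["l3", "l4"]}
example_leaves = {"l1": "A", "l2": "G", "l3": "A", "l4": "C"}

best, vectors = sankoff(example_tree, "root", example_leaves, STATES, COST)
print(best)
```

In this toy case the minimum cost is 3: one transition (A to G) plus one transversion (A to C). The inner `min` over child states for each parent state is the step whose cost grows quadratically with the number of states.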
30
|
Walsh DA, Sharma AK. Molecular phylogenetics: testing evolutionary hypotheses. Methods Mol Biol 2009; 502:131-168. [PMID: 19082555 DOI: 10.1007/978-1-60327-565-1_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
A common approach for investigating evolutionary relationships between genes and organisms is to compare extant DNA or protein sequences and infer an evolutionary tree. This methodology is known as molecular phylogenetics and may be the most informative means for exploring phage evolution, since there are few morphological features that can be used to differentiate between these tiny biological entities. In addition, phage genomes can be mosaic, meaning different genes or genomic regions can exhibit conflicting evolutionary histories due to lateral gene transfer or homologous recombination between different phage genomes. Molecular phylogenetics can be used to identify and study such genome mosaicism. This chapter provides a general introduction to the theory and methodology used to reconstruct phylogenetic relationships from molecular data. Also included is a discussion on how the evolutionary history of different genes within the same set of genomes can be compared, using a collection of T4-type phage genomes as an example. A compilation of programs and packages that are available for conducting phylogenetic analyses is supplied as an accompanying appendix.
Affiliation(s)
- David A Walsh
- Department of Biochemistry and Molecular Biology, Dalhousie University, Nova Scotia, Canada
31
|
Wang HC, Li K, Susko E, Roger AJ. A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny. BMC Evol Biol 2008; 8:331. [PMID: 19087270 PMCID: PMC2628903 DOI: 10.1186/1471-2148-8-331] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.8] [Received: 09/05/2008] [Accepted: 12/16/2008] [Indexed: 11/25/2022]
Abstract
Background: Widely used substitution models for proteins, such as the Jones-Taylor-Thornton (JTT) or Whelan and Goldman (WAG) models, are based on empirical amino acid interchange matrices estimated from databases of protein alignments that incorporate the average amino acid frequencies of the data set under examination (e.g., JTT + F). Variation in the evolutionary process between sites is typically modelled by a rates-across-sites distribution such as the gamma (Γ) distribution. However, sites in proteins also vary in the kinds of amino acid interchanges that are favoured, a feature that is ignored by standard empirical substitution matrices. Here we examine the degree to which the pattern of evolution at sites differs from that expected based on empirical amino acid substitution models and evaluate the impact of these deviations on phylogenetic estimation. Results: We analyzed 21 large protein alignments with two statistical tests designed to detect deviation of site-specific amino acid distributions from data simulated under the standard empirical substitution model: JTT + F + Γ. We found that the number of states at a given site is, on average, smaller and the frequencies of these states are less uniform than expected based on a JTT + F + Γ substitution model. With a four-taxon example, we show that phylogenetic estimation under the JTT + F + Γ model is seriously biased by a long-branch attraction artefact if the data are simulated under a model utilizing the observed site-specific amino acid frequencies from an alignment. Principal components analyses indicate the existence of at least four major site-specific frequency classes in these 21 protein alignments. Using a mixture model with these four separate classes of site-specific state frequencies plus a fifth class of global frequencies (the JTT + cF + Γ model), significant improvements in model fit for real data sets can be achieved.
This simple mixture model also reduces the long-branch attraction problem, as shown by simulations and analyses of a real phylogenomic data set. Conclusion: Protein families display site-specific evolutionary dynamics that are ignored by standard protein phylogenetic models. Accurate estimation of protein phylogenies requires models that accommodate the heterogeneity in the evolutionary process across sites. To this end, we have implemented a class frequency mixture model (cF) in a freely available program called QmmRAxML for phylogenetic estimation.
Affiliation(s)
- Huai-Chun Wang
- Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, N.S., B3H 1X5, Canada.
32
|
Abstract
The origin of the eukaryotic genetic apparatus is thought to be central to understanding the evolution of the eukaryotic cell. Disagreement about the source of the relevant genes has spawned competing hypotheses for the origins of the eukaryote nuclear lineage. The iconic rooted 3-domains tree of life shows eukaryotes and archaebacteria as separate groups that share a common ancestor to the exclusion of eubacteria. By contrast, the eocyte hypothesis has eukaryotes originating within the archaebacteria and sharing a common ancestor with a particular group called the Crenarchaeota or eocytes. Here, we have investigated the relative support for each hypothesis from analysis of 53 genes spanning the 3 domains, including essential components of the eukaryotic nucleic acid replication, transcription, and translation apparatus. As an important component of our analysis, we investigated the fit between model and data with respect to composition. Compositional heterogeneity is a pervasive problem for reconstruction of ancient relationships, which, if ignored, can produce an incorrect tree with strong support. To mitigate its effects, we used phylogenetic models that allow for changing nucleotide or amino acid compositions over the tree and data. Our analyses favor a topology that supports the eocyte hypothesis rather than archaebacterial monophyly and the 3-domains tree of life.
33
|
Shavit Grievink L, Penny D, Hendy MD, Holland BR. LineageSpecificSeqgen: generating sequence data with lineage-specific variation in the proportion of variable sites. BMC Evol Biol 2008; 8:317. [PMID: 19021917 PMCID: PMC2613921 DOI: 10.1186/1471-2148-8-317] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 08/07/2008] [Accepted: 11/21/2008] [Indexed: 11/10/2022]
Abstract
Background: Commonly used phylogenetic models assume a homogeneous evolutionary process throughout the tree. It is known that these homogeneous models are often too simplistic, and that with time some properties of the evolutionary process can change (due to selection or drift). In particular, as constraints on sequences evolve, the proportion of variable sites can vary between lineages. This affects the ability of phylogenetic methods to correctly estimate phylogenetic trees, especially for long timescales. To date there is no phylogenetic model that allows for change in the proportion of variable sites, and the degree to which this affects phylogenetic reconstruction is unknown. Results: We present LineageSpecificSeqgen, an extension to the seq-gen program that allows generation of sequences with both changes in the proportion of variable sites and changes in the rate at which sites switch between being variable and invariable. In contrast to seq-gen and its derivatives to date, we interpret branch lengths as the mean number of substitutions per variable site, as opposed to the mean number of substitutions per site (which is averaged over all sites, including invariable sites). This allows specification of the substitution rates of variable sites, independently of the proportion of invariable sites. Conclusion: LineageSpecificSeqgen allows simulation of DNA and amino acid sequence alignments under a lineage-specific evolutionary process. The program can be used to test current models of evolution on sequences that have undergone lineage-specific evolution. It facilitates the development of both new methods to identify such processes in real data, and means to account for such processes. The program is available at: http://awcmee.massey.ac.nz/downloads.htm.
Affiliation(s)
- Liat Shavit Grievink
- The Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Private Bag 11 222, Palmerston North, New Zealand.
34
|
Kostka M, Uzlikova M, Cepicka I, Flegr J. SlowFaster, a user-friendly program for slow-fast analysis and its application on phylogeny of Blastocystis. BMC Bioinformatics 2008; 9:341. [PMID: 18702831 PMCID: PMC2529323 DOI: 10.1186/1471-2105-9-341] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Received: 10/04/2007] [Accepted: 08/15/2008] [Indexed: 11/15/2022]
Abstract
Background: Slow-fast analysis is a simple and effective method to reduce the influence of substitution saturation, one of the causes of phylogenetic noise and long branch attraction (LBA) artifacts. In several steps of increasing stringency, the slow-fast analysis omits the fastest substituting alignment positions from the analysed dataset and thus increases its signal/noise ratio. Results: Our program SlowFaster automates the process of assessing the substitution rate of the alignment positions and the process of producing new alignments by deleting the saturated positions. Its use is very simple. It goes through the whole process in several steps: data input – necessary choices – production of new alignments. Conclusion: SlowFaster is a user-friendly tool providing new alignments prepared with slow-fast analysis. These data can be used for further phylogenetic analyses with lower risk of long branch attraction artifacts.
Affiliation(s)
- Martin Kostka
- Department of Parasitology, Faculty of Science, Charles University, Vinicna 7, 128 44 Prague, Czech Republic.
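The filtering idea summarized in the Kostka et al. abstract above (dropping the fastest-changing alignment columns in steps of increasing stringency) might be sketched roughly as below. This is not SlowFaster's own procedure: the abstract does not say how the program estimates per-position rates, so the sketch assumes a crude proxy, the number of distinct residues in a column minus one. The function name `slow_fast_alignments` and the toy data are hypothetical.

```python
def slow_fast_alignments(alignment, thresholds):
    """Produce one filtered alignment per stringency threshold.

    alignment  : dict mapping taxon name -> aligned sequence (all the same length)
    thresholds : iterable of maximum allowed rate proxies, e.g. [3, 1, 0]
    Returns a dict mapping each threshold to a filtered alignment.
    """
    names = list(alignment)
    length = len(alignment[names[0]])

    # Assumed rate proxy: distinct residues per column minus one,
    # ignoring gap and missing-data symbols.
    rates = []
    for i in range(length):
        residues = {alignment[name][i] for name in names} - {"-", "?"}
        rates.append(max(len(residues) - 1, 0))

    filtered = {}
    for limit in thresholds:
        # Keep only the columns whose proxy does not exceed the current limit.
        keep = [i for i, rate in enumerate(rates) if rate <= limit]
        filtered[limit] = {
            name: "".join(alignment[name][i] for i in keep) for name in names
        }
    return filtered


# Hypothetical toy alignment: column 3 (0-based) is the most variable.
toy = {"t1": "ACGTA", "t2": "ACGAA", "t3": "ACTCA", "t4": "AGTGA"}
steps = slow_fast_alignments(toy, [3, 1, 0])
print(steps[1]["t1"])
```

Each successive threshold yields a shorter alignment that can be analysed separately. Comparing the resulting trees across thresholds is how such an analysis would show whether a grouping depends on the fastest sites.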
35
|
Kolaczkowski B, Thornton JW. A mixed branch length model of heterotachy improves phylogenetic accuracy. Mol Biol Evol 2008; 25:1054-66. [PMID: 18319244 PMCID: PMC3299401 DOI: 10.1093/molbev/msn042] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.4] [Accepted: 02/04/2008] [Indexed: 11/14/2022]
Abstract
Evolutionary relationships are typically inferred from molecular sequence data using a statistical model of the evolutionary process. When the model accurately reflects the underlying process, probabilistic phylogenetic methods recover the correct relationships with high accuracy. There is ample evidence, however, that models commonly used today do not adequately reflect real-world evolutionary dynamics. Virtually all contemporary models assume that relatively fast-evolving sites are fast across the entire tree, whereas slower sites always evolve at relatively slower rates. Many molecular sequences, however, exhibit site-specific changes in evolutionary rates, called "heterotachy." Here we examine the accuracy of 2 phylogenetic methods for incorporating heterotachy, the mixed branch length model--which incorporates site-specific rate changes by summing likelihoods over multiple sets of branch lengths on the same tree--and the covarion model, which uses a hidden Markov process to allow sites to switch between variable and invariable as they evolve. Under a variety of simple heterogeneous simulation conditions, the mixed model was dramatically more accurate than homotachous models, which were subject to topological biases as well as biases in branch length estimates. When data were simulated with strong versions of the types of heterotachy observed in real molecular sequences, the mixed branch length model was more accurate than homotachous techniques. Analyses of empirical data sets confirmed that the mixed branch length model can improve phylogenetic accuracy under conditions that cause homotachous models to fail. In contrast, the covarion model did not improve phylogenetic accuracy compared with homotachous models and was sometimes substantially less accurate. We conclude that a mixed branch length approach, although not the solution to all phylogenetic errors, is a valuable strategy for improving the accuracy of inferred trees.
36
|
Gruenheit N, Lockhart PJ, Steel M, Martin W. Difficulties in testing for covarion-like properties of sequences under the confounding influence of changing proportions of variable sites. Mol Biol Evol 2008; 25:1512-20. [PMID: 18424773 DOI: 10.1093/molbev/msn098] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Indexed: 11/12/2022]
Abstract
The covarion (COV)-like properties of sequences are poorly described and their impact on phylogenetic analyses poorly understood. We demonstrate using simulations that, under an evolutionary model where the proportion of variable sites changes in nonadjacent lineages, log likelihood values for rates-across-sites (RAS) and COV models become similar, making models difficult to distinguish. Further, although COV and RAS models provide a great improvement in likelihood scores over a homogeneous model with these simulated data, reconstruction accuracy of tree building is low, suggesting caution when it is suspected that proportions of variable sites differ in different evolutionary lineages. We study the performance of a recently developed contingency test that detects the presence of COV-type evolution modified for protein data. We report that if proportions of variable sites (p_var) change in a lineage-specific manner such that their distributions in different lineages become sufficiently nonoverlapping, then the contingency test can incorrectly suggest a homogeneous model. Also of concern is the possibility of different proportions of variable sites between the groups being studied. In a study of chloroplast proteins, interpretation of the test is found to be susceptible to different partitioning of taxon groups, making the test very subjective in its implementation. Extreme intergroup differences in the extent of divergence and difference in proportions of variable sites could be contributing to this effect.
Affiliation(s)
- Nicole Gruenheit
- Institute of Botany III, University of Düsseldorf, Düsseldorf, Germany.
37
|
Wu J, Susko E, Roger AJ. An independent heterotachy model and its implications for phylogeny and divergence time estimation. Mol Phylogenet Evol 2008; 46:801-6. [PMID: 17716923 DOI: 10.1016/j.ympev.2007.06.020] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 05/07/2007] [Revised: 06/13/2007] [Accepted: 06/29/2007] [Indexed: 10/23/2022]
Affiliation(s)
- Jihua Wu
- Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada B3H 3J5.
38
|
Whitfield JB, Kjer KM. Ancient rapid radiations of insects: challenges for phylogenetic analysis. Annu Rev Entomol 2008; 53:449-72. [PMID: 17877448 DOI: 10.1146/annurev.ento.53.103106.093304] [Citation(s) in RCA: 117] [Impact Index Per Article: 6.9] [Indexed: 05/17/2023]
Abstract
Phylogenies of major groups of insects based on both morphological and molecular data have sometimes been contentious, often lacking the data to distinguish between alternative views of relationships. This paucity of data is often due to real biological and historical causes, such as shortness of time spans between divergences for evolution to occur and long time spans after divergences for subsequent evolutionary changes to obscure the earlier ones. Another reason for difficulty in resolving some of the relationships using molecular data is the limited spectrum of genes so far developed for phylogeny estimation. For this latter issue, there is cause for current optimism owing to rapid increases in our knowledge of comparative genomics. At least some historical patterns of divergence may, however, continue to defy our attempts to completely reconstruct them with confidence, at least using current strategies.
Affiliation(s)
- James B Whitfield
- Department of Entomology, University of Illinois, Urbana, IL 61821, USA.
|
39
|
Wang HC, Susko E, Spencer M, Roger AJ. Topological estimation biases with covarion evolution. J Mol Evol 2007; 66:50-60. [PMID: 18080080 DOI: 10.1007/s00239-007-9062-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 08/01/2007] [Revised: 11/02/2007] [Accepted: 11/19/2007] [Indexed: 10/22/2022]
Abstract
Covarion processes allow changes in evolutionary rates at sites along the branches of a phylogenetic tree. Covarion-like evolution is increasingly recognized as an important mode of protein evolution. Several recent reports suggest that maximum likelihood estimation employing covarion models may support different optimal topologies than estimation using standard rates-across-sites (RAS) models. However, it remains to be demonstrated that ignoring covarion evolution will generally result in topological misestimation. In this study we performed analytical and theoretical studies of limiting distances under the covarion model and four-taxon tree simulations to investigate the extent to which the covarion process impacts on phylogenetic estimation. In particular, we assessed the limits of an RAS model-based maximum likelihood method to recover the phylogenies when the sequence data were simulated under the covarion processes. We find that, when ignored, covarion processes can induce systematic errors in phylogeny reconstruction. Surprisingly, when sequences are evolved under a covarion process but an RAS model is used for estimation, we find that a long branch repel bias occurs.
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada.
|
40
|
Zhou Y, Rodrigue N, Lartillot N, Philippe H. Evaluation of the models handling heterotachy in phylogenetic inference. BMC Evol Biol 2007; 7:206. [PMID: 17974035 PMCID: PMC2248194 DOI: 10.1186/1471-2148-7-206] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.9] [Received: 04/20/2007] [Accepted: 11/01/2007] [Indexed: 11/30/2022] Open
Abstract
Background The evolutionary rate at a given homologous position varies across time. When sufficiently pronounced, this phenomenon – called heterotachy – may produce artefactual phylogenetic reconstructions under the commonly used models of sequence evolution. These observations have motivated the development of models that explicitly recognize heterotachy, with research directions proposed along two main axes: 1) the covarion approach, where sites switch from variable to invariable states; and 2) the mixture of branch lengths (MBL) approach, where alignment patterns are assumed to arise from one of several sets of branch lengths, under a given phylogeny. Results Here, we report the first statistical comparisons contrasting the performance of covarion and MBL modeling strategies. Using simulations under heterotachous conditions, we explore the properties of three model comparison methods: the Akaike information criterion, the Bayesian information criterion, and cross validation. Although more time consuming, cross validation appears more reliable than AIC and BIC as it directly measures the predictive power of a model on 'future' data. We also analyze three large datasets (nuclear proteins of animals, mitochondrial proteins of mammals, and plastid proteins of plants), and find the optimal number of components of the MBL model to be two for all datasets, indicating that this model is preferred over the standard homogeneous model. However, the covarion model is always favored over the optimal MBL model. Conclusion We demonstrated, using three large datasets, that the covarion model is more efficient at handling heterotachy than the MBL model. This is probably due to the fact that the MBL model requires a serious increase in the number of parameters, as compared to two supplementary parameters of the covarion approach. 
Further improvements of both the mixture and the covarion approaches might be obtained by modeling heterogeneous behavior both along time and across sites.
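As a rough illustration of the two information criteria named in the abstract above, the sketch below computes AIC and BIC from a log-likelihood, a parameter count, and an alignment length. The function names (`aic`, `bic`, `rank_models`), the log-likelihoods, the parameter counts, and the alignment length are all hypothetical and are not taken from the study; the only detail borrowed from the abstract is that the covarion model adds two parameters to the homogeneous model. Cross-validation, which the authors report as more reliable, is not shown.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_lik

def rank_models(models, n):
    """Return per-model (AIC, BIC) scores and the best model under each."""
    scores = {name: (aic(ll, k), bic(ll, k, n)) for name, (ll, k) in models.items()}
    best_aic = min(scores, key=lambda name: scores[name][0])
    best_bic = min(scores, key=lambda name: scores[name][1])
    return scores, best_aic, best_bic

# Hypothetical fits: (log-likelihood, number of free parameters).
# The covarion model has two more parameters than the homogeneous model;
# the MBL entry assumes a second full set of branch lengths.
models = {
    "homogeneous": (-10500.0, 40),
    "covarion": (-10380.0, 42),
    "MBL-2": (-10400.0, 78),
}
n_sites = 5000  # hypothetical alignment length

scores, best_aic, best_bic = rank_models(models, n_sites)
for name, (a, b) in scores.items():
    print(f"{name}: AIC={a:.1f} BIC={b:.1f}")
print("best by AIC:", best_aic, "| best by BIC:", best_bic)
```

With these made-up numbers both criteria prefer the covarion fit, because the MBL fit pays a larger parameter penalty without reaching a higher likelihood; real data can of course rank the models differently.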
Affiliation(s)
- Yan Zhou
- Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Succursale Centre-Ville, Montréal, Québec H3C3J7, Canada.
|
41
|
Wildman DE, Uddin M, Opazo JC, Liu G, Lefort V, Guindon S, Gascuel O, Grossman LI, Romero R, Goodman M. Genomics, biogeography, and the diversification of placental mammals. Proc Natl Acad Sci U S A 2007; 104:14395-400. [PMID: 17728403 PMCID: PMC1958817 DOI: 10.1073/pnas.0704342104] [Citation(s) in RCA: 143] [Impact Index Per Article: 7.9] [Received: 05/04/2007] [Indexed: 11/18/2022] Open
Abstract
Previous molecular analyses of mammalian evolutionary relationships involving a wide range of placental mammalian taxa have been restricted in size from one to two dozen gene loci and have not decisively resolved the basal branching order within Placentalia. Here, on extracting from thousands of gene loci both their coding nucleotide sequences and translated amino acid sequences, we attempt to resolve key uncertainties about the ancient branching pattern of crown placental mammals. Focusing on approximately 1,700 conserved gene loci, those that have the more slowly evolving coding sequences, and using maximum-likelihood, Bayesian inference, maximum parsimony, and neighbor-joining (NJ) phylogenetic tree reconstruction methods, we find from almost all results that a clade (the southern Atlantogenata) composed of Afrotheria and Xenarthra is the sister group of all other (the northern Boreoeutheria) crown placental mammals, among boreoeutherians Rodentia groups with Lagomorpha, and the resultant Glires is close to Primates. Only the NJ tree for nucleotide sequences separates Rodentia (murids) first and then Lagomorpha (rabbit) from the other placental mammals. However, this nucleotide NJ tree still depicts Atlantogenata and Boreoeutheria but minus Rodentia and Lagomorpha. Moreover, the NJ tree for amino acid sequences does depict the basal separation to be between Atlantogenata and a Boreoeutheria that includes Rodentia and Lagomorpha. Crown placental mammalian diversification appears to be largely the result of ancient plate tectonic events that allowed time for convergent phenotypes to evolve in the descendant clades.
Affiliation(s)
- Derek E. Wildman
- Perinatology Research Branch, National Institute of Child Health and Human Development/National Institutes of Health, Department of Health and Human Services, Bethesda, MD 20892
- Center For Molecular Medicine and Genetics, and
- Departments of Obstetrics and Gynecology and
- Juan C. Opazo
- Center For Molecular Medicine and Genetics, and
- School of Biological Sciences, University of Nebraska, Lincoln, NE 68588; and
- Guozhen Liu
- Center For Molecular Medicine and Genetics, and
- Vincent Lefort
- Laboratory of Computer Science, Robotics, and Microelectronics, Centre National de la Recherche Scientifique, Université Montpellier II, 161 Rue Ada, 34392 Montpellier, France
- Stephane Guindon
- Laboratory of Computer Science, Robotics, and Microelectronics, Centre National de la Recherche Scientifique, Université Montpellier II, 161 Rue Ada, 34392 Montpellier, France
- Olivier Gascuel
- Laboratory of Computer Science, Robotics, and Microelectronics, Centre National de la Recherche Scientifique, Université Montpellier II, 161 Rue Ada, 34392 Montpellier, France
- Roberto Romero
- Perinatology Research Branch, National Institute of Child Health and Human Development/National Institutes of Health, Department of Health and Human Services, Bethesda, MD 20892
- Morris Goodman
- Center For Molecular Medicine and Genetics, and
- Anatomy and Cell Biology, Wayne State University, Detroit, MI 48201
|
42
|
Wägele JW, Mayer C. Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evol Biol 2007; 7:147. [PMID: 17725833 PMCID: PMC2040160 DOI: 10.1186/1471-2148-7-147] [Citation(s) in RCA: 91] [Impact Index Per Article: 5.1] [Received: 03/14/2007] [Accepted: 08/28/2007] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Published molecular phylogenies are usually based on data whose quality has not been explored prior to tree inference. This leads to errors because trees obtained with conventional methods suppress conflicting evidence, and because support values may be high even if there is no distinct phylogenetic signal. Tools that allow an a priori examination of data quality are rarely applied. RESULTS Using data from published molecular analyses on the phylogeny of crustaceans it is shown that tree topologies and popular support values do not show existing differences in data quality. To visualize variations in signal distinctness, we use network analyses based on split decomposition and split support spectra. Both methods show the same differences in data quality and the same clade-supporting patterns. Both methods are useful to discover long-branch effects. We discern three classes of long branch effects. Class I effects consist of attraction of terminal taxa caused by symplesiomorphies, which results in a false monophyly of paraphyletic groups. Addition of carefully selected taxa can fix this effect. Class II effects are caused by drastic signal erosion. Long branches affected by this phenomenon usually slip down the tree to form false clades that in reality are polyphyletic. To recover the correct phylogeny, more conservative genes must be used. Class III effects consist of attraction due to accumulated chance similarities or convergent character states. This sort of noise can be reduced by selecting less variable portions of the data set, avoiding biases, and adding slower genes. CONCLUSION To increase confidence in molecular phylogenies an exploratory analysis of the signal to noise ratio can be conducted with split decomposition methods. If long-branch effects are detected, it is necessary to discern between three classes of effects to find the best approach for an improvement of the raw data.
Affiliation(s)
- Christoph Mayer
- Lehrstuhl Spezielle Zoologie, Faculty of Biology, University Bochum, 44780 Bochum, Germany
|
43
|
Merlo LMF, Lunzer M, Dean AM. An empirical test of the concomitantly variable codon hypothesis. Proc Natl Acad Sci U S A 2007; 104:10938-43. [PMID: 17578921 PMCID: PMC1904112 DOI: 10.1073/pnas.0701900104] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022] Open
Abstract
A central assumption of models of molecular evolution, that each site in a sequence evolves independently of all other sites, lacks empirical support. We investigated the extent to which sites evolve codependently in triosephosphate isomerase (TIM), a ubiquitous glycolytic enzyme conserved in both structure and function. Codependencies among sites, or concomitantly variable codons (covarions), are evident from the reduced function and misfolding of hybrid TIM proteins. Although they exist, we find covarions are relatively rare, and closely related proteins are unlikely to have developed them. However, the potential for covarions increases with genetic distance so that highly divergent proteins may have evolved codependencies between many sites. The evolution of covarions undermines a key assumption in phylogenetics and calls into question our ability to disentangle ancient relationships among major taxonomic groups.
Affiliation(s)
- Lauren M. F. Merlo
- Department of Ecology, Evolution, and Behavior, University of Minnesota, 100 Ecology Building, 1987 Upper Buford Circle, Saint Paul, MN 55108
- Mark Lunzer
- BioTechnology Institute, University of Minnesota, 140 Gortner Laboratory, 1479 Gortner Avenue, Saint Paul, MN 55108
- Antony M. Dean
- Department of Ecology, Evolution, and Behavior, University of Minnesota, 100 Ecology Building, 1987 Upper Buford Circle, Saint Paul, MN 55108
- BioTechnology Institute, University of Minnesota, 140 Gortner Laboratory, 1479 Gortner Avenue, Saint Paul, MN 55108
|
44
|
Ruano-Rubio V, Fares MA. Artifactual phylogenies caused by correlated distribution of substitution rates among sites and lineages: the good, the bad, and the ugly. Syst Biol 2007; 56:68-82. [PMID: 17366138 DOI: 10.1080/10635150601175578] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 10/23/2022] Open
Abstract
Despite the advances in understanding molecular evolution, current phylogenetic methods barely take account of a fraction of the complexity of evolution. We are chiefly constrained by our incomplete knowledge of molecular evolutionary processes and the limits of computational power. These limitations lead to the establishment of either biologically simplistic models that rarely account for a fraction of the complexity involved or overfitting models that add little resolution to the problem. Such oversimplified models may lead us to assign high confidence to an incorrect tree (inconsistency). Rate-across-site (RAS) models are commonly used evolutionary models in phylogenetic studies. These account for heterogeneity in the evolutionary rates among sites but do not account for changing within-site rates across lineages (heterotachy). If heterotachy is common, using RAS models may lead to systematic errors in tree inference. In this work we show possible misleading effects in tree inference when the assumption of constant within-site rates across lineages is violated using maximum likelihood. Using a simulation study, we explore the ways in which gamma stationary models can lead to wrong topology or to deceptive bootstrap support values when the within-site rates change across lineages. More precisely, we show that different degrees of heterotachy mislead phylogenetic inference when the model assumed is stationary. Finally, we propose a geometry-based approach to visualize and to test for the possible existence of bias due to heterotachy.
Affiliation(s)
- Valentin Ruano-Rubio
- Molecular Evolution and Bioinformatics Laboratory, Department of Biology, National University of Ireland, Maynooth, Ireland
|
45
|
Sanchez-Puerta MV, Bachvaroff TR, Delwiche CF. Sorting wheat from chaff in multi-gene analyses of chlorophyll c-containing plastids. Mol Phylogenet Evol 2007; 44:885-97. [PMID: 17449283 DOI: 10.1016/j.ympev.2007.03.003] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Received: 09/21/2006] [Revised: 02/24/2007] [Accepted: 03/05/2007] [Indexed: 10/23/2022]
Abstract
Photosynthetic eukaryotes contain primary, secondary or tertiary plastids, depending on the source of the organelle (a cyanobacterium or a photosynthetic eukaryote). Plastid phylogeny is relatively well investigated, but molecular phylogenies have conflicted as a function of gene choice, taxon-representations, and analytical method. To better understand the influences of these variables, we performed analyses of a multi-gene data set based on 62 plastid-associated genes of 15 taxa representing the major plastid lineages. In an attempt to distinguish phylogenetic signal from non-phylogenetic patterns, we analyzed the data using a wide range of phylogenetic methods and examined the effect of covarion evolution and compositional bias. The data suggest that the chlorophyll c-containing plastids are monophyletic and acquired their plastids from the red algae after the emergence of the Cyanidiales. The relationships among chl c-containing plastids are particularly hard to resolve. This is the largest data set used for this purpose; the analyses show that cryptophyte plastids are sister to other chl c-containing plastids, and haptophyte and peridinin-containing dinoflagellate plastids are closely related.
Affiliation(s)
- M Virginia Sanchez-Puerta
- Department of Cell Biology and Molecular Genetics, University of Maryland College Park, College Park, MD 20742-5815, USA.
|
46
|
Lartillot N, Brinkmann H, Philippe H. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol 2007; 7 Suppl 1:S4. [PMID: 17288577 PMCID: PMC1796613 DOI: 10.1186/1471-2148-7-s1-s4] [Citation(s) in RCA: 426] [Impact Index Per Article: 23.7] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Thanks to the large amount of signal contained in genome-wide sequence alignments, phylogenomic analyses are converging towards highly supported trees. However, high statistical support does not imply that the tree is accurate. Systematic errors, such as the Long Branch Attraction (LBA) artefact, can be misleading, in particular when the taxon sampling is poor, or the outgroup is distant. In an otherwise consistent probabilistic framework, systematic errors in genome-wide analyses can be traced back to model mis-specification problems, which suggests that better models of sequence evolution should be devised, that would be more robust to tree reconstruction artefacts, even under the most challenging conditions. METHODS We focus on a well characterized LBA artefact analyzed in a previous phylogenomic study of the metazoan tree, in which two fast-evolving animal phyla, nematodes and platyhelminths, emerge either at the base of all other Bilateria, or within protostomes, depending on the outgroup. We use this artefactual result as a case study for comparing the robustness of two alternative models: a standard, site-homogeneous model, based on an empirical matrix of amino-acid replacement (WAG), and a site-heterogeneous mixture model (CAT). In parallel, we propose a posterior predictive test, allowing one to measure how well a model acknowledges sequence saturation. RESULTS Adopting a Bayesian framework, we show that the LBA artefact observed under WAG disappears when the site-heterogeneous model CAT is used. Using cross-validation, we further demonstrate that CAT has a better statistical fit than WAG on this data set. Finally, using our statistical goodness-of-fit test, we show that CAT, but not WAG, correctly accounts for the overall level of saturation, and that this is due to a better estimation of site-specific amino-acid preferences. 
CONCLUSION The CAT model appears to be more robust than WAG against LBA artefacts, essentially because it correctly anticipates the high probability of convergences and reversions implied by the small effective size of the amino-acid alphabet at each site of the alignment. More generally, our results provide strong evidence that site-specificities in the substitution process need be accounted for in order to obtain more reliable phylogenetic trees.
Affiliation(s)
- Nicolas Lartillot
- Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, UMR 5506, CNRS-Université de Montpellier 2, 161, rue Ada, 34392 Montpellier Cedex 5, France
- Henner Brinkmann
- Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Montréal, Québec Canada
- Hervé Philippe
- Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Montréal, Québec Canada
|
47
|
Abstract
Background The rate of evolution varies spatially along genomes and temporally in time. The presence of evolutionary rate variation is an informative signal that often marks functional regions of genomes and historical selection events. There exist many tests for temporal rate variation, or heterotachy, that start by partitioning sampled sequences into two or more groups and testing rate homogeneity among the groups. I develop a Bayesian method to infer phylogenetic trees with a divergence point, or dramatic temporal shifts in selection pressure that affect many nucleotide sites simultaneously, located at an unknown position in the tree. Results Simulation demonstrates that the method is most able to detect divergence points when rate variation and the number of affected sites is high, but not beyond biologically relevant values. The method is applied to two viral data sets. A divergence point is identified separating the B and C subtypes, two genetically distinct variants of HIV that have spread into different human populations with the AIDS epidemic. In contrast, no strong signal of temporal rate variation is found in a sample of F and H genotypes, two genetic variants of HBV that have likely evolved with humans during their immigration and expansion into the Americas. Conclusion Temporal shifts in evolutionary rate of sufficient magnitude are detectable in the history of sampled sequences. The ability to detect such divergence points without the need to specify a prior hypothesis about the location or timing of the divergence point should help scientists identify historically important selection events and decipher mechanisms of evolution.
Affiliation(s)
- Karin S Dorman
- Department of Statistics, and the Program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA, USA.
|
48
|
Parfrey LW, Barbero E, Lasser E, Dunthorn M, Bhattacharya D, Patterson DJ, Katz LA. Evaluating support for the current classification of eukaryotic diversity. PLoS Genet 2006; 2:e220. [PMID: 17194223 PMCID: PMC1713255 DOI: 10.1371/journal.pgen.0020220] [Citation(s) in RCA: 125] [Impact Index Per Article: 6.6] [Received: 05/11/2006] [Accepted: 11/09/2006] [Indexed: 11/19/2022] Open
Abstract
Perspectives on the classification of eukaryotic diversity have changed rapidly in recent years, as the four eukaryotic groups within the five-kingdom classification—plants, animals, fungi, and protists—have been transformed through numerous permutations into the current system of six “supergroups.” The intent of the supergroup classification system is to unite microbial and macroscopic eukaryotes based on phylogenetic inference. This supergroup approach is increasing in popularity in the literature and is appearing in introductory biology textbooks. We evaluate the stability and support for the current six-supergroup classification of eukaryotes based on molecular genealogies. We assess three aspects of each supergroup: (1) the stability of its taxonomy, (2) the support for monophyly (single evolutionary origin) in molecular analyses targeting a supergroup, and (3) the support for monophyly when a supergroup is included as an out-group in phylogenetic studies targeting other taxa. Our analysis demonstrates that supergroup taxonomies are unstable and that support for groups varies tremendously, indicating that the current classification scheme of eukaryotes is likely premature. We highlight several trends contributing to the instability and discuss the requirements for establishing robust clades within the eukaryotic tree of life. Evolutionary perspectives, including the classification of living organisms, provide the unifying scaffold on which biological knowledge is assembled. Researchers in many areas of biology use evolutionary classifications (taxonomy) in many ways, including as a means for interpreting the origin of evolutionary innovations, as a framework for comparative genetics/genomics, and as the basis for drawing broad conclusions about the diversity of living organisms. Thus, it is essential that taxonomy be robust. 
Here the authors evaluate the stability of and support for the current classification system of eukaryotic cells (cells with nuclei) in which eukaryotes are divided into six kingdom level categories, or supergroups. These six supergroups unite diverse microbial and macrobial eukaryotic lineages, including the well-known groups of plants, animals, and fungi. The authors assess the stability of supergroup classifications through time and reveal a rapidly changing taxonomic landscape that is difficult to navigate for the specialist and generalist alike. Additionally, the authors find variable support for each of the supergroups in published analyses based on DNA sequence variation. The support for supergroups differs according to the taxonomic area under study and the origin of the genes (e.g., nuclear, plastid) used in the analysis. Encouragingly, combining a conservative approach to taxonomy with increased sampling of microbial eukaryotes and the use of multiple types of data is likely to produce a robust scaffold for the eukaryotic tree of life.
Affiliation(s)
- Laura Wegener Parfrey
- Program in Organismic and Evolutionary Biology, University of Massachusetts, Amherst, Massachusetts, United States of America
- Erika Barbero
- Department of Biological Sciences, Smith College, Northampton, Massachusetts, United States of America
- Elyse Lasser
- Department of Biological Sciences, Smith College, Northampton, Massachusetts, United States of America
- Micah Dunthorn
- Program in Organismic and Evolutionary Biology, University of Massachusetts, Amherst, Massachusetts, United States of America
- Debashish Bhattacharya
- Department of Biological Sciences, University of Iowa, Iowa City, Iowa, United States of America
- Roy J. Carver Center for Comparative Genomics, University of Iowa, Iowa City, Iowa, United States of America
- David J Patterson
- Bay Paul Center for Genomics, Marine Biological Laboratory, Woods Hole, Massachusetts, United States of America
- Laura A Katz
- Program in Organismic and Evolutionary Biology, University of Massachusetts, Amherst, Massachusetts, United States of America
- Department of Biological Sciences, Smith College, Northampton, Massachusetts, United States of America
|
49
|
Wang HC, Spencer M, Susko E, Roger AJ. Testing for covarion-like evolution in protein sequences. Mol Biol Evol 2006; 24:294-305. [PMID: 17056642 DOI: 10.1093/molbev/msl155] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.3] [Indexed: 11/13/2022] Open
Abstract
The covarion hypothesis of molecular evolution proposes that selective pressures on an amino acid or nucleotide site change through time, thus causing changes of evolutionary rate along the edges of a phylogenetic tree. Several kinds of Markov models for the covarion process have been proposed. One model, proposed by Huelsenbeck (2002), has 2 substitution rate classes: the substitution process at a site can switch between a single variable rate, drawn from a discrete gamma distribution, and a zero invariable rate. A second model, suggested by Galtier (2001), assumes rate switches among an arbitrary number of rate classes but switching to and from the invariable rate class is not allowed. The latter model allows for some sites that do not participate in the rate-switching process. Here we propose a general covarion model that combines features of both models, allowing evolutionary rates not only to switch between variable and invariable classes but also to switch among different rates when they are in a variable state. We have implemented all 3 covarion models in a maximum likelihood framework for amino acid sequences and tested them on 23 protein data sets. We found significant likelihood increases for all data sets for the 3 models, compared with a model that does not allow site-specific rate switches along the tree. Furthermore, we found that the general model fit the data better than the simpler covarion models in the majority of the cases, highlighting the complexity in modeling the covarion process. The general covarion model can be used for comparing tree topologies, molecular dating studies, and the investigation of protein adaptation.
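To make the on/off switching idea in the abstract above concrete, the following sketch builds a generator (rate) matrix for a simplified two-class covarion process: each site is either "on" (substitutions follow a base rate matrix) or "off" (no substitutions), and it switches between the two at fixed rates. This is an illustrative simplification, not the authors' implementation: the function name `covarion_generator`, the parameters `s_on` and `s_off`, and the 4-state Jukes-Cantor example are assumptions, and gamma-distributed rates and the general model's switching among several variable rates are omitted.

```python
def covarion_generator(q, s_on, s_off):
    """Build a 2n x 2n generator from an n x n substitution rate matrix q.

    States 0..n-1 are 'on' (character i, evolving); states n..2n-1 are
    'off' (character i, frozen). s_on is the off->on switching rate and
    s_off is the on->off switching rate (both hypothetical parameters).
    """
    n = len(q)
    size = 2 * n
    g = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            if i != j:
                g[i][j] = q[i][j]   # substitutions happen only while 'on'
        g[i][n + i] = s_off         # on -> off, character unchanged
        g[n + i][i] = s_on          # off -> on, character unchanged
    # Diagonal entries make every row sum to zero, as a generator requires.
    for r in range(size):
        g[r][r] = -sum(g[r][c] for c in range(size) if c != r)
    return g

# Example base process: Jukes-Cantor on 4 states (assumed for illustration).
jc = [[-1.0 if i == j else 1.0 / 3 for j in range(4)] for i in range(4)]
g = covarion_generator(jc, s_on=0.2, s_off=0.5)
for row in g:
    print(" ".join(f"{x:6.3f}" for x in row))
```

The lower-right block of the matrix is zero off the diagonal, which is what encodes the invariable state: a site that is "off" can only switch back "on", never change character.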
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada.
|
50
|
Mallatt J, Giribet G. Further use of nearly complete 28S and 18S rRNA genes to classify Ecdysozoa: 37 more arthropods and a kinorhynch. Mol Phylogenet Evol 2006; 40:772-94. [PMID: 16781168 DOI: 10.1016/j.ympev.2006.04.021] [Citation(s) in RCA: 183] [Impact Index Per Article: 9.6] [Received: 12/09/2005] [Revised: 02/28/2006] [Accepted: 04/03/2006] [Indexed: 10/24/2022]
Abstract
This work expands on a study from 2004 by Mallatt, Garey, and Shultz [Mallatt, J.M., Garey, J.R., Shultz, J.W., 2004. Ecdysozoan phylogeny and Bayesian inference: first use of nearly complete 28S and 18S rRNA gene sequences to classify the arthropods and their kin. Mol. Phylogenet. Evol. 31, 178-191] that evaluated the phylogenetic relationships in Ecdysozoa (molting animals), especially arthropods. Here, the number of rRNA gene-sequences was effectively doubled for each major group of arthropods, and sequences from the phylum Kinorhyncha (mud dragons) were also included, bringing the number of ecdysozoan taxa to over 80. The methods emphasized maximum likelihood, Bayesian inference and statistical testing with parametric bootstrapping, but also included parsimony and minimum evolution. Prominent findings from our combined analysis of both genes are as follows. The fundamental subdivisions of Hexapoda (insects and relatives) are Insecta and Entognatha, with the latter consisting of collembolans (springtails) and a clade of proturans plus diplurans. Our rRNA-gene data provide the strongest evidence to date that the sister group of Hexapoda is Branchiopoda (fairy shrimps, tadpole shrimps, etc.), not Malacostraca. The large, Pancrustacea clade (hexapods within a paraphyletic Crustacea) divided into a few basic subclades: hexapods plus branchiopods; cirripedes (barnacles) plus malacostracans (lobsters, crabs, true shrimps, isopods, etc.); and the basally located clades of (a) ostracods (seed shrimps) and (b) branchiurans (fish lice) plus the bizarre pentastomids (tongue worms). These findings about Pancrustacea agree with a recent study by Regier, Shultz, and Kambic that used entirely different genes [Regier, J.C., Shultz, J.W., Kambic, R.E., 2005a. Pancrustacean phylogeny: hexapods are terrestrial crustaceans and maxillopods are not monophyletic. Proc. R. Soc. B 272, 395-401]. 
In Malacostraca, the stomatopod (mantis shrimp) was not at the base of the eumalacostracans, as is widely claimed, but grouped instead with an euphausiacean (krill). Within centipedes, Craterostigmus was the sister to all other pleurostigmophorans, contrary to the consensus view. Our trees also united myriapods (millipedes and centipedes) with chelicerates (horseshoe crabs, spiders, scorpions, and relatives) and united pycnogonids (sea spiders) with chelicerates, but with much less support than in the previous rRNA-gene study. Finally, kinorhynchs joined priapulans (penis worms) at the base of Ecdysozoa.
Affiliation(s)
- Jon Mallatt
- School of Biological Sciences, Washington State University, Pullman, 99164-4236, USA.
|