1
Szánthó LL, Lartillot N, Szöllősi GJ, Schrempf D. Compositionally Constrained Sites Drive Long-Branch Attraction. Syst Biol 2023; 72:767-780. [PMID: 36946562 PMCID: PMC10405358 DOI: 10.1093/sysbio/syad013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2022] [Revised: 03/01/2023] [Accepted: 03/16/2023] [Indexed: 03/23/2023]
Abstract
Accurate phylogenies are fundamental to our understanding of the pattern and process of evolution. Yet, phylogenies at deep evolutionary timescales, with correspondingly long branches, have been fraught with controversy resulting from conflicting estimates from models with varying complexity and goodness of fit. Analyses of historical as well as current empirical datasets, such as alignments including Microsporidia, Nematoda, or Platyhelminthes, have demonstrated that inadequate modeling of across-site compositional heterogeneity, which is the result of biochemical constraints that lead to varying patterns of accepted amino acids along sequences, can lead to erroneous topologies that are strongly supported. Unfortunately, models that adequately account for across-site compositional heterogeneity remain computationally challenging or intractable for an increasing fraction of contemporary datasets. Here, we introduce "compositional constraint analysis," a method to investigate the effect of site-specific constraints on amino acid composition on phylogenetic inference. We show that more constrained sites with lower diversity and less constrained sites with higher diversity exhibit ostensibly conflicting signals under models ignoring across-site compositional heterogeneity that lead to long-branch attraction artifacts and demonstrate that more complex models accounting for across-site compositional heterogeneity can ameliorate this bias. We present CAT-posterior mean site frequencies (PMSF), a pipeline for diagnosing and resolving phylogenetic bias resulting from inadequate modeling of across-site compositional heterogeneity based on the CAT model. CAT-PMSF is robust against long-branch attraction in all alignments we have examined. We suggest using CAT-PMSF when convergence of the CAT model cannot be assured. 
We find evidence that compositionally constrained sites are driving long-branch attraction in two metazoan datasets and recover evidence for Porifera as the sister group to all other animals. [Animal phylogeny; cross-site heterogeneity; long-branch attraction; phylogenomics.].
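The abstract contrasts more constrained sites (lower diversity) with less constrained sites (higher diversity). The following is only a minimal sketch of how alignment columns could be split along those lines. It assumes, purely for illustration, that diversity is measured as the number of distinct amino acids observed at a site; the abstract does not state the measure the authors use, and the function names and the threshold are hypothetical.

```python
def site_diversity(alignment):
    """Count the distinct amino acids in each alignment column.

    `alignment` is a list of equal-length sequences. Gap and
    unknown characters are ignored. This count is an assumed
    stand-in for the diversity measure, not the paper's method.
    """
    ignored = {"-", "?", "X"}
    n_sites = len(alignment[0])
    diversities = []
    for i in range(n_sites):
        observed = {seq[i] for seq in alignment if seq[i] not in ignored}
        diversities.append(len(observed))
    return diversities


def split_by_constraint(alignment, threshold):
    """Return (constrained, relaxed) column indices.

    Columns with diversity <= threshold are treated as more
    constrained; the threshold is a hypothetical parameter.
    """
    diversities = site_diversity(alignment)
    constrained = [i for i, d in enumerate(diversities) if d <= threshold]
    relaxed = [i for i, d in enumerate(diversities) if d > threshold]
    return constrained, relaxed
```

The two index lists could then be used to build sub-alignments that are analyzed separately under the same model, which is the kind of comparison the abstract describes.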
Affiliation(s)
- Lénárd L Szánthó
  - Department of Biological Physics, Eötvös University, Budapest, Hungary
  - ELTE-MTA “Lendület” Evolutionary Genomics Research Group, Budapest, Hungary
  - Institute of Evolution, Centre for Ecological Research, Budapest, Hungary
- Nicolas Lartillot
  - Laboratoire de Biométrie et Biologie Evolutive UMR 5558, CNRS, Université de Lyon, Villeurbanne, France
- Gergely J Szöllősi
  - Department of Biological Physics, Eötvös University, Budapest, Hungary
  - ELTE-MTA “Lendület” Evolutionary Genomics Research Group, Budapest, Hungary
  - Institute of Evolution, Centre for Ecological Research, Budapest, Hungary
- Dominik Schrempf
  - Department of Biological Physics, Eötvös University, Budapest, Hungary
2
Goremykin V. Assessment of Absolute Substitution Model Fit Accommodating Time-Reversible and Non-Time-Reversible Evolutionary Processes. Syst Biol 2022:6632685. [PMID: 35792853 DOI: 10.1093/sysbio/syac046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2021] [Revised: 06/16/2022] [Accepted: 06/24/2022] [Indexed: 11/13/2022]
Abstract
The loss of information accompanying assessment of absolute fit of substitution models to phylogenetic data negatively affects the discriminatory power of previous methods and can make them insensitive to lineage-specific changes in the substitution process. As an alternative, I propose evaluating absolute fit of substitution models based on a novel statistic which describes the observed data without information loss and which is unlikely to become zero-inflated with increasing numbers of taxa. This method can accommodate gaps and is sensitive to lineage-specific shifts in the substitution process. In simulation experiments, it exhibits greater discriminatory power than previous methods. The method can be implemented in both Bayesian and Maximum Likelihood phylogenetic analyses, and used to screen any set of models. Recently, it has been suggested that model selection may be an unnecessary step in phylogenetic inference. However, results presented here emphasize the importance of model fit assessment for reliable phylogenetic inference.
Affiliation(s)
- Vadim Goremykin
  - Research and Innovation Centre, Fondazione Edmund Mach, 38010 San Michele all'Adige (TN), Italy
3
Cho A, Tikhonenkov DV, Hehenberger E, Karnkowska A, Mylnikov AP, Keeling PJ. Monophyly of Diverse Bigyromonadea and their Impact on Phylogenomic Relationships Within Stramenopiles. Mol Phylogenet Evol 2022; 171:107468. [DOI: 10.1016/j.ympev.2022.107468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2021] [Revised: 02/11/2022] [Accepted: 02/22/2022] [Indexed: 10/18/2022]
4
Williams TA, Schrempf D, Szöllősi GJ, Cox CJ, Foster PG, Embley TM. Inferring the deep past from molecular data. Genome Biol Evol 2021; 13:6192802. [PMID: 33772552 PMCID: PMC8175050 DOI: 10.1093/gbe/evab067] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Accepted: 03/22/2021] [Indexed: 12/17/2022]
Abstract
There is an expectation that analyses of molecular sequences might be able to distinguish between alternative hypotheses for ancient relationships, but the phylogenetic methods used and types of data analyzed are of critical importance in any attempt to recover historical signal. Here, we discuss some common issues that can influence the topology of trees obtained when using overly simple models to analyze molecular data that often display complicated patterns of sequence heterogeneity. To illustrate our discussion, we have used three examples of inferred relationships which have changed radically as models and methods of analysis have improved. In two of these examples, the sister-group relationship between thermophilic Thermus and mesophilic Deinococcus, and the position of long-branch Microsporidia among eukaryotes, we show that recovering what is now generally considered to be the correct tree is critically dependent on the fit between model and data. In the third example, the position of eukaryotes in the tree of life, the hypothesis that is currently supported by the best available methods is fundamentally different from the classical view of relationships between major cellular domains. Since heterogeneity appears to be pervasive and varied among all molecular sequence data, and even the best available models can still struggle to deal with some problems, the issues we discuss are generally relevant to phylogenetic analyses. It remains essential to maintain a critical attitude to all trees as hypotheses of relationship that may change with more data and better methods.
Affiliation(s)
- Tom A Williams
  - School of Biological Sciences, University of Bristol, Bristol BS8 1TQ, United Kingdom
- Dominik Schrempf
  - Dept. of Biological Physics, Eötvös Loránd University, 1117 Budapest, Hungary
- Gergely J Szöllősi
  - Dept. of Biological Physics, Eötvös Loránd University, 1117 Budapest, Hungary
  - MTA-ELTE "Lendület" Evolutionary Genomics Research Group, 1117 Budapest, Hungary
  - Institute of Evolution, Centre for Ecological Research, 1121 Budapest, Hungary
- Cymon J Cox
  - Centro de Ciências do Mar, Universidade do Algarve, Gambelas, 8005-319 Faro, Portugal
- Peter G Foster
  - Department of Life Sciences, Natural History Museum, London SW7 5BD, United Kingdom
- T Martin Embley
  - Biosciences Institute, Centre for Bacterial Cell Biology, Newcastle University, Newcastle upon Tyne NE2 4AX, United Kingdom
5
Abstract
Knowing phylogenetic relationships among species is fundamental for many studies in biology. An accurate phylogenetic tree underpins our understanding of the major transitions in evolution, such as the emergence of new body plans or metabolism, and is key to inferring the origin of new genes, detecting molecular adaptation, understanding morphological character evolution and reconstructing demographic changes in recently diverged species. Although data are ever more plentiful and powerful analysis methods are available, there remain many challenges to reliable tree building. Here, we discuss the major steps of phylogenetic analysis, including identification of orthologous genes or proteins, multiple sequence alignment, and choice of substitution models and inference methodologies. Understanding the different sources of errors and the strategies to mitigate them is essential for assembling an accurate tree of life.
6
Cedrola F, Senra MVX, Rossi MF, Fregulia P, D’Agosto M, Dias RJP. Trichostomatid Ciliates (Alveolata, Ciliophora, Trichostomatia) Systematics and Diversity: Past, Present, and Future. Front Microbiol 2020; 10:2967. [PMID: 32010077 PMCID: PMC6974537 DOI: 10.3389/fmicb.2019.02967] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 06/27/2019] [Accepted: 12/09/2019] [Indexed: 01/14/2023]
Abstract
The gastrointestinal tracts of most herbivorous mammals are colonized by symbiotic ciliates of the subclass Trichostomatia, which form a well-supported monophyletic group currently comprising ∼1,000 species, 129 genera, and 21 families, distributed among three orders: Entodiniomorphida, Macropodiniida, and Vestibuliferida. In recent years, trichostomatid ciliates have featured in many functional studies, such as those focusing on optimizing host feeding efficiency and those investigating their role in gastrointestinal methanogenesis, as many trichostomatids are known to establish endosymbiotic associations with methanogenic Archaea. However, the systematics of trichostomatids presents many inconsistencies. Here, we stress the importance of further taxonomic work to improve classification schemes for this group of organisms, preparing the ground for the proper development of such applied studies. We present a historical review of the systematics of the subclass Trichostomatia, highlighting taxonomic problems and inconsistencies. We then discuss possible solutions to these issues and propose future directions to advance our understanding of the taxonomy and evolution of these symbiotic microeukaryotes.
Affiliation(s)
- Franciane Cedrola
  - Laboratório de Protozoologia, Programa de Pós-graduação em Comportamento e Biologia Animal, Instituto de Ciências Biológicas, Universidade Federal de Juiz de Fora, Juiz de Fora, Brazil
- Marcus Vinicius Xavier Senra
  - Laboratório de Protozoologia, Programa de Pós-graduação em Comportamento e Biologia Animal, Instituto de Ciências Biológicas, Universidade Federal de Juiz de Fora, Juiz de Fora, Brazil
  - Instituto de Recursos Naturais Renováveis, Universidade Federal de Itajubá, Itajubá, Brazil
- Mariana Fonseca Rossi
  - Laboratório de Protozoologia, Programa de Pós-graduação em Comportamento e Biologia Animal, Instituto de Ciências Biológicas, Universidade Federal de Juiz de Fora, Juiz de Fora, Brazil
- Priscila Fregulia
  - Laboratório de Protozoologia, Programa de Pós-graduação em Comportamento e Biologia Animal, Instituto de Ciências Biológicas, Universidade Federal de Juiz de Fora, Juiz de Fora, Brazil
- Marta D’Agosto
  - Laboratório de Protozoologia, Programa de Pós-graduação em Comportamento e Biologia Animal, Instituto de Ciências Biológicas, Universidade Federal de Juiz de Fora, Juiz de Fora, Brazil
- Roberto Júnio Pedroso Dias
  - Laboratório de Protozoologia, Programa de Pós-graduação em Comportamento e Biologia Animal, Instituto de Ciências Biológicas, Universidade Federal de Juiz de Fora, Juiz de Fora, Brazil
7
Crotty SM, Minh BQ, Bean NG, Holland BR, Tuke J, Jermiin LS, von Haeseler A. GHOST: Recovering Historical Signal from Heterotachously Evolved Sequence Alignments. Syst Biol 2019; 69:249-264. [DOI: 10.1093/sysbio/syz051] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 08/04/2017] [Revised: 07/18/2019] [Accepted: 07/22/2019] [Indexed: 01/01/2023]
Abstract
Molecular sequence data that have evolved under the influence of heterotachous evolutionary processes are known to mislead phylogenetic inference. We introduce the General Heterogeneous evolution On a Single Topology (GHOST) model of sequence evolution, implemented under a maximum-likelihood framework in the phylogenetic program IQ-TREE (http://www.iqtree.org). Simulations show that using the GHOST model, IQ-TREE can accurately recover the tree topology, branch lengths, and substitution model parameters from heterotachously evolved sequences. We investigate the performance of the GHOST model on empirical data by sampling phylogenomic alignments of varying lengths from a plastome alignment. We then carry out inference under the GHOST model on a phylogenomic data set composed of 248 genes from 16 taxa, where we find the GHOST model concurs with the currently accepted view, placing turtles as a sister lineage of archosaurs, in contrast to results obtained using traditional variable rates-across-sites models. Finally, we apply the model to a data set composed of a sodium channel gene of 11 fish taxa, finding that the GHOST model is able to elucidate a subtle component of the historical signal, linked to the previously established convergent evolution of the electric organ in two geographically distinct lineages of electric fish. We compare inference under the GHOST model to partitioning by codon position and show that, owing to the minimization of model constraints, the GHOST model offers unique biological insights when applied to empirical data.
Affiliation(s)
- Stephen M Crotty
  - Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria
  - School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
- Bui Quang Minh
  - Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- Nigel G Bean
  - School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - ARC Centre of Excellence for Mathematical and Statistical Frontiers, The University of Adelaide, Adelaide, SA, Australia
- Barbara R Holland
  - School of Natural Sciences, University of Tasmania, Hobart, TAS 7001, Australia
- Jonathan Tuke
  - School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - ARC Centre of Excellence for Mathematical and Statistical Frontiers, The University of Adelaide, Adelaide, SA, Australia
- Lars S Jermiin
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
  - CSIRO Land & Water, Black Mountain Laboratories, Canberra, ACT 2601, Australia
  - School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
  - Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
- Arndt von Haeseler
  - Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria
  - Bioinformatics & Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria
8
Nute M, Saleh E, Warnow T. Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets. Syst Biol 2019; 68:396-411. [PMID: 30329135 PMCID: PMC6472439 DOI: 10.1093/sysbio/syy068] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/18/2018] [Revised: 09/27/2018] [Accepted: 10/11/2018] [Indexed: 01/15/2023]
Abstract
The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.]
Affiliation(s)
- Michael Nute
  - Department of Statistics, University of Illinois at Urbana-Champaign, 725 S Wright St #101, Champaign, IL 61820, USA
- Ehsan Saleh
  - Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1205 W. Clark St., Urbana, IL 61801, USA
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
9
Dobrin BH, Zwickl DJ, Sanderson MJ. The prevalence of terraced treescapes in analyses of phylogenetic data sets. BMC Evol Biol 2018; 18:46. [PMID: 29618314 PMCID: PMC5885316 DOI: 10.1186/s12862-018-1162-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 07/29/2017] [Accepted: 03/22/2018] [Indexed: 11/21/2022]
Abstract
BACKGROUND The pattern of data availability in a phylogenetic data set may lead to the formation of terraces, collections of equally optimal trees. Terraces can arise in tree space if trees are scored with parsimony or with partitioned, edge-unlinked maximum likelihood. Theory predicts that terraces can be large, but their prevalence in contemporary data sets has never been surveyed. We selected 26 data sets and phylogenetic trees reported in recent literature and investigated the terraces to which the trees would belong, under a common set of inference assumptions. We examined terrace size as a function of the sampling properties of the data sets, including taxon coverage density (the proportion of taxon-by-gene positions with any data present) and a measure of gene sampling "sufficiency". We evaluated each data set in relation to the theoretical minimum gene sampling depth needed to reduce terrace size to a single tree, and explored the impact of the terraces found in replicate trees in bootstrap methods. RESULTS Terraces were identified in nearly all data sets with taxon coverage densities < 0.90. They were not found, however, in high-coverage-density (i.e., ≥ 0.94) transcriptomic and genomic data sets. The terraces could be very large, and size varied inversely with taxon coverage density and with gene sampling sufficiency. Few data sets achieved a theoretical minimum gene sampling depth needed to reduce terrace size to a single tree. Terraces found during bootstrap resampling reduced overall support. CONCLUSIONS If certain inference assumptions apply, trees estimated from empirical data sets often belong to large terraces of equally optimal trees. Terrace size correlates to data set sampling properties. Data sets seldom include enough genes to reduce terrace size to one tree. When bootstrap replicate trees lie on a terrace, statistical support for phylogenetic hypotheses may be reduced. 
Although some of the published analyses surveyed were conducted with edge-linked inference models (which do not induce terraces), unlinked models have been used and advocated. The present study describes the potential impact of that inference assumption on phylogenetic inference in the context of the kinds of multigene data sets now widely assembled for large-scale tree construction.
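The abstract defines taxon coverage density as the proportion of taxon-by-gene positions with any data present. A minimal sketch of that calculation is given here, assuming the data set is represented as a taxon-by-gene presence matrix of booleans; the function name and the input layout are illustrative assumptions, not part of the study's software.

```python
def taxon_coverage_density(presence):
    """Proportion of taxon-by-gene cells that contain any data.

    `presence` is a list of rows (one per taxon), each a list of
    booleans (one per gene), True where the taxon has sequence
    data for that gene. This layout is an assumed representation.
    """
    total_cells = sum(len(row) for row in presence)
    if total_cells == 0:
        raise ValueError("presence matrix has no cells")
    filled_cells = sum(1 for row in presence for cell in row if cell)
    return filled_cells / total_cells
```

Under the thresholds reported in the abstract, a value below 0.90 from such a calculation would place a data set in the range where terraces were found in nearly all cases.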
Affiliation(s)
- Barbara H. Dobrin
  - Department of Ecology and Evolutionary Biology, University of Arizona, 1041 E. Lowell St, Tucson, AZ 85721 USA
- Derrick J. Zwickl
  - Department of Ecology and Evolutionary Biology, University of Arizona, 1041 E. Lowell St, Tucson, AZ 85721 USA
- Michael J. Sanderson
  - Department of Ecology and Evolutionary Biology, University of Arizona, 1041 E. Lowell St, Tucson, AZ 85721 USA
10
Leger MM, Eme L, Stairs CW, Roger AJ. Demystifying Eukaryote Lateral Gene Transfer (Response to Martin 2017 DOI: 10.1002/bies.201700115). Bioessays 2018; 40:e1700242. [DOI: 10.1002/bies.201700242] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 12/13/2017] [Revised: 02/06/2018] [Indexed: 12/28/2022]
Affiliation(s)
- Michelle M. Leger
  - Institute of Evolutionary Biology (CSIC-UPF); Pg. Marítim de la Barceloneta, Barcelona ES 08003 Spain
- Laura Eme
  - Department of Cell and Molecular Biology; Science for Life Laboratory; Uppsala University; Box 596, Uppsala SE 751 25 Sweden
- Courtney W. Stairs
  - Department of Cell and Molecular Biology; Science for Life Laboratory; Uppsala University; Box 596, Uppsala SE 751 25 Sweden
- Andrew J. Roger
  - Centre for Comparative Genomics and Evolutionary Bioinformatics; Department of Biochemistry and Molecular Biology; Dalhousie University; P.O. Box 15000, Halifax CAN B3H 4R2 Nova Scotia Canada
11
Wang B, Zhang Y, Wei P, Sun M, Ma X, Zhu X. Identification of nuclear low-copy genes and their phylogenetic utility in rosids. Genome 2015; 57:547-54. [PMID: 25761707 DOI: 10.1139/gen-2014-0138] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022]
Abstract
The interordinal relationships in rosids remain poorly resolved. Previous studies based on chloroplast, mitochondrial, and nuclear DNA have produced conflicting phylogenetic resolutions, which has become a widely recognized problem in recent phylogenetic studies. Here, a total of 96 single-copy nuclear gene loci were identified from the KOG (eukaryotic orthologous groups) database, most of which were used for phylogenetic analysis of angiosperms for the first time. The orthologous sequence datasets from completely sequenced genomes of rosids were assembled to resolve the position of the COM (Celastrales-Oxalidales-Malpighiales) clade in rosids. Our analysis revealed strong and consistent support for the CM topology (the COM clade as sister to the malvids). Our results will contribute to further exploring the underlying cause of conflict between chloroplast, mitochondrial, and nuclear data. In addition, our study identified a few novel nuclear molecular markers with the potential to investigate the deep phylogenetic relationships of plants or other eukaryotic taxonomic groups.
Affiliation(s)
- Baohua Wang
  - School of Life Sciences, Nantong University, Nantong 226019, China
12
The influence of taxon sampling on Bayesian divergence time inference under scenarios of rate heterogeneity among lineages. J Theor Biol 2014; 364:31-9. [PMID: 25218869 DOI: 10.1016/j.jtbi.2014.09.004] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 08/22/2013] [Revised: 08/05/2014] [Accepted: 09/02/2014] [Indexed: 11/20/2022]
Abstract
Although taxon sampling is commonly considered an important issue in phylogenetic inference, it is rarely considered in the Bayesian estimation of divergence times. In fact, the studies conducted to date have presented ambiguous results, and the relevance of taxon sampling for molecular dating remains unclear. In this study, we developed a series of simulations that, after six hundred Bayesian molecular dating analyses, allowed us to evaluate the impact of taxon sampling on chronological estimates under three scenarios of among-lineage rate heterogeneity. The first scenario allowed us to examine the influence of the number of terminals on the age estimates based on a strict molecular clock. The second scenario imposed an extreme example of lineage-specific rate variation, and the third scenario permitted extensive rate variation distributed along the branches. We also analyzed empirical data on selected mitochondrial genomes of mammals. Our results showed that in the strict molecular-clock scenario (Case I), taxon sampling had a minor impact on the accuracy of the time estimates, although the precision of the estimates was greater with an increased number of terminals. The effect was similar in the scenario (Case III) based on rate variation distributed among the branches. Only under intensive rate variation among lineages (Case II) did taxon sampling result in biased estimates. The results of an empirical analysis corroborated the simulation findings. We demonstrate that taxon sampling affected divergence time inference but that its impact was significant only if the rates deviated from those derived for the strict molecular clock. Increased taxon sampling improved the precision and accuracy of the divergence time estimates, but the impact on precision was more pronounced. On average, biased estimates were obtained only if lineage rate variation was pronounced.
13
Lanfear R, Calcott B, Kainer D, Mayer C, Stamatakis A. Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evol Biol 2014. [PMID: 24742000 DOI: 10.1186/1472-2148-14-82] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 04/08/2023]
Abstract
BACKGROUND Partitioning involves estimating independent models of molecular evolution for different subsets of sites in a sequence alignment, and has been shown to improve phylogenetic inference. Current methods for estimating best-fit partitioning schemes, however, are only computationally feasible with datasets of fewer than 100 loci. This is a problem because datasets with thousands of loci are increasingly common in phylogenetics. METHODS We develop two novel methods for estimating best-fit partitioning schemes on large phylogenomic datasets: strict and relaxed hierarchical clustering. These methods use information from the underlying data to cluster together similar subsets of sites in an alignment, and build on clustering approaches that have been proposed elsewhere. RESULTS We compare the performance of our methods to each other, and to existing methods for selecting partitioning schemes. We demonstrate that while strict hierarchical clustering has the best computational efficiency on very large datasets, relaxed hierarchical clustering provides scalable efficiency and returns dramatically better partitioning schemes as assessed by common criteria such as AICc and BIC scores. CONCLUSIONS These two methods provide the best current approaches to inferring partitioning schemes for very large datasets. We provide free open-source implementations of the methods in the PartitionFinder software. We hope that the use of these methods will help to improve the inferences made from large phylogenomic datasets.
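The abstract assesses partitioning schemes by criteria such as AICc and BIC. As a reminder of what those scores are, here is a minimal sketch using the standard textbook formulas. Treating the sample size `n` as the number of alignment sites is a common convention but an assumption here, and the function names are illustrative rather than taken from PartitionFinder.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction 2k(k+1) / (n - k - 1).

    `k` is the number of free parameters; `n` is the sample size
    (assumed here to be the number of alignment sites).
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood
```

For all three criteria a lower score indicates a preferred scheme, so candidate partitioning schemes can be ranked by computing the score for each and taking the minimum.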
Affiliation(s)
- Robert Lanfear
  - Ecology Evolution and Genetics, Research School of Biology, Australian National University, Canberra, ACT, Australia.
14
Lanfear R, Calcott B, Kainer D, Mayer C, Stamatakis A. Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evol Biol 2014; 14:82. [PMID: 24742000 PMCID: PMC4012149 DOI: 10.1186/1471-2148-14-82] [Citation(s) in RCA: 426] [Impact Index Per Article: 42.6] [Received: 11/19/2013] [Accepted: 04/03/2014] [Indexed: 11/10/2022]
Affiliation(s)
- Robert Lanfear
- Ecology Evolution and Genetics, Research School of Biology, Australian National University, Canberra, ACT, Australia.
|
15
|
Lanfear R, Calcott B, Kainer D, Mayer C, Stamatakis A. Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evol Biol 2014. [PMID: 24742000 DOI: 10.6084/m9.figshare.938920] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 04/08/2023]
|
16
|
Bernt M, Braband A, Middendorf M, Misof B, Rota-Stabelli O, Stadler PF. Bioinformatics methods for the comparative analysis of metazoan mitochondrial genome sequences. Mol Phylogenet Evol 2013; 69:320-7. [DOI: 10.1016/j.ympev.2012.09.019] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 05/10/2012] [Revised: 08/31/2012] [Accepted: 09/17/2012] [Indexed: 01/25/2023]
|
17
|
Fabre PH, Jønsson KA, Douzery EJP. Jumping and gliding rodents: mitogenomic affinities of Pedetidae and Anomaluridae deduced from an RNA-Seq approach. Gene 2013; 531:388-97. [PMID: 23973722 DOI: 10.1016/j.gene.2013.07.059] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 05/15/2013] [Revised: 07/10/2013] [Accepted: 07/16/2013] [Indexed: 10/26/2022]
Abstract
An RNA-Seq strategy was used to obtain the complete set of protein-coding mitochondrial genes from two rodent taxa. Thanks to the next generation sequencing (NGS) 454 approach, we determined the complete mitochondrial DNA genome from Graphiurus kelleni (Mammalia: Rodentia: Gliridae) and partial mitogenome from Pedetes capensis (Pedetidae), and compared them with published rodent and outgroup mitogenomes. We finished the mitogenome sequencing by a series of amplicons using conserved PCR primers to fill the gaps corresponding to tRNA, rRNA and control regions. Phylogenetic analyses of the mitogenomes suggest a well-supported rodent phylogeny in agreement with nuclear gene trees. Pedetes groups with Anomalurus into the clade Anomaluromorpha, while Graphiurus branches within the squirrel-related clade. Moreover, Pedetes+Anomalurus branch with Castor into the mouse-related clade. Our study demonstrates the utility of NGS for obtaining new mitochondrial genomes as well as the importance of choosing adequate models of sequence evolution to infer the phylogeny of rodents.
Affiliation(s)
- Pierre-Henri Fabre
- Institut des Sciences de l'Evolution (ISEM, UMR 5554 UM2-CNRS-IRD), Université Montpellier II, Place Eugène Bataillon - CC 064 - 34095 Montpellier Cedex 5, France; Center for Macroecology Evolution and Climate at the Natural History Museum of Denmark, University of Copenhagen, Universitetsparken, 15, DK-2100 Copenhagen Ø, Denmark
|
18
|
O’Connor TD, Mundy NI. Evolutionary Modeling of Genotype-Phenotype Associations, and Application to Primate Coding and Non-coding mtDNA Rate Variation. Evol Bioinform Online 2013; 9:301-16. [PMID: 23926418 PMCID: PMC3733722 DOI: 10.4137/ebo.s11600] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/29/2022]
Abstract
Variation in substitution rates across a phylogeny can be indicative of shifts in the evolutionary dynamics of protein-coding or non-coding regions. One way to understand these signals is to seek the phenotypic correlates of rate variation. Here, we extended a previously published likelihood method designed to detect evolutionary associations between genotypic evolutionary rate and phenotype over a phylogeny. In simulation with two discrete categories of phenotype, the method has a low false-positive rate and detects more than 80% of true positives with a tree length of three or greater and a three-fold or greater change in substitution rate given the phenotype. In addition, we successfully extended the test from two to four phenotype categories and evaluated its performance. We then applied the method to two major hypotheses for rate variation in the mitochondrial genome of primates (longevity and generation time), as well as to body mass, which is correlated with many aspects of life history, using three categories of phenotype obtained by discretizing continuous values. Similar to previous results for mammals, we find that the majority of mitochondrial protein-coding genes show associations consistent with the longevity and body mass predictions and that the predominant signal of association comes from the third codon position. We also found a significant association between maximum lifespan and the evolutionary rate of the control region of the mtDNA. In contrast, 24 protein-coding genes from the nuclear genome do not show a consistent pattern of association, which is inconsistent with the generation time hypothesis. These results show that the extended method can robustly identify genotype-phenotype associations up to at least four phenotypic categories, and demonstrate the successful application of the method to study factors affecting neutral evolutionary rate in protein-coding and non-coding loci.
Affiliation(s)
- Timothy D. O’Connor
- Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA
- Nicholas I. Mundy
- Department of Zoology, Downing Street, University of Cambridge, Cambridge CB2 3EJ, UK
|
19
|
Smith JV, Braun EL, Kimball RT. Ratite nonmonophyly: independent evidence from 40 novel Loci. Syst Biol 2012; 62:35-49. [PMID: 22831877 DOI: 10.1093/sysbio/sys067] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.8] [Indexed: 02/04/2023]
Abstract
Large-scale multilocus studies have become common in molecular phylogenetics, but the best way to interpret these studies when their results strongly conflict with prior information about phylogeny remains unclear. An example of such a conflict is provided by the ratites (the large flightless birds of southern land masses, including ostriches, emus, and rheas). Ratite monophyly is strongly supported by both morphological data and many earlier molecular studies and is used as a textbook example of vicariance biogeography. However, recent studies have indicated that ratites are not monophyletic; instead, the volant tinamous nest inside the ratites rather than forming their sister group within the avian superorder Palaeognathae. Large-scale studies can exhibit biases that reflect a number of factors, including limitations in the fit of the evolutionary models used for analyses and problems with sequence alignment, so the unexpected conclusion that ratites are not monophyletic needs to be rigorously evaluated. A rigorous approach to testing novel hypotheses generated by large-scale studies is to collect independent evidence (i.e., excluding the loci and/or traits used to generate the hypotheses). We used 40 nuclear loci not used in previous studies that investigated the relationship among ratites and tinamous. Our results strongly support the recent molecular studies, revealing that the deepest branch within Palaeognathae separates the ostrich from other members of the clade, rather than the traditional hypothesis that separates the tinamous from the ratites. To ensure these results reflected evolutionary history, we examined potential biases in types of loci used, heterotachy, alignment biases, and discordance between gene trees and the species tree. All analyses consistently supported nonmonophyly of the ratites and no confounding biases were identified. 
This confirmation that ratites are not monophyletic using independent evidence will hopefully stimulate further comparative research on paleognath development and genetics that might reveal the basis of the morphological convergence in these large, flightless birds.
Affiliation(s)
- Jordan V Smith
- Department of Biology, University of Florida, P.O. Box 118525, Gainesville, FL 32611, USA
|
20
|
Wu CS, Wang YN, Hsu CY, Lin CP, Chaw SM. Loss of different inverted repeat copies from the chloroplast genomes of Pinaceae and cupressophytes and influence of heterotachy on the evaluation of gymnosperm phylogeny. Genome Biol Evol 2011; 3:1284-95. [PMID: 21933779 PMCID: PMC3219958 DOI: 10.1093/gbe/evr095] [Citation(s) in RCA: 107] [Impact Index Per Article: 8.2] [Accepted: 09/12/2011] [Indexed: 12/13/2022]
Abstract
The relationships among the extant five gymnosperm groups--gnetophytes, Pinaceae, non-Pinaceae conifers (cupressophytes), Ginkgo, and cycads--remain equivocal. To clarify this issue, we sequenced the chloroplast genomes (cpDNAs) from two cupressophytes, Cephalotaxus wilsoniana and Taiwania cryptomerioides, and 53 common chloroplast protein-coding genes from another three cupressophytes, Agathis dammara, Nageia nagi, and Sciadopitys verticillata, and a non-Cycadaceae cycad, Bowenia serrulata. Comparative analyses of 11 conifer cpDNAs revealed that Pinaceae and cupressophytes each lost a different copy of inverted repeats (IRs), which contrasts with the view that the same IR has been lost in all conifers. Based on our structural finding, the character of an IR loss no longer conflicts with the "gnepines" hypothesis (gnetophytes sister to Pinaceae). Chloroplast phylogenomic analyses of amino acid sequences recovered incongruent topologies using different tree-building methods; however, we demonstrated that high heterotachous genes (genes that have highly different rates in different lineages) contributed to the long-branch attraction (LBA) artifact, resulting in incongruence of phylogenomic estimates. Additionally, amino acid compositions appear more heterogeneous in high than low heterotachous genes among the five gymnosperm groups. Removal of high heterotachous genes alleviated the LBA artifact and yielded congruent and robust tree topologies in which gnetophytes and Pinaceae formed a sister clade to cupressophytes (the gnepines hypothesis) and Ginkgo clustered with cycads. Adding more cupressophyte taxa could not improve the accuracy of chloroplast phylogenomics for the five gymnosperm groups. In contrast, removal of high heterotachous genes from data sets is simple and can increase confidence in evaluating the phylogeny of gymnosperms.
Affiliation(s)
- Chung-Shien Wu
- Biodiversity Research Center, Academia Sinica, Taipei, Taiwan
- Ya-Nan Wang
- School of Forestry and Resource Conservation, National Taiwan University, Taipei, Taiwan
- Chi-Yao Hsu
- Biodiversity Research Center, Academia Sinica, Taipei, Taiwan
- Ching-Ping Lin
- Biodiversity Research Center, Academia Sinica, Taipei, Taiwan
- Shu-Miaw Chaw
- Biodiversity Research Center, Academia Sinica, Taipei, Taiwan
|
21
|
Ekman S, Blaalid R. The Devil in the Details: Interactions between the Branch-Length Prior and Likelihood Model Affect Node Support and Branch Lengths in the Phylogeny of the Psoraceae. Syst Biol 2011; 60:541-61. [DOI: 10.1093/sysbio/syr022] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Indexed: 11/13/2022]
Affiliation(s)
- Stefan Ekman
- Museum of Evolution, Uppsala University, Norbyvägen 16, SE-752 36 Uppsala, Sweden
- Department of Biology, University of Bergen, PO Box 7800, N-5020 Bergen, Norway
- Rakel Blaalid
- Department of Biology, University of Bergen, PO Box 7800, N-5020 Bergen, Norway
- Department of Biology, University of Oslo, PO Box 1066 Blindern, N-0316 Oslo, Norway
|
22
|
Struck TH, Paul C, Hill N, Hartmann S, Hösel C, Kube M, Lieb B, Meyer A, Tiedemann R, Purschke G, Bleidorn C. Phylogenomic analyses unravel annelid evolution. Nature 2011; 471:95-8. [DOI: 10.1038/nature09864] [Citation(s) in RCA: 292] [Impact Index Per Article: 22.5] [Received: 09/02/2010] [Accepted: 01/18/2011] [Indexed: 11/09/2022]
|
23
|
Wang HC, Susko E, Roger AJ. Fast statistical tests for detecting heterotachy in protein evolution. Mol Biol Evol 2011; 28:2305-15. [PMID: 21343603 DOI: 10.1093/molbev/msr050] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 01/20/2023]
Abstract
The w statistic introduced by Lockhart et al. (1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol Biol Evol. 15:1183-1188) is a simple and easily calculated statistic intended to detect heterotachy by comparing amino acid substitution patterns between two monophyletic groups of protein sequences. It is defined as the difference between the fraction of varied sites in both groups and the fraction of varied sites in each group. The w test has been used to distinguish a covarion process from equal rates and rates variation across sites processes. Using simulation we show that the w test is effective for small data sets and for data sets that have low substitution rates in the groups but can have difficulties when these conditions are not met. Using site entropy as a measure of variability of a sequence site, we modify the w statistic to a w' statistic by assigning as varied in one group those sites that are actually varied in both groups but have a large entropy difference. We show that the w' test has more power to detect two kinds of heterotachy processes (covarion and bivariate rate shifts) in large and variable data. We also show that a test of Pearson's correlation of the site entropies between two monophyletic groups can be used to detect heterotachy and has more power than the w' test. Furthermore, we demonstrate that there are settings where the correlation test as well as w and w' tests do not detect heterotachy signals in data simulated under a branch length mixture model. In such cases, it is sometimes possible to detect heterotachy through subselection of appropriate taxa. Finally, we discuss the abilities of the three statistical tests to detect a fourth mode of heterotachy: lineage-specific changes in proportion of variable sites.
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada.
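The entropy-correlation statistic mentioned in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the set of symbols treated as missing data, and the toy alignments are all hypothetical, and only the correlation coefficient is computed because the abstract does not say how significance is assessed.

```python
import math
from collections import Counter

MISSING = set("-?X")  # assumption: symbols ignored when computing entropy


def site_entropy(column):
    """Shannon entropy (in nats) of the residues in one alignment column."""
    residues = [c for c in column if c not in MISSING]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum((k / n) * math.log(k / n) for k in Counter(residues).values())


def entropy_profile(seqs):
    """Per-site entropies for a group of equal-length aligned sequences."""
    return [site_entropy(col) for col in zip(*seqs)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant entropy profile")
    return sxy / math.sqrt(sxx * syy)


def entropy_correlation(group_a, group_b):
    """Correlation of site entropies between two groups of aligned sequences."""
    return pearson(entropy_profile(group_a), entropy_profile(group_b))


# Toy example: each group varies at a different site, so the profiles disagree.
group_a = ["AAA", "CAA"]
group_b = ["AAA", "ACA"]
r = entropy_correlation(group_a, group_b)
```

A low or negative coefficient means that the sites that vary in one group tend to be conserved in the other, which is the pattern the abstract associates with heterotachy.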
|
24
|
Gotzek D, Clarke J, Shoemaker D. Mitochondrial genome evolution in fire ants (Hymenoptera: Formicidae). BMC Evol Biol 2010; 10:300. [PMID: 20929580 PMCID: PMC2958920 DOI: 10.1186/1471-2148-10-300] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.6] [Received: 10/29/2009] [Accepted: 10/07/2010] [Indexed: 01/02/2023]
Abstract
Background: Complete mitochondrial genome sequences have become important tools for the study of genome architecture, phylogeny, and molecular evolution. Despite the rapid increase in available mitogenomes, the taxonomic sampling often poorly reflects phylogenetic diversity and is often also biased to represent deeper (family-level) evolutionary relationships. Results: We present the first fully sequenced ant (Hymenoptera: Formicidae) mitochondrial genomes. We sampled four mitogenomes from three species of fire ants, genus Solenopsis, which represent various evolutionary depths. Overall, ant mitogenomes appear to be typical of hymenopteran mitogenomes, displaying a general A+T-bias. The Solenopsis mitogenomes are slightly more compact than other hymenopteran mitogenomes (~15.5 kb), retaining all protein coding genes, ribosomal, and transfer RNAs. We also present evidence of recombination between the two conspecific Solenopsis mitogenomes. Finally, we discuss potential ways to improve the estimation of phylogenies using complete mitochondrial genome sequences. Conclusions: The ant mitogenome presents an important addition to the continued efforts in studying hymenopteran mitogenome architecture, evolution, and phylogenetics. We provide further evidence that the sampling across many taxonomic levels (including conspecifics and congeners) is useful and important to gain detailed insights into mitogenome evolution. We also discuss ways that may help improve the use of mitogenomes in phylogenetic analyses by accounting for non-stationary and non-homogeneous evolution among branches.
Affiliation(s)
- Dietrich Gotzek
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland.
|
25
|
Schwartz RS, Mueller RL. Limited effects of among-lineage rate variation on the phylogenetic performance of molecular markers. Mol Phylogenet Evol 2010; 54:849-56. [PMID: 20045073 DOI: 10.1016/j.ympev.2009.12.025] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/24/2009] [Revised: 12/03/2009] [Accepted: 12/24/2009] [Indexed: 10/20/2022]
Abstract
Variation in substitution rates among evolutionary lineages (among-lineage rate variation or ALRV) has been reported to negatively affect the estimation of phylogenies. When the substitution processes underlying ALRV are modeled inadequately, non-sister taxa with similar substitution rates are estimated incorrectly as sister species due to long-branch attraction. Recent advances in modeling site-specific rate variation (heterotachy) have reduced the impacts of ALRV on phylogeny estimation in several empirical and simulated datasets. However, the addition of parameters to the substitution model reduces power to estimate each parameter correctly, which can also lead to incorrect phylogeny estimation. A potential solution to this problem is to identify the levels of ALRV that negatively impact phylogeny estimation such that molecular markers with non-deleterious levels of ALRV can be identified. To this end, we used analyses of empirical and simulated gene datasets to evaluate whether levels of ALRV identified in a mitochondrial genomic dataset for salamanders negatively impacted phylogeny estimation. We simulated data with and without ALRV, holding all other evolutionary parameters constant, and compared the phylogenetic performance of both simulated and empirical datasets. Overall, we found limited, positive effects of ALRV on phylogeny estimation in this dataset, the majority of which resulted from an increase in substitution rate on short branches. We conclude that ALRV does not always negatively impact phylogeny estimation. Therefore, ALRV can likely be disregarded as a criterion for marker selection in comparable phylogenetic studies.
Affiliation(s)
- Rachel S Schwartz
- Department of Biology, Colorado State University, Fort Collins, CO 80523-1878, USA.
|
26
|
Zhou Y, Brinkmann H, Rodrigue N, Lartillot N, Philippe H. A Dirichlet Process Covarion Mixture Model and Its Assessments Using Posterior Predictive Discrepancy Tests. Mol Biol Evol 2009; 27:371-84. [DOI: 10.1093/molbev/msp248] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Indexed: 11/13/2022]
|
27
|
Wang HC, Susko E, Roger AJ. PROCOV: maximum likelihood estimation of protein phylogeny under covarion models and site-specific covarion pattern analysis. BMC Evol Biol 2009; 9:225. [PMID: 19737395 PMCID: PMC2758850 DOI: 10.1186/1471-2148-9-225] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 03/24/2009] [Accepted: 09/08/2009] [Indexed: 11/12/2022]
Abstract
Background: The covarion hypothesis of molecular evolution holds that selective pressures on a given amino acid or nucleotide site are dependent on the identity of other sites in the molecule that change throughout time, resulting in changes of evolutionary rates of sites along the branches of a phylogenetic tree. At the sequence level, covarion-like evolution at a site manifests as conservation of nucleotide or amino acid states among some homologs where the states are not conserved in other homologs (or groups of homologs). Covarion-like evolution has been shown to relate to changes in functions at sites in different clades, and, if ignored, can adversely affect the accuracy of phylogenetic inference. Results: PROCOV (protein covarion analysis) is a software tool that implements a number of previously proposed covarion models of protein evolution for phylogenetic inference in a maximum likelihood framework. Several algorithmic and implementation improvements in this tool over previous versions make computationally expensive tree searches with covarion models more efficient and analyses of large phylogenomic data sets tractable. PROCOV can be used to identify covarion sites by comparing the site likelihoods under the covarion process to the corresponding site likelihoods under a rates-across-sites (RAS) process. Those sites with the greatest log-likelihood difference between a 'covarion' and an RAS process were found to be of functional or structural significance in a dataset of bacterial and eukaryotic elongation factors. Conclusion: Covarion models implemented in PROCOV may be especially useful for phylogenetic estimation when ancient divergences between sequences have occurred and rates of evolution at sites are likely to have changed over the tree. It can also be used to study lineage-specific functional shifts in protein families that result in changes in the patterns of site variability among subtrees.
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada.
|
28
|
O'Connor TD, Mundy NI. Genotype-phenotype associations: substitution models to detect evolutionary associations between phenotypic variables and genotypic evolutionary rate. Bioinformatics 2009; 25:i94-100. [PMID: 19478022 PMCID: PMC2687985 DOI: 10.1093/bioinformatics/btp231] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Indexed: 11/22/2022]
Abstract
Motivation: Mapping between genotype and phenotype is one of the primary goals of evolutionary genetics but one that has received little attention at the interspecies level. Recent developments in phylogenetics and statistical modelling have typically been used to examine molecular and phenotypic evolution separately. We have used this background to develop phylogenetic substitution models to test for associations between evolutionary rate of genotype and phenotype. We do this by creating hybrid rate matrices between genotype and phenotype. Results: Simulation results show our models to be accurate in detecting genotype–phenotype associations and robust to various factors that typically affect maximum likelihood methods, such as number of taxa, level of relevant signal, proportion of sites affected and length of evolutionary divergence. Further, simulations show that our method is robust to homogeneity assumptions. We apply the models to datasets of male reproductive system genes in relation to mating systems of primates. We show that evolution of semenogelin II is significantly associated with mating systems whereas two negative control genes (cytochrome b and peptidase inhibitor 3) show no significant association. This provides the first hybrid substitution model of which we are aware to directly test the association between genotype and phenotype using a phylogenetic framework. Availability: Perl and HYPHY scripts are available upon request from the authors. Contact: to252@cam.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
|
29
|
Pagel M, Meade A. Modelling heterotachy in phylogenetic inference by reversible-jump Markov chain Monte Carlo. Philos Trans R Soc Lond B Biol Sci 2009; 363:3955-64. [PMID: 18852097 DOI: 10.1098/rstb.2008.0178] [Citation(s) in RCA: 74] [Impact Index Per Article: 4.9] [Indexed: 11/12/2022]
Abstract
The rate at which a given site in a gene sequence alignment evolves over time may vary. This phenomenon--known as heterotachy--can bias or distort phylogenetic trees inferred from models of sequence evolution that assume rates of evolution are constant. Here, we describe a phylogenetic mixture model designed to accommodate heterotachy. The method sums the likelihood of the data at each site over more than one set of branch lengths on the same tree topology. A branch-length set that is best for one site may differ from the branch-length set that is best for some other site, thereby allowing different sites to have different rates of change throughout the tree. Because rate variation may not be present in all branches, we use a reversible-jump Markov chain Monte Carlo algorithm to identify those branches in which reliable amounts of heterotachy occur. We implement the method in combination with our 'pattern-heterogeneity' mixture model, applying it to simulated data and five published datasets. We find that complex evolutionary signals of heterotachy are routinely present over and above variation in the rate or pattern of evolution across sites, that the reversible-jump method requires far fewer parameters than conventional mixture models to describe it, and serves to identify the regions of the tree in which heterotachy is most pronounced. The reversible-jump procedure also removes the need for a posteriori tests of 'significance' such as the Akaike or Bayesian information criterion tests, or Bayes factors. Heterotachy has important consequences for the correct reconstruction of phylogenies as well as for tests of hypotheses that rely on accurate branch-length information. These include molecular clocks, analyses of tempo and mode of evolution, comparative studies and ancestral state reconstruction. The model is available from the authors' website, and can be used for the analysis of both nucleotide and morphological data.
Affiliation(s)
- Mark Pagel
- School of Biological Sciences, University of Reading, Lyle Building, Whiteknights, Reading RG6 6AJ, UK.
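The abstract above describes a likelihood that is summed, at each site, over several branch-length sets on one topology. One way to write this down is the following sketch. The notation is an assumption made here for illustration and does not come from the paper: $D_i$ is the data at site $i$ of $N$ sites, $T$ the shared topology, $\mathbf{b}_k$ the $k$-th of $K$ branch-length sets, $\theta$ the substitution-model parameters, and $w_k$ a mixture weight. The abstract does not state how the components are weighted, so the weights in particular are hypothetical.

```latex
L(D \mid T, \theta)
  = \prod_{i=1}^{N} \sum_{k=1}^{K} w_k \, P\!\left(D_i \mid T, \mathbf{b}_k, \theta\right),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1 .
```

With $K = 1$ this reduces to the usual single branch-length-set likelihood, which is the case the abstract contrasts against.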
|
30
|
Altaba CR. Universal artifacts affect the branching of phylogenetic trees, not universal scaling laws. PLoS One 2009; 4:e4611. [PMID: 19242549 PMCID: PMC2644784 DOI: 10.1371/journal.pone.0004611] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 09/17/2008] [Accepted: 01/21/2009] [Indexed: 11/29/2022]
Abstract
BACKGROUND: The superficial resemblance of phylogenetic trees to other branching structures allows searching for macroevolutionary patterns. However, such trees are just statistical inferences of particular historical events. Recent meta-analyses report finding regularities in the branching pattern of phylogenetic trees. But is this supported by evidence, or are such regularities just methodological artifacts? If so, is there any signal in a phylogeny? METHODOLOGY: In order to evaluate the impact of polytomies and imbalance on tree shape, the distribution of all binary and polytomic trees of up to 7 taxa was assessed in tree-shape space. The relationship between the proportion of outgroups and the amount of imbalance introduced with them was assessed applying four different tree-building methods to 100 combinations from a set of 10 ingroup and 9 outgroup species, and performing covariance analyses. The relevance of this analysis was explored taking 61 published phylogenies, based on nucleic acid sequences and involving various taxa, taxonomic levels, and tree-building methods. PRINCIPAL FINDINGS: All methods of phylogenetic inference are quite sensitive to the artifacts introduced by outgroups. However, published phylogenies appear to be subject to a rather effective, albeit rather intuitive control against such artifacts. The data and methods used to build phylogenetic trees are varied, so any meta-analysis is subject to pitfalls due to their uneven intrinsic merits, which translate into artifacts in tree shape. The binary branching pattern is an imposition of methods, and seldom reflects true relationships in intraspecific analyses, yielding artifactual polytomies in short trees. 
Above the species level, the departure of real trees from simplistic random models is caused at least by two natural factors--uneven speciation and extinction rates; and artifacts such as choice of taxa included in the analysis, and imbalance introduced by outgroups and basal paraphyletic taxa. This artifactual imbalance accounts for tree shape convergence of large trees. SIGNIFICANCE: There is no evidence for any universal scaling in the tree of life. Instead, there is a need for improved methods of tree analysis that can be used to discriminate the noise due to outgroups from the phylogenetic signal within the taxon of interest, and to evaluate realistic models of evolution, correcting the retrospective perspective and explicitly recognizing extinction as a driving force. Artifacts are pervasive, and can only be overcome through understanding the structure and biological meaning of phylogenetic trees. Catalan Abstract in Translation S1.
Affiliation(s)
- Cristian R Altaba
- Laboratory of Human Systematics, University of the Balearic Islands, Balearic Islands, Spain.
31
Park JM, Manen JF, Colwell AE, Schneeweiss GM. A plastid gene phylogeny of the non-photosynthetic parasitic Orobanche (Orobanchaceae) and related genera. J Plant Res 2008; 121:365-76. [PMID: 18483784 DOI: 10.1007/s10265-008-0169-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 11/09/2007] [Accepted: 04/18/2008] [Indexed: 05/17/2023]
Abstract
The phylogenetic relationships of the non-photosynthetic Orobanche sensu lato (Orobanchaceae), which includes some of the economically most important parasitic weeds, remain insufficiently understood and controversial. This concerns both the phylogenetic relationships within the genus, in particular its monophyly or lack thereof, and the relationships to other holoparasitic genera such as Cistanche or Conopholis. Here we present the first comprehensive phylogenetic study of this group based on a region from the plastid genome (rps2 gene). Although substitution rates appear to be elevated compared to the photosynthetic members of Orobanchaceae, relationships among the major lineages Cistanche, Conopholis plus Epifagus, Boschniakia rossica (Cham. & Schltdl.) B. Fedtsch., B. himalaica Hook. f. & Thomson, B. hookeri Walp. plus B. strobilacea A. Gray, and Orobanche s. l. remain unresolved. Resolution within Orobanche, however, is much better. In agreement with morphological, cytological and other molecular phylogenetic evidence, five lineages, corresponding to the four traditionally recognised sections (Gymnocaulis, Myzorrhiza, Orobanche, Trionychon) and O. latisquama Reut. ex Boiss. (of sect. Orobanche), can be distinguished. A combined analysis of plastid rps2 and nuclear ITS sequences of the holoparasitic genera results in more resolved and better supported trees, although the relationships among Orobanche s. l., Cistanche, and the clade including the remaining genera remain unresolved. Therefore, rps2 is a marker from the plastid genome that is well-suited to be used in combination with other already established nuclear markers for resolving generic relationships of Orobanche and related genera.
Affiliation(s)
- Jeong-Mi Park
- Department of Systematic and Evolutionary Botany, University of Vienna, Rennweg 14, 1030 Vienna, Austria
32
Kolaczkowski B, Thornton JW. A mixed branch length model of heterotachy improves phylogenetic accuracy. Mol Biol Evol 2008; 25:1054-66. [PMID: 18319244 PMCID: PMC3299401 DOI: 10.1093/molbev/msn042] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.6] [Accepted: 02/04/2008] [Indexed: 11/14/2022]
Abstract
Evolutionary relationships are typically inferred from molecular sequence data using a statistical model of the evolutionary process. When the model accurately reflects the underlying process, probabilistic phylogenetic methods recover the correct relationships with high accuracy. There is ample evidence, however, that models commonly used today do not adequately reflect real-world evolutionary dynamics. Virtually all contemporary models assume that relatively fast-evolving sites are fast across the entire tree, whereas slower sites always evolve at relatively slower rates. Many molecular sequences, however, exhibit site-specific changes in evolutionary rates, called "heterotachy." Here we examine the accuracy of 2 phylogenetic methods for incorporating heterotachy, the mixed branch length model--which incorporates site-specific rate changes by summing likelihoods over multiple sets of branch lengths on the same tree--and the covarion model, which uses a hidden Markov process to allow sites to switch between variable and invariable as they evolve. Under a variety of simple heterogeneous simulation conditions, the mixed model was dramatically more accurate than homotachous models, which were subject to topological biases as well as biases in branch length estimates. When data were simulated with strong versions of the types of heterotachy observed in real molecular sequences, the mixed branch length model was more accurate than homotachous techniques. Analyses of empirical data sets confirmed that the mixed branch length model can improve phylogenetic accuracy under conditions that cause homotachous models to fail. In contrast, the covarion model did not improve phylogenetic accuracy compared with homotachous models and was sometimes substantially less accurate. We conclude that a mixed branch length approach, although not the solution to all phylogenetic errors, is a valuable strategy for improving the accuracy of inferred trees.
33
Gruenheit N, Lockhart PJ, Steel M, Martin W. Difficulties in testing for covarion-like properties of sequences under the confounding influence of changing proportions of variable sites. Mol Biol Evol 2008; 25:1512-20. [PMID: 18424773 DOI: 10.1093/molbev/msn098] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Indexed: 11/12/2022]
Abstract
The covarion (COV)-like properties of sequences are poorly described and their impact on phylogenetic analyses poorly understood. We demonstrate using simulations that, under an evolutionary model where the proportion of variable sites changes in nonadjacent lineages, log likelihood values for rates-across-sites (RAS) and COV models become similar, making the models difficult to distinguish. Further, although COV and RAS models provide a great improvement in likelihood scores over a homogeneous model with these simulated data, reconstruction accuracy of tree building is low, suggesting caution when it is suspected that proportions of variable sites differ in different evolutionary lineages. We study the performance of a recently developed contingency test that detects the presence of COV-type evolution, modified for protein data. We report that if proportions of variable sites (p(var)) change in a lineage-specific manner such that their distributions in different lineages become sufficiently nonoverlapping, then the contingency test can incorrectly suggest a homogeneous model. Also of concern is the possibility of different proportions of variable sites between the groups being studied. In a study of chloroplast proteins, interpretation of the test is found to be susceptible to different partitioning of taxon groups, making the test very subjective in its implementation. Extreme intergroup differences in the extent of divergence and difference in proportions of variable sites could be contributing to this effect.
Affiliation(s)
- Nicole Gruenheit
- Institute of Botany III, University of Düsseldorf, Düsseldorf, Germany.