1
|
Patané JSL, Martins J, Setubal JC. A Guide to Phylogenomic Inference. Methods Mol Biol 2024; 2802:267-345. [PMID: 38819564 DOI: 10.1007/978-1-0716-3838-5_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/01/2024]
Abstract
Phylogenomics aims at reconstructing the evolutionary histories of organisms taking into account whole genomes or large fractions of genomes. Phylogenomics has significant applications in fields such as evolutionary biology, systematics, comparative genomics, and conservation genetics, providing valuable insights into the origins and relationships of species and contributing to our understanding of biological diversity and evolution. This chapter surveys phylogenetic concepts and methods aimed at both gene tree and species tree reconstruction while also addressing common pitfalls, providing references to relevant computer programs. A practical phylogenomic analysis example including bacterial genomes is presented at the end of the chapter.
Collapse
Affiliation(s)
- José S L Patané
- Laboratório de Genética e Cardiologia Molecular, Instituto do Coração/Heart Institute Hospital das Clínicas - Faculdade de Medicina da Universidade de São Paulo São Paulo, São Paulo, SP, Brazil
| | - Joaquim Martins
- Integrative Omics group, Biorenewables National Laboratory, Brazilian Center for Research in Energy and Materials, Campinas, SP, Brazil
| | - João Carlos Setubal
- Departmento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, SP, Brazil.
| |
Collapse
|
2
|
Tabatabaee Y, Zhang C, Warnow T, Mirarab S. Phylogenomic branch length estimation using quartets. Bioinformatics 2023; 39:i185-i193. [PMID: 37387151 PMCID: PMC10311336 DOI: 10.1093/bioinformatics/btad221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. RESULTS In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across the species tree. We present CASTLES, a new technique for estimating branch lengths on the species tree from estimated gene trees that uses these expected values, and our study shows that CASTLES improves on the most accurate prior methods with respect to both speed and accuracy. AVAILABILITY AND IMPLEMENTATION CASTLES is available at https://github.com/ytabatabaee/CASTLES.
Collapse
Affiliation(s)
- Yasamin Tabatabaee
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, United States
| | - Chao Zhang
- Department of Integrative Biology, University of California at Berkeley, Berkeley, CA 94720, United States
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, United States
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093, United States
| |
Collapse
|
3
|
Yan Z, Ogilvie HA, Nakhleh L. Comparing inference under the multispecies coalescent with and without recombination. Mol Phylogenet Evol 2023; 181:107724. [PMID: 36720421 DOI: 10.1016/j.ympev.2023.107724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2022] [Revised: 01/17/2023] [Accepted: 01/24/2023] [Indexed: 01/29/2023]
Abstract
Accurate inference of population parameters plays a pivotal role in unravelling evolutionary histories. While recombination has been universally accepted as a fundamental process in the evolution of sexually reproducing organisms, it remains challenging to model it exactly. Thus, existing coalescent-based approaches make different assumptions or approximations to facilitate phylogenetic inference, which can potentially bring about biases in estimates of evolutionary parameters when recombination is present. In this article, we evaluate the performance of population parameter estimation using three methods-StarBEAST2, SNAPP, and diCal2-that represent three different types of inference. We performed whole-genome simulations in which recombination rates, mutation rates, and levels of incomplete lineage sorting were varied. We show that StarBEAST2 using short or medium-sized loci is robust to realistic rates of recombination, which is in agreement with previous studies. SNAPP, as expected, is generally unaffected by recombination events. Most surprisingly, diCal2, a method that is designed to explicitly account for recombination, performs considerably worse than other methods under comparison.
Collapse
Affiliation(s)
- Zhi Yan
- Department of Computer Science, Rice University, 6100 Main Street, Houston 77005, TX, USA.
| | - Huw A Ogilvie
- Department of Computer Science, Rice University, 6100 Main Street, Houston 77005, TX, USA.
| | - Luay Nakhleh
- Department of Computer Science, Rice University, 6100 Main Street, Houston 77005, TX, USA.
| |
Collapse
|
4
|
Zaharias P, Warnow T. Recent progress on methods for estimating and updating large phylogenies. Philos Trans R Soc Lond B Biol Sci 2022; 377:20210244. [PMID: 35989607 PMCID: PMC9393559 DOI: 10.1098/rstb.2021.0244] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2021] [Accepted: 01/07/2022] [Indexed: 12/20/2022] Open
Abstract
With the increased availability of sequence data and even of fully sequenced and assembled genomes, phylogeny estimation of very large trees (even of hundreds of thousands of sequences) is now a goal for some biologists. Yet, the construction of these phylogenies is a complex pipeline presenting analytical and computational challenges, especially when the number of sequences is very large. In the past few years, new methods have been developed that aim to enable highly accurate phylogeny estimations on these large datasets, including divide-and-conquer techniques for multiple sequence alignment and/or tree estimation, methods that can estimate species trees from multi-locus datasets while addressing heterogeneity due to biological processes (e.g. incomplete lineage sorting and gene duplication and loss), and methods to add sequences into large gene trees or species trees. Here we present some of these recent advances and discuss opportunities for future improvements. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
Collapse
Affiliation(s)
- Paul Zaharias
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
5
|
How challenging RADseq data turned out to favor coalescent-based species tree inference. A case study in Aichryson (Crassulaceae). Mol Phylogenet Evol 2021; 167:107342. [PMID: 34785384 DOI: 10.1016/j.ympev.2021.107342] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2020] [Revised: 07/05/2021] [Accepted: 10/29/2021] [Indexed: 12/24/2022]
Abstract
Analysing multiple genomic regions while incorporating detection and qualification of discordance among regions has become standard for understanding phylogenetic relationships. In plants, which usually have comparatively large genomes, this is feasible by the combination of reduced-representation library (RRL) methods and high-throughput sequencing enabling the cost effective acquisition of genomic data for thousands of loci from hundreds of samples. One popular RRL method is RADseq. A major disadvantage of established RADseq approaches is the rather short fragment and sequencing range, leading to loci of little individual phylogenetic information. This issue hampers the application of coalescent-based species tree inference. The modified RADseq protocol presented here targets ca. 5,000 loci of 300-600nt length, sequenced with the latest short-read-sequencing (SRS) technology, has the potential to overcome this drawback. To illustrate the advantages of this approach we use the study group Aichryson Webb & Berthelott (Crassulaceae), a plant genus that diversified on the Canary Islands. The data analysis approach used here aims at a careful quality control of the long loci dataset. It involves an informed selection of thresholds for accurate clustering, a thorough exploration of locus properties, such as locus length, coverage and variability, to identify potential biased data and a comparative phylogenetic inference of filtered datasets, accompanied by an evaluation of resulting BS support, gene and site concordance factor values, to improve overall resolution of the resulting phylogenetic trees. The final dataset contains variable loci with an average length of 373nt and facilitates species tree estimation using a coalescent-based summary approach. Additional improvements brought by the approach are critically discussed.
Collapse
|
6
|
Alanzi AAR, Degnan JH. Statistical inconsistency of the unrooted minimize deep coalescence criterion. PLoS One 2021; 16:e0251107. [PMID: 33970931 PMCID: PMC8109837 DOI: 10.1371/journal.pone.0251107] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2020] [Accepted: 04/20/2021] [Indexed: 11/24/2022] Open
Abstract
Species trees, which describe the evolutionary relationships between species, are often inferred from gene trees, which describe the ancestral relationships between sequences sampled at different loci from the species of interest. A common approach to inferring species trees from gene trees is motivated by supposing that gene tree variation is due to incomplete lineage sorting, also known as deep coalescence. One of the earliest methods motivated by deep coalescence is to find the species tree that minimizes the number of deep coalescent events needed to explain discrepancies between the species tree and input gene trees. This minimize deep coalescence (MDC) criterion can be applied in both rooted and unrooted settings. where either rooted or unrooted gene trees can be used to infer a rooted species tree. Previous work has shown that MDC is statistically inconsistent in the rooted setting, meaning that under a probabilistic model for deep coalescence, the multispecies coalescent, for some species trees, increasing the number of input gene trees does not make the method more likely to return a correct species tree. Here, we obtain analogous results in the unrooted setting, showing conditions leading to inconsistency of the MDC criterion using the multispecies coalescent model with unrooted gene trees for four taxa and five taxa.
Collapse
Affiliation(s)
- Ayed A. R. Alanzi
- Mathematics Department, College of Science and Human Studies of Hotat Sudair, Majmaah University, Majmaah, Saudi Arabia
| | - James H. Degnan
- Department of Mathematics and Statistics, University of New Mexico, Albuquerque, NM, United States of America
- * E-mail:
| |
Collapse
|
7
|
Nydam ML, Lemmon AR, Cherry JR, Kortyna ML, Clancy DL, Hernandez C, Cohen CS. Phylogenomic and morphological relationships among the botryllid ascidians (Subphylum Tunicata, Class Ascidiacea, Family Styelidae). Sci Rep 2021; 11:8351. [PMID: 33863944 PMCID: PMC8052435 DOI: 10.1038/s41598-021-87255-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2020] [Accepted: 02/16/2021] [Indexed: 02/02/2023] Open
Abstract
Ascidians (Phylum Chordata, Class Ascidiacea) are a large group of invertebrates which occupy a central role in the ecology of marine benthic communities. Many ascidian species have become successfully introduced around the world via anthropogenic vectors. The botryllid ascidians (Order Stolidobranchia, Family Styelidae) are a group of 53 colonial species, several of which are widespread throughout temperate or tropical and subtropical waters. However, the systematics and biology of this group of ascidians is not well-understood. To provide a systematic framework for this group, we have constructed a well-resolved phylogenomic tree using 200 novel loci and 55 specimens. A Principal Components Analysis of all species described in the literature using 31 taxonomic characteristics revealed that some species occupy a unique morphological space and can be easily identified using characteristics of adult colonies. For other species, additional information such as larval or life history characteristics may be required for taxonomic discrimination. Molecular barcodes are critical for guiding the delineation of morphologically similar species in this group.
Collapse
Affiliation(s)
- Marie L Nydam
- Math and Science Program, Soka University of America, 1 University Drive, Aliso Viejo, CA, 92656, USA.
| | - Alan R Lemmon
- Department of Scientific Computing, Florida State University, 400 Dirac Science Library, Tallahassee, FL, 32306, USA
| | - Jesse R Cherry
- Department of Biological Science, Florida State University, 319 Stadium Drive, Tallahassee, FL, 32306, USA
| | - Michelle L Kortyna
- Department of Biological Science, Florida State University, 319 Stadium Drive, Tallahassee, FL, 32306, USA
| | - Darragh L Clancy
- Biology Department and Estuarine and Ocean Science Center, San Francisco State University, 3150 Paradise Drive, Tiburon, CA, 94920, USA
| | - Cecilia Hernandez
- Biology Department and Estuarine and Ocean Science Center, San Francisco State University, 3150 Paradise Drive, Tiburon, CA, 94920, USA
| | - C Sarah Cohen
- Biology Department and Estuarine and Ocean Science Center, San Francisco State University, 3150 Paradise Drive, Tiburon, CA, 94920, USA
| |
Collapse
|
8
|
Dibaeinia P, Tabe-Bordbar S, Warnow T. FASTRAL: Improving scalability of phylogenomic analysis. Bioinformatics 2021; 37:2317-2324. [PMID: 33576396 PMCID: PMC8388037 DOI: 10.1093/bioinformatics/btab093] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2020] [Revised: 02/02/2021] [Accepted: 02/04/2021] [Indexed: 01/22/2023] Open
Abstract
MOTIVATION ASTRAL is the current leading method for species tree estimation from phylogenomic datasets (i.e., hundreds to thousands of genes) that addresses gene tree discord resulting from incomplete lineage sorting (ILS). ASTRAL is statistically consistent under the multi-locus coalescent model (MSC), runs in polynomial time, and is able to run on large datasets. Key to ASTRAL's algorithm is the use of dynamic programming to find an optimal solution to the MQSST (maximum quartet support supertree) within a constraint space that it computes from the input. Yet, ASTRAL can fail to complete within reasonable timeframes on large datasets with many genes and species, because in these cases the constraint space it computes is too large. RESULTS Here we introduce FASTRAL, a phylogenomic estimation method. FASTRAL is based on ASTRAL, but uses a different technique for constructing the constraint space. The technique we use to define the constraint space maintains statistical consistency and is polynomial time; thus we prove that FASTRAL is a polynomial time algorithm that is statistically consistent under the MSC. Our performance study on both biological and simulated data sets demonstrates that FASTRAL matches or improves on ASTRAL with respect to species tree topology accuracy (and under high ILS conditions it is statistically significantly more accurate), while being dramatically faster-especially on datasets with large numbers of genes and high ILS-due to using a significantly smaller constraint space. AVAILABILITY FASTRAL is available in open-source form at https://github.com/PayamDiba/FASTRAL. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Payam Dibaeinia
- Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
| | - Shayan Tabe-Bordbar
- Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois, Urbana, IL 61801, USA,To whom correspondence should be addressed.
| |
Collapse
|
9
|
Mendes FK, Livera AP, Hahn MW. The perils of intralocus recombination for inferences of molecular convergence. Philos Trans R Soc Lond B Biol Sci 2019; 374:20180244. [PMID: 31154973 DOI: 10.1098/rstb.2018.0244] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023] Open
Abstract
Accurate inferences of convergence require that the appropriate tree topology be used. If there is a mismatch between the tree a trait has evolved along and the tree used for analysis, then false inferences of convergence ('hemiplasy') can occur. To avoid problems of hemiplasy when there are high levels of gene tree discordance with the species tree, researchers have begun to construct tree topologies from individual loci. However, due to intralocus recombination, even locus-specific trees may contain multiple topologies within them. This implies that the use of individual tree topologies discordant with the species tree can still lead to incorrect inferences about molecular convergence. Here, we examine the frequency with which single exons and single protein-coding genes contain multiple underlying tree topologies, in primates and Drosophila, and quantify the effects of hemiplasy when using trees inferred from individual loci. In both clades, we find that there are most often multiple diagnosable topologies within single exons and whole genes, with 91% of Drosophila protein-coding genes containing multiple topologies. Because of this underlying topological heterogeneity, even using trees inferred from individual protein-coding genes results in 25% and 38% of substitutions falsely labelled as convergent in primates and Drosophila, respectively. While constructing local trees can reduce the problem of hemiplasy, our results suggest that it will be difficult to completely avoid false inferences of convergence. We conclude by suggesting several ways forward in the analysis of convergent evolution, for both molecular and morphological characters. This article is part of the theme issue 'Convergent evolution in the genomics era: new insights and directions'.
Collapse
Affiliation(s)
- Fábio K Mendes
- 1 Department of Computer Science, The University of Auckland , Auckland 1010 , New Zealand.,2 Department of Biology, Indiana University , Bloomington, IN 47405 , USA
| | - Andrew P Livera
- 2 Department of Biology, Indiana University , Bloomington, IN 47405 , USA
| | - Matthew W Hahn
- 2 Department of Biology, Indiana University , Bloomington, IN 47405 , USA.,3 Department of Computer Science, Indiana University , Bloomington, IN 47405 , USA
| |
Collapse
|
10
|
Liu L, Anderson C, Pearl D, Edwards SV. Modern Phylogenomics: Building Phylogenetic Trees Using the Multispecies Coalescent Model. Methods Mol Biol 2019; 1910:211-239. [PMID: 31278666 DOI: 10.1007/978-1-4939-9074-0_7] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
The multispecies coalescent (MSC) model provides a compelling framework for building phylogenetic trees from multilocus DNA sequence data. The pure MSC is best thought of as a special case of so-called "multispecies network coalescent" models, in which gene flow is allowed among branches of the tree, whereas MSC methods assume there is no gene flow between diverging species. Early implementations of the MSC, such as "parsimony" or "democratic vote" approaches to combining information from multiple gene trees, as well as concatenation, in which DNA sequences from multiple gene trees are combined into a single "supergene," were quickly shown to be inconsistent in some regions of tree space, in so far as they converged on the incorrect species tree as more gene trees and sequence data were accumulated. The anomaly zone, a region of tree space in which the most frequent gene tree is different from the species tree, is one such region where many so-called "coalescent" methods are inconsistent. Second-generation implementations of the MSC employed Bayesian or likelihood models; these are consistent in all regions of gene tree space, but Bayesian methods in particular are incapable of handling the large phylogenomic data sets currently available. Two-step methods, such as MP-EST and ASTRAL, in which gene trees are first estimated and then combined to estimate an overarching species tree, are currently popular in part because they can handle large phylogenomic data sets. These methods are consistent in the anomaly zone but can sometimes provide inappropriate measures of tree support or apportion error and signal in the data inappropriately. MP-EST in particular employs a likelihood model which can be conveniently manipulated to perform statistical tests of competing species trees, incorporating the likelihood of the collected gene trees on each species tree in a likelihood ratio test. Such tests provide a useful alternative to the multilocus bootstrap, which only indirectly tests the appropriateness of competing species trees. We illustrate these tests and implementations of the MSC with examples and suggest that MSC methods are a useful class of models effectively using information from multiple loci to build phylogenetic trees.
Collapse
Affiliation(s)
- Liang Liu
- Department of Statistics, University of Georgia, Athens, GA, USA
| | | | - Dennis Pearl
- Department of Statistics, Pennsylvania State University, University Park, PA, USA
| | - Scott V Edwards
- Department of Organismic and Evolutionary Biology & Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA.
| |
Collapse
|
11
|
Alda F, Tagliacollo VA, Bernt MJ, Waltz BT, Ludt WB, Faircloth BC, Alfaro ME, Albert JS, Chakrabarty P. Resolving Deep Nodes in an Ancient Radiation of Neotropical Fishes in the Presence of Conflicting Signals from Incomplete Lineage Sorting. Syst Biol 2018; 68:573-593. [DOI: 10.1093/sysbio/syy085] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2018] [Revised: 11/30/2018] [Accepted: 12/03/2018] [Indexed: 12/13/2022] Open
Affiliation(s)
- Fernando Alda
- Museum of Natural Science, Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
- Department of Biology, Geology and Environmental Science, University of Tennessee at Chattanooga, Chattanooga, TN 37403, USA
| | - Victor A Tagliacollo
- Museu de Zoologia da Universidade de São Paulo (MZUSP), Ipirianga, 04263-000, São Paulo, São Paulo, Brazil
| | - Maxwell J Bernt
- Department of Biology, University of Louisiana at Lafayette, Lafayette, LA 70503, USA
| | - Brandon T Waltz
- Department of Biology, University of Louisiana at Lafayette, Lafayette, LA 70503, USA
| | - William B Ludt
- Museum of Natural Science, Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
| | - Brant C Faircloth
- Museum of Natural Science, Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
| | - Michael E Alfaro
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095, USA
| | - James S Albert
- Department of Biology, University of Louisiana at Lafayette, Lafayette, LA 70503, USA
| | - Prosanta Chakrabarty
- Museum of Natural Science, Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
| |
Collapse
|
12
|
Rabiee M, Sayyari E, Mirarab S. Multi-allele species reconstruction using ASTRAL. Mol Phylogenet Evol 2018; 130:286-296. [PMID: 30393186 DOI: 10.1016/j.ympev.2018.10.033] [Citation(s) in RCA: 86] [Impact Index Per Article: 14.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2017] [Revised: 10/23/2018] [Accepted: 10/24/2018] [Indexed: 11/29/2022]
Abstract
Genome-wide phylogeny reconstruction is becoming increasingly common, and one driving factor behind these phylogenomic studies is the promise that the potential discordance between gene trees and the species tree can be modeled. Incomplete lineage sorting is one cause of discordance that bridges population genetic and phylogenetic processes. ASTRAL is a species tree reconstruction method that seeks to find the tree with minimum quartet distance to an input set of inferred gene trees. However, the published ASTRAL algorithm only works with one sample per species. To account for polymorphisms in present-day species, one can sample multiple individuals per species to create multi-allele datasets. Here, we introduce how ASTRAL can handle multi-allele datasets. We show that the quartet-based optimization problem extends naturally, and we introduce heuristic methods for building the search space specifically for the case of multi-individual datasets. We study the accuracy and scalability of the multi-individual version of ASTRAL-III using extensive simulation studies and compare it to NJst, the only other scalable method that can handle these datasets. We do not find strong evidence that using multiple individuals dramatically improves accuracy. When we study the trade-off between sampling more genes versus more individuals, we find that sampling more genes is more effective than sampling more individuals, even under conditions that we study where trees are shallow (median length: ≈1Ne) and ILS is extremely high.
Collapse
Affiliation(s)
- Maryam Rabiee
- Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, United States
| | - Erfan Sayyari
- Department of Electrical and Computer Engineering, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, United States
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, United States.
| |
Collapse
|
13
|
Freitas L, Mello B, Schrago CG. Multispecies coalescent analysis confirms standing phylogenetic instability in Hexapoda. J Evol Biol 2018; 31:1623-1631. [PMID: 30058265 DOI: 10.1111/jeb.13355] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2018] [Revised: 06/28/2018] [Accepted: 07/23/2018] [Indexed: 11/28/2022]
Abstract
The multispecies coalescent (MSC) has been increasingly used in phylogenomic analyses due to the accommodation of gene tree topological heterogeneity by taking into account population-level processes, such as incomplete lineage sorting. In this sense, the phylogeny of insect species, which are characterized by their large effective population sizes, is suitable for a coalescent-based analysis. Furthermore, studies so far recovered short internal branches at early divergences of the insect tree of life, indicating fast evolutionary radiations that increase the probability of incomplete lineage sorting in deep time. Here, we investigated the performance of the MSC for a phylogenomic data set of hexapods compiled by Misof et al. (2014, Science 346:763). Our analysis recovered the monophyly of most insect orders, and major phylogenetic relationships were in agreement with current insect systematics. We identified, however, some evolutionary associations that were consistently problematic. Most noticeable, Hexapod monophyly was disrupted by the sister group relationship between the remiped crustacean and Insecta. Additionally, the interordinal relationships within Polyneoptera and Neuropteroidea were found to be phylogenetically unstable. We show that these controversial phylogenetic arrangements were also poorly supported by previous analyses, and therefore, we evaluated their robustness to stochastic errors from sampling sites and terminals, confirming standing problems in hexapod phylogeny in the genomics age.
Collapse
Affiliation(s)
- Lucas Freitas
- Departamento de Genética, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil
| | - Beatriz Mello
- Departamento de Genética, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil
| | - Carlos G Schrago
- Departamento de Genética, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil
| |
Collapse
|
14
|
Sayyari E, Whitfield JB, Mirarab S. Fragmentary Gene Sequences Negatively Impact Gene Tree and Species Tree Reconstruction. Mol Biol Evol 2018; 34:3279-3291. [PMID: 29029241 DOI: 10.1093/molbev/msx261] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open
Abstract
Species tree reconstruction from genome-wide data is increasingly being attempted, in most cases using a two-step approach of first estimating individual gene trees and then summarizing them to obtain a species tree. The accuracy of this approach, which promises to account for gene tree discordance, depends on the quality of the inferred gene trees. At the same time, phylogenomic and phylotranscriptomic analyses typically use involved bioinformatics pipelines for data preparation. Errors and shortcomings resulting from these preprocessing steps may impact the species tree analyses at the other end of the pipeline. In this article, we first show that the presence of fragmentary data for some species in a gene alignment, as often seen on real data, can result in substantial deterioration of gene trees, and as a result, the species tree. We then investigate a simple filtering strategy where individual fragmentary sequences are removed from individual genes but the rest of the gene is retained. Both in simulations and by reanalyzing a large insect phylotranscriptomic data set, we show the effectiveness of this simple filtering strategy.
Collapse
Affiliation(s)
- Erfan Sayyari
- Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA
| | | | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA
| |
Collapse
|
15
|
SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space. Mol Phylogenet Evol 2018. [DOI: 10.1016/j.ympev.2018.03.006] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
16
|
Davidson R, Lawhorn M, Rusinko J, Weber N. Efficient Quartet Representations of Trees and Applications to Supertree and Summary Methods. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1010-1015. [PMID: 28113327 DOI: 10.1109/tcbb.2016.2638911] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Quartet trees displayed by larger phylogenetic trees have long been used as inputs for species tree and supertree reconstruction. Computational constraints prevent the use of all displayed quartets in many practical problems with large numbers of taxa. We introduce the notion of an Efficient Quartet System (EQS) to represent a phylogenetic tree with a subset of the quartets displayed by the tree. We show mathematically that the set of quartets obtained from a tree via an EQS contains all of the combinatorial information of the tree itself. Using performance tests on simulated datasets, we also demonstrate that using an EQS to reduce the number of quartets in both summary method pipelines for species tree inference as well as methods for supertree inference results in only small reductions in accuracy.
Collapse
|
17
|
Ogilvie HA, Bouckaert RR, Drummond AJ. StarBEAST2 Brings Faster Species Tree Inference and Accurate Estimates of Substitution Rates. Mol Biol Evol 2018; 34:2101-2114. [PMID: 28431121 PMCID: PMC5850801 DOI: 10.1093/molbev/msx126] [Citation(s) in RCA: 256] [Impact Index Per Article: 42.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022] Open
Abstract
Fully Bayesian multispecies coalescent (MSC) methods like *BEAST estimate species trees from multiple sequence alignments. Today thousands of genes can be sequenced for a given study, but using that many genes with *BEAST is intractably slow. An alternative is to use heuristic methods which compromise accuracy or completeness in return for speed. A common heuristic is concatenation, which assumes that the evolutionary history of each gene tree is identical to the species tree. This is an inconsistent estimator of species tree topology, a worse estimator of divergence times, and induces spurious substitution rate variation when incomplete lineage sorting is present. Another class of heuristics directly motivated by the MSC avoids many of the pitfalls of concatenation but cannot be used to estimate divergence times. To enable fuller use of available data and more accurate inference of species tree topologies, divergence times, and substitution rates, we have developed a new version of *BEAST called StarBEAST2. To improve convergence rates we add analytical integration of population sizes, novel MCMC operators and other optimizations. Computational performance improved by 13.5× and 13.8× respectively when analyzing two empirical data sets, and an average of 33.1× across 30 simulated data sets. To enable accurate estimates of per-species substitution rates, we introduce species tree relaxed clocks, and show that StarBEAST2 is a more powerful and robust estimator of rate variation than concatenation. StarBEAST2 is available through the BEAUTi package manager in BEAST 2.4 and above.
Collapse
Affiliation(s)
- Huw A Ogilvie
- Division of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australia.,Centre for Computational Evolution, University of Auckland, Auckland, New Zealand
| | - Remco R Bouckaert
- Centre for Computational Evolution, University of Auckland, Auckland, New Zealand.,Department of Computer Science, University of Auckland, Auckland, New Zealand
| | - Alexei J Drummond
- Centre for Computational Evolution, University of Auckland, Auckland, New Zealand.,Department of Computer Science, University of Auckland, Auckland, New Zealand
| |
Collapse
|
18
|
Abstract
Phylogenomics aims at reconstructing the evolutionary histories of organisms taking into account whole genomes or large fractions of genomes. The abundance of genomic data for an enormous variety of organisms has enabled phylogenomic inference of many groups, and this has motivated the development of many computer programs implementing the associated methods. This chapter surveys phylogenetic concepts and methods aimed at both gene tree and species tree reconstruction while also addressing common pitfalls, providing references to relevant computer programs. A practical phylogenomic analysis example including bacterial genomes is presented at the end of the chapter.
Collapse
Affiliation(s)
- José S L Patané
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, Av. Prof. Lineu Prestes 748, São Paulo, SP, 05508-000, Brazil
| | - Joaquim Martins
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, Av. Prof. Lineu Prestes 748, São Paulo, SP, 05508-000, Brazil
| | - João C Setubal
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, Av. Prof. Lineu Prestes 748, São Paulo, SP, 05508-000, Brazil.
| |
Collapse
|
19
|
Molloy EK, Warnow T. To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods. Syst Biol 2017; 67:285-303. [DOI: 10.1093/sysbio/syx077] [Citation(s) in RCA: 138] [Impact Index Per Article: 19.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2016] [Accepted: 09/13/2017] [Indexed: 01/27/2023] Open
Affiliation(s)
- Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| |
Collapse
|
20
|
Bhattacharyya S, Mukherjee J. IDXL: Species Tree Inference Using Internode Distance and Excess Gene Leaf Count. J Mol Evol 2017; 85:57-78. [PMID: 28835989 DOI: 10.1007/s00239-017-9807-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2016] [Accepted: 08/09/2017] [Indexed: 11/28/2022]
Abstract
We propose an extension of the distance matrix methods NJst and ASTRID to infer species trees from incongruent gene trees having Incomplete Lineage Sorting. Both approaches consider the average internode distance (ID) between individual taxa pairs as the distance measure. The measure ID does not use the root of a tree, and thus may not always infer the relative position of a taxon with respect to the root. We define a novel distance measure excess gene leaf count (XL) between individual couplets. The XL measure is computed using the root of a tree. It is proved to be additive, and is shown to infer the relative order of divergence among individual couplets better. We propose a novel method IDXL which uses both the XL and ID measures for species tree construction. IDXL is shown to perform better than NJst and other distance matrix approaches for most of the biological and simulated datasets. Having the same computational complexity as NJst, IDXL can be applied for species tree inference on large-scale biological datasets.
Collapse
Affiliation(s)
- Sourya Bhattacharyya
- Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, WB, 721302, India.
| | - Jayanta Mukherjee
- Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, WB, 721302, India
| |
Collapse
|
21
|
Dupuis J, Brunet B, Bird H, Lumley L, Fagua G, Boyle B, Levesque R, Cusson M, Powell J, Sperling F. Genome-wide SNPs resolve phylogenetic relationships in the North American spruce budworm (Choristoneura fumiferana) species complex. Mol Phylogenet Evol 2017; 111:158-168. [DOI: 10.1016/j.ympev.2017.04.001] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2016] [Revised: 02/27/2017] [Accepted: 04/03/2017] [Indexed: 01/02/2023]
|
22
|
Lu L, Cox CJ, Mathews S, Wang W, Wen J, Chen Z. Optimal data partitioning, multispecies coalescent and Bayesian concordance analyses resolve early divergences of the grape family (Vitaceae). Cladistics 2017; 34:57-77. [DOI: 10.1111/cla.12191] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/04/2017] [Indexed: 12/25/2022] Open
Affiliation(s)
- Limin Lu
- State Key Laboratory of Systematic and Evolutionary Botany Institute of Botany Chinese Academy of Sciences Beijing 100093 China
| | - Cymon J. Cox
- Centro de Ciências do Mar Universidade do Algarve Gambelas Faro 8005‐319 Portugal
| | - Sarah Mathews
- CSIRO National Research Collections Australian National Herbarium Canberra ACT 2601 Australia
| | - Wei Wang
- State Key Laboratory of Systematic and Evolutionary Botany Institute of Botany Chinese Academy of Sciences Beijing 100093 China
| | - Jun Wen
- Department of Botany National Museum of Natural History MRC166, Smithsonian Institution Washington DC 20013‐7012 USA
| | - Zhiduan Chen
- State Key Laboratory of Systematic and Evolutionary Botany Institute of Botany Chinese Academy of Sciences Beijing 100093 China
| |
Collapse
|
23
|
Foley NM, Springer MS, Teeling EC. Mammal madness: is the mammal tree of life not yet resolved? Philos Trans R Soc Lond B Biol Sci 2016; 371:20150140. [PMID: 27325836 PMCID: PMC4920340 DOI: 10.1098/rstb.2015.0140] [Citation(s) in RCA: 154] [Impact Index Per Article: 19.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 04/27/2016] [Indexed: 11/12/2022] Open
Abstract
Most molecular phylogenetic studies place all placental mammals into four superordinal groups, Laurasiatheria (e.g. dogs, bats, whales), Euarchontoglires (e.g. humans, rodents, colugos), Xenarthra (e.g. armadillos, anteaters) and Afrotheria (e.g. elephants, sea cows, tenrecs), and estimate that these clades last shared a common ancestor 90-110 million years ago. This phylogeny has provided a framework for numerous functional and comparative studies. Despite the high level of congruence among most molecular studies, questions still remain regarding the position and divergence time of the root of placental mammals, and certain 'hard nodes' such as the Laurasiatheria polytomy and Paenungulata that seem impossible to resolve. Here, we explore recent consensus and conflict among mammalian phylogenetic studies and explore the reasons for the remaining conflicts. The question of whether the mammal tree of life is or can be ever resolved is also addressed.This article is part of the themed issue 'Dating species divergences using rocks and clocks'.
Collapse
Affiliation(s)
- Nicole M Foley
- School of Biology and Environmental Science, Science Centre East, University College Dublin, Dublin 4, Ireland
| | - Mark S Springer
- Department of Biology, University of California, Riverside, CA 92521, USA
| | - Emma C Teeling
- School of Biology and Environmental Science, Science Centre East, University College Dublin, Dublin 4, Ireland
| |
Collapse
|
24
|
Simmons MP, Sloan DB, Gatesy J. The effects of subsampling gene trees on coalescent methods applied to ancient divergences. Mol Phylogenet Evol 2016; 97:76-89. [PMID: 26768112 DOI: 10.1016/j.ympev.2015.12.013] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2015] [Revised: 12/03/2015] [Accepted: 12/20/2015] [Indexed: 10/22/2022]
Abstract
Gene-tree-estimation error is a major concern for coalescent methods of phylogenetic inference. We sampled eight empirical studies of ancient lineages with diverse numbers of taxa and genes for which the original authors applied one or more coalescent methods. We found that the average pairwise congruence among gene trees varied greatly both between studies and also often within a study. We recommend that presenting plots of pairwise congruence among gene trees in a dataset be treated as a standard practice for empirical coalescent studies so that readers can readily assess the extent and distribution of incongruence among gene trees. ASTRAL-based coalescent analyses generally outperformed MP-EST and STAR with respect to both internal consistency (congruence between analyses of subsamples of genes with the complete dataset of all genes) and congruence with the concatenation-based topology. We evaluated the approach of subsampling gene trees that are, on average, more congruent with other gene trees as a method to reduce artifacts caused by gene-tree-estimation errors on coalescent analyses. We suggest that this method is well suited to testing whether gene-tree-estimation error is a primary cause of incongruence between concatenation- and coalescent-based results, to reconciling conflicting phylogenetic results based on different coalescent methods, and to identifying genes affected by artifacts that may then be targeted for reciprocal illumination. We provide scripts that automate the process of calculating pairwise gene-tree incongruence and subsampling trees while accounting for differential taxon sampling among genes. Finally, we assert that multiple tree-search replicates should be implemented as a standard practice for empirical coalescent studies that apply MP-EST.
Collapse
Affiliation(s)
- Mark P Simmons
- Department of Biology, Colorado State University, Fort Collins, CO 80523, USA.
| | - Daniel B Sloan
- Department of Biology, Colorado State University, Fort Collins, CO 80523, USA
| | - John Gatesy
- Department of Biology, University of California, Riverside, CA 92521, USA
| |
Collapse
|
25
|
Springer MS, Gatesy J. The gene tree delusion. Mol Phylogenet Evol 2016; 94:1-33. [DOI: 10.1016/j.ympev.2015.07.018] [Citation(s) in RCA: 145] [Impact Index Per Article: 18.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2015] [Revised: 06/04/2015] [Accepted: 07/22/2015] [Indexed: 10/23/2022]
|
26
|
Mirarab S, Warnow T. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 2015; 31:i44-52. [PMID: 26072508 PMCID: PMC4765870 DOI: 10.1093/bioinformatics/btv234] [Citation(s) in RCA: 578] [Impact Index Per Article: 64.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022] Open
Abstract
Motivation: The estimation of species phylogenies requires multiple loci, since different loci can have different trees due to incomplete lineage sorting, modeled by the multi-species coalescent model. We recently developed a coalescent-based method, ASTRAL, which is statistically consistent under the multi-species coalescent model and which is more accurate than other coalescent-based methods on the datasets we examined. ASTRAL runs in polynomial time, by constraining the search space using a set of allowed ‘bipartitions’. Despite the limitation to allowed bipartitions, ASTRAL is statistically consistent. Results: We present a new version of ASTRAL, which we call ASTRAL-II. We show that ASTRAL-II has substantial advantages over ASTRAL: it is faster, can analyze much larger datasets (up to 1000 species and 1000 genes) and has substantially better accuracy under some conditions. ASTRAL’s running time is O(n2k|X|2), and ASTRAL-II’s running time is O(nk|X|2), where n is the number of species, k is the number of loci and X is the set of allowed bipartitions for the search space. Availability and implementation: ASTRAL-II is available in open source at https://github.com/smirarab/ASTRAL and datasets used are available at http://www.cs.utexas.edu/~phylo/datasets/astral2/. Contact:smirarab@gmail.com Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Siavash Mirarab
- Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, USA and Departments of Computer Science and Bioengineering, The University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, USA and Departments of Computer Science and Bioengineering, The University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA
| |
Collapse
|
27
|
Chou J, Gupta A, Yaduvanshi S, Davidson R, Nute M, Mirarab S, Warnow T. A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics 2015; 16 Suppl 10:S2. [PMID: 26449249 PMCID: PMC4602346 DOI: 10.1186/1471-2164-16-s10-s2] [Citation(s) in RCA: 95] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Species tree estimation is challenging in the presence of incomplete lineage sorting (ILS), which can make gene trees different from the species tree. Because ILS is expected to occur and the standard concatenation approach can return incorrect trees with high support in the presence of ILS, "coalescent-based" summary methods (which first estimate gene trees and then combine gene trees into a species tree) have been developed that have theoretical guarantees of robustness to arbitrarily high amounts of ILS. Some studies have suggested that summary methods should only be used on "c-genes" (i.e., recombination-free loci) that can be extremely short (sometimes fewer than 100 sites). However, gene trees estimated on short alignments can have high estimation error, and summary methods tend to have high error on short c-genes. To address this problem, Chifman and Kubatko introduced SVDquartets, a new coalescent-based method. SVDquartets takes multi-locus unlinked single-site data, infers the quartet trees for all subsets of four species, and then combines the set of quartet trees into a species tree using a quartet amalgamation heuristic. Yet, the relative accuracy of SVDquartets to leading coalescent-based methods has not been assessed. RESULTS We compared SVDquartets to two leading coalescent-based methods (ASTRAL-2 and NJst), and to concatenation using maximum likelihood. We used a collection of simulated datasets, varying ILS levels, numbers of taxa, and number of sites per locus. Although SVDquartets was sometimes more accurate than ASTRAL-2 and NJst, most often the best results were obtained using ASTRAL-2, even on the shortest gene sequence alignments we explored (with only 10 sites per locus). Finally, concatenation was the most accurate of all methods under low ILS conditions. CONCLUSIONS ASTRAL-2 generally had the best accuracy under higher ILS conditions, and concatenation had the best accuracy under the lowest ILS conditions. However, SVDquartets was competitive with the best methods under conditions with low ILS and small numbers of sites per locus. The good performance under many conditions of ASTRAL-2 in comparison to SVDquartets is surprising given the known vulnerability of ASTRAL-2 and similar methods to short gene sequences.
Collapse
Affiliation(s)
- Jed Chou
- Department of Mathematics, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Ashu Gupta
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Shashank Yaduvanshi
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Ruth Davidson
- Department of Mathematics, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Mike Nute
- Department of Statistics, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093, USA
- Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA
| |
Collapse
|
28
|
Davidson R, Vachaspati P, Mirarab S, Warnow T. Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer. BMC Genomics 2015; 16 Suppl 10:S1. [PMID: 26450506 PMCID: PMC4603753 DOI: 10.1186/1471-2164-16-s10-s1] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Species tree estimation is challenged by gene tree heterogeneity resulting from biological processes such as duplication and loss, hybridization, incomplete lineage sorting (ILS), and horizontal gene transfer (HGT). Mathematical theory about reconstructing species trees in the presence of HGT alone or ILS alone suggests that quartet-based species tree methods (known to be statistically consistent under ILS, or under bounded amounts of HGT) might be effective techniques for estimating species trees when both HGT and ILS are present. RESULTS We evaluated several publicly available coalescent-based methods and concatenation under maximum likelihood on simulated datasets with moderate ILS and varying levels of HGT. Our study shows that two quartet-based species tree estimation methods (ASTRAL-2 and weighted Quartets MaxCut) are both highly accurate, even on datasets with high rates of HGT. In contrast, although NJst and concatenation using maximum likelihood are highly accurate under low HGT, they are less robust to high HGT rates. CONCLUSION Our study shows that quartet-based species-tree estimation methods can be highly accurate under the presence of both HGT and ILS. The study suggests the possibility that some quartet-based methods might be statistically consistent under phylogenomic models of gene tree heterogeneity with both HGT and ILS.
Collapse
Affiliation(s)
- Ruth Davidson
- Department of Mathematics, University of Illinois at Urbana-Champaign, 1409 W. Green Street, 61801 Urbana, IL, USA
| | - Pranjal Vachaspati
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, 61801 Urbana, IL, USA
| | - Siavash Mirarab
- Department of Computer Science, University of Texas at Austin, 2317 Speedway, Stop D9500, 78712 Austin, TX, USA
- Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, 92093, La Jolla, CA, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, 61801 Urbana, IL, USA
- Department of Bioengineering, University of Illinois at Urbana-Champaign, 1270 Digital Computer Laboratory, MC-278, 61801 Urbana, IL, USA
| |
Collapse
|
29
|
Heyduk K, Trapnell DW, Barrett CF, Leebens-Mack J. Phylogenomic analyses of species relationships in the genusSabal(Arecaceae) using targeted sequence capture. Biol J Linn Soc Lond 2015. [DOI: 10.1111/bij.12551] [Citation(s) in RCA: 79] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Affiliation(s)
- Karolina Heyduk
- Department of Plant Biology; University of Georgia; Athens GA 30602 USA
| | | | - Craig F. Barrett
- Department of Biological Sciences; California State University; Los Angeles CA 90032 USA
| | - Jim Leebens-Mack
- Department of Plant Biology; University of Georgia; Athens GA 30602 USA
| |
Collapse
|
30
|
Roch S, Warnow T. On the Robustness to Gene Tree Estimation Error (or lack thereof) of Coalescent-Based Species Tree Methods. Syst Biol 2015; 64:663-76. [PMID: 25813358 DOI: 10.1093/sysbio/syv016] [Citation(s) in RCA: 95] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2014] [Accepted: 03/20/2015] [Indexed: 11/13/2022] Open
Abstract
The estimation of species trees using multiple loci has become increasingly common. Because different loci can have different phylogenetic histories (reflected in different gene tree topologies) for multiple biological causes, new approaches to species tree estimation have been developed that take gene tree heterogeneity into account. Among these multiple causes, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is potentially the most common cause of gene tree heterogeneity, and much of the focus of the recent literature has been on how to estimate species trees in the presence of ILS. Despite progress in developing statistically consistent techniques for estimating species trees when gene trees can differ due to ILS, there is substantial controversy in the systematics community as to whether to use the new coalescent-based methods or the traditional concatenation methods. One of the key issues that has been raised is understanding the impact of gene tree estimation error on coalescent-based methods that operate by combining gene trees. Here we explore the mathematical guarantees of coalescent-based methods when analyzing estimated rather than true gene trees. Our results provide some insight into the differences between promise of coalescent-based methods in theory and their performance in practice.
Collapse
Affiliation(s)
- Sebastien Roch
- Department of Mathematics, University of Wisconsin at Madison, 480 Lincoln Dr., Madison, Wisconsin, 53706, USA and Departments of Bioengineering and Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Tandy Warnow
- Department of Mathematics, University of Wisconsin at Madison, 480 Lincoln Dr., Madison, Wisconsin, 53706, USA and Departments of Bioengineering and Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| |
Collapse
|
31
|
Tonini J, Moore A, Stern D, Shcheglovitova M, Ortí G. Concatenation and Species Tree Methods Exhibit Statistically Indistinguishable Accuracy under a Range of Simulated Conditions. PLOS CURRENTS 2015; 7. [PMID: 25901289 PMCID: PMC4391732 DOI: 10.1371/currents.tol.34260cc27551a527b124ec5f6334b6be] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Phylogeneticists have long understood that several biological processes can cause a gene tree to disagree with its species tree. In recent years, molecular phylogeneticists have increasingly foregone traditional supermatrix approaches in favor of species tree methods that account for one such source of error, incomplete lineage sorting (ILS). While gene tree-species tree discordance no doubt poses a significant challenge to phylogenetic inference with molecular data, researchers have only recently begun to systematically evaluate the relative accuracy of traditional and ILS-sensitive methods. Here, we report on simulations demonstrating that concatenation can perform as well or better than methods that attempt to account for sources of error introduced by ILS. Based on these and similar results from other researchers, we argue that concatenation remains a useful component of the phylogeneticist’s toolbox and highlight that phylogeneticists should continue to make explicit comparisons of results produced by contemporaneous and classical methods.
Collapse
Affiliation(s)
- João Tonini
- Department of Biological Sciences, The George Washington Univerisity, Washington, District of Columbia, USA
| | - Andrew Moore
- Department of Biological Sciences, The George Washington University, Washington, District of Columbia, USA
| | - David Stern
- Computational Biology Institute, Department of Biological Sciences, The George Washington University, Washington, District of Columbia, USA
| | - Maryia Shcheglovitova
- Department of Geography & Environmental Systems, University of Maryland Baltimore County, Baltimore, MD, USA
| | - Guillermo Ortí
- Department of Biological Sciences, The George Washington Univerisity, Washington, District of Columbia, USA
| |
Collapse
|