1
Patané JSL, Martins J, Setubal JC. A Guide to Phylogenomic Inference. Methods Mol Biol 2024; 2802:267-345. [PMID: 38819564 DOI: 10.1007/978-1-0716-3838-5_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Phylogenomics aims at reconstructing the evolutionary histories of organisms taking into account whole genomes or large fractions of genomes. Phylogenomics has significant applications in fields such as evolutionary biology, systematics, comparative genomics, and conservation genetics, providing valuable insights into the origins and relationships of species and contributing to our understanding of biological diversity and evolution. This chapter surveys phylogenetic concepts and methods aimed at both gene tree and species tree reconstruction while also addressing common pitfalls, providing references to relevant computer programs. A practical phylogenomic analysis example including bacterial genomes is presented at the end of the chapter.
Affiliation(s)
- José S L Patané
- Laboratório de Genética e Cardiologia Molecular, Instituto do Coração/Heart Institute Hospital das Clínicas - Faculdade de Medicina da Universidade de São Paulo São Paulo, São Paulo, SP, Brazil
- Joaquim Martins
- Integrative Omics group, Biorenewables National Laboratory, Brazilian Center for Research in Energy and Materials, Campinas, SP, Brazil
- João Carlos Setubal
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, SP, Brazil
2
Liu B, Warnow T. Weighted ASTRID: fast and accurate species trees from weighted internode distances. Algorithms Mol Biol 2023; 18:6. [PMID: 37468904 PMCID: PMC10355063 DOI: 10.1186/s13015-023-00230-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Accepted: 06/10/2023] [Indexed: 07/21/2023]
Abstract
BACKGROUND Species tree estimation is a basic step in many biological research projects, but is complicated by the fact that gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT), which can cause different regions within the genome to have different evolutionary histories (i.e., "gene tree heterogeneity"). One approach to estimating species trees in the presence of gene tree heterogeneity resulting from ILS operates by computing trees on each genomic region (i.e., computing "gene trees") and then using these gene trees to define a matrix of average internode distances, where the internode distance in a tree T between two species x and y is the number of nodes in T between the leaves corresponding to x and y. Given such a matrix, a tree can then be computed using methods such as neighbor joining. Methods such as ASTRID and NJst (which use this basic approach) are provably statistically consistent, very fast (low degree polynomial time) and have had high accuracy under many conditions that makes them competitive with other popular species tree estimation methods. In this study, inspired by the very recent work of weighted ASTRAL, we present weighted ASTRID, a variant of ASTRID that takes the branch uncertainty on the gene trees into account in the internode distance. RESULTS Our experimental study evaluating weighted ASTRID typically shows improvements in accuracy compared to the original (unweighted) ASTRID, and shows competitive accuracy against weighted ASTRAL, the state of the art. Our re-implementation of ASTRID also improves the runtime, with marked improvements on large datasets. CONCLUSIONS Weighted ASTRID is a new and very fast method for species tree estimation that typically improves upon ASTRID and has comparable accuracy to weighted ASTRAL, while remaining much faster. Weighted ASTRID is available at https://github.com/RuneBlaze/internode .
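The average internode distance matrix described in this abstract can be illustrated with a minimal sketch. This is not the ASTRID or weighted ASTRID implementation: the nested-tuple tree encoding and the function names `build_graph`, `internode_distances`, and `average_internode_matrix` are assumptions made for illustration, trees are treated as unrooted, and no branch-support weighting or neighbor joining step is included.

```python
from collections import deque


def build_graph(tree):
    """Turn a nested-tuple tree (leaves are strings) into an undirected graph."""
    adj = {}
    counter = [0]

    def add_edge(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def visit(node):
        if isinstance(node, str):          # leaf: its label is the node id
            return node
        counter[0] += 1
        name = ("internal", counter[0])    # hypothetical id for an internal node
        for child in node:
            add_edge(name, visit(child))
        return name

    if len(tree) == 2:
        # A rooted binary tree: suppress the degree-2 root so the tree is unrooted.
        add_edge(visit(tree[0]), visit(tree[1]))
    else:
        visit(tree)
    return adj


def internode_distances(tree):
    """Number of internal nodes on the path between every pair of leaves."""
    adj = build_graph(tree)
    leaves = [n for n in adj if isinstance(n, str)]
    dist = {}
    for src in leaves:
        depth = {src: 0}
        queue = deque([src])
        while queue:                       # breadth-first search counts edges
            u = queue.popleft()
            for v in adj[u]:
                if v not in depth:
                    depth[v] = depth[u] + 1
                    queue.append(v)
        for dst in leaves:
            if dst != src:
                # nodes strictly between two leaves = edges on the path - 1
                dist[frozenset((src, dst))] = depth[dst] - 1
    return dist


def average_internode_matrix(gene_trees):
    """Average each pairwise distance over the gene trees containing both taxa."""
    totals, counts = {}, {}
    for tree in gene_trees:
        for pair, d in internode_distances(tree).items():
            totals[pair] = totals.get(pair, 0) + d
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


gene_trees = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))]
matrix = average_internode_matrix(gene_trees)
for pair in sorted(matrix, key=sorted):
    print(sorted(pair), matrix[pair])
```

Averaging only over the trees that contain both taxa is one simple way to tolerate missing taxa; the resulting matrix would then be passed to a distance method such as neighbor joining.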
Affiliation(s)
- Baqiao Liu
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL USA
- Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL USA
3
Wilson D, Rogers JD. Evaluating Compression-Based Phylogeny Estimation in the Presence of Incomplete Lineage Sorting. J Comput Biol 2023; 30:250-260. [PMID: 36848254 DOI: 10.1089/cmb.2022.0197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2023]
Abstract
This study assesses characteristics of the normalized compression distance (NCD) technique for building phylogenetic trees from molecular data. We examined results from a mammalian biological data set as well as a collection of simulated data with varying levels of incomplete lineage sorting. The implementation of NCD we analyze is a concatenation-based, distance-based, alignment-free, and model-free phylogeny estimation method, which takes concatenated unaligned sequence data as input and outputs a matrix of distances. We compare the NCD phylogeny estimation method with various other methods, including coalescent- and concatenation-based methods.
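For orientation, the normalized compression distance is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s. The sketch below computes a pairwise NCD matrix from unaligned sequences. The compressor is an assumption (zlib from the Python standard library; the abstract does not say which compressor the study used), and the function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(seqs: dict) -> dict:
    """Pairwise NCD for a mapping of taxon name -> unaligned sequence bytes."""
    names = sorted(seqs)
    return {(a, b): ncd(seqs[a], seqs[b])
            for a in names for b in names if a < b}


# Toy input: each value stands in for one taxon's concatenated, unaligned loci.
sequences = {
    "taxon1": b"ACGTACGTTGCAACGTAGCTAGCTAGGATCCA" * 20,
    "taxon2": b"ACGTACGTTGCAACGTAGCTAGCTAGGATCGA" * 20,
    "taxon3": b"TTGACCGATATCGGCTAAGCTTGCATGCCTGC" * 20,
}
for pair, d in ncd_matrix(sequences).items():
    print(pair, round(d, 3))
```

A distance matrix of this kind would then be handed to a distance-based tree builder; that step is omitted here.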
Affiliation(s)
- Deangelo Wilson
- School of Computing, DePaul University, Chicago, Illinois, USA
- John D Rogers
- School of Computing, DePaul University, Chicago, Illinois, USA
4
Sosiak CE, Borowiec ML, Barden P. An Eocene army ant. Biol Lett 2022; 18:20220398. [PMID: 36416032 PMCID: PMC9682434 DOI: 10.1098/rsbl.2022.0398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Accepted: 11/01/2022] [Indexed: 11/24/2022]
Abstract
Among social insects, army ants are exceptional in their voracious coordinated predation, nomadic life history and highly specialized wingless queens: the synthesis of these remarkable traits is referred to as the army ant syndrome. Despite molecular evidence that the army ant syndrome evolved twice during the mid-Cenozoic, once in the Neotropics and once in the Afrotropics, fossil army ants are markedly scarce, comprising a single known species from the Caribbean 16 Ma. Here we report the oldest army ant fossil and the first from the Eastern Hemisphere (EH), Dissimulodorylus perseus, preserved in Baltic amber dated to the Eocene. Using a combined morphological and molecular ultra conserved elements dataset spanning doryline lineages, we find that D. perseus is nested among extant EH army ants with affinities to Dorylus. Army ants are characterized by limited extant diversification throughout most of the Cenozoic; the discovery of D. perseus suggests an unexpected diversity of now-extinct army ant lineages in the Cenozoic, some of which were present in Continental Europe.
Affiliation(s)
- Christine E. Sosiak
- Federated Department of Biological Sciences, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Marek L. Borowiec
- Department of Agricultural Biology and C. P. Gillette Museum of Arthropod Diversity, Colorado State University, CO 80523, USA
- Phillip Barden
- Federated Department of Biological Sciences, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Division of Invertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA
5
Zhang C, Mirarab S. Weighting by Gene Tree Uncertainty Improves Accuracy of Quartet-based Species Trees. Mol Biol Evol 2022; 39:6750035. [PMID: 36201617 PMCID: PMC9750496 DOI: 10.1093/molbev/msac215] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 02/08/2022] [Revised: 09/20/2022] [Accepted: 10/03/2022] [Indexed: 01/07/2023]
Abstract
Phylogenomic analyses routinely estimate species trees using methods that account for gene tree discordance. However, the most scalable species tree inference methods, which summarize independently inferred gene trees to obtain a species tree, are sensitive to hard-to-avoid errors introduced in the gene tree estimation step. This dilemma has created much debate on the merits of concatenation versus summary methods and practical obstacles to using summary methods more widely and to the exclusion of concatenation. The most successful attempt at making summary methods resilient to noisy gene trees has been contracting low support branches from the gene trees. Unfortunately, this approach requires arbitrary thresholds and poses new challenges. Here, we introduce threshold-free weighting schemes for the quartet-based species tree inference, the metric used in the popular method ASTRAL. By reducing the impact of quartets with low support or long terminal branches (or both), weighting provides stronger theoretical guarantees and better empirical performance than the unweighted ASTRAL. Our simulations show that weighting improves accuracy across many conditions and reduces the gap with concatenation in conditions with low gene tree discordance and high noise. On empirical data, weighting improves congruence with concatenation and increases support. Together, our results show that weighting, enabled by a new optimization algorithm we introduce, improves the utility of summary methods and can reduce the incongruence often observed across analytical pipelines.
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, UC San Diego, La Jolla, CA, USA
6
Mahbub S, Sawmya S, Saha A, Reaz R, Rahman MS, Bayzid MS. Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data. J Comput Biol 2022; 29:1156-1172. [PMID: 36048555 DOI: 10.1089/cmb.2022.0212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present Quartet based Gene tree Imputation using Deep Learning (QT-GILD), an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing, which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique needing no explicit modeling of the subject system or reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical datasets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in the species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. Therefore, QT-GILD advances the state-of-the-art in species tree estimation from gene trees in the face of missing data.
Affiliation(s)
- Sazan Mahbub
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Department of Computer Science, University of Maryland, College Park, Maryland, USA
- Shashata Sawmya
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Arpita Saha
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Rezwana Reaz
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- M Sohel Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
7
Dasarathy G, Mossel E, Nowak R, Roch S. A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements. J Math Biol 2022; 84:36. [PMID: 35394192 PMCID: PMC9258723 DOI: 10.1007/s00285-022-01731-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/04/2020] [Revised: 02/15/2022] [Accepted: 02/17/2022] [Indexed: 10/18/2022]
Abstract
Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene (the gene trees) often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated to each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard (but unsatisfactory) assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error, or completely avoid gene tree estimation in the first place. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of [Formula: see text]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we further generalize this result beyond this assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with [Formula: see text] species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.
Affiliation(s)
- Gautam Dasarathy
- School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, USA
- Elchanan Mossel
- Department of Mathematics and IDSS, Massachusetts Institute of Technology, Cambridge, USA
- Robert Nowak
- Department of Electrical and Computer Engineering, University of Wisconsin, Madison, USA
- Sebastien Roch
- Department of Mathematics, University of Wisconsin, Madison, USA
8
Simmons MP, Springer MS, Gatesy J. Gene-tree misrooting drives conflicts in phylogenomic coalescent analyses of palaeognath birds. Mol Phylogenet Evol 2021; 167:107344. [PMID: 34748873 DOI: 10.1016/j.ympev.2021.107344] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/13/2021] [Revised: 10/08/2021] [Accepted: 11/02/2021] [Indexed: 10/19/2022]
Abstract
Phylogenomic analyses of ancient rapid radiations can produce conflicting results that are driven by differential sampling of taxa and characters as well as the limitations of alternative analytical methods. We re-examine basal relationships of palaeognath birds (ratites and tinamous) using recently published datasets of nucleotide characters from 20,850 loci as well as 4301 retroelement insertions. The original studies attributed conflicting resolutions of rheas in their inferred coalescent and concatenation trees to concatenation failing in the anomaly zone. By contrast, we find that the coalescent-based resolution of rheas is premised upon extensive gene-tree estimation errors. Furthermore, retroelement insertions contain much more conflict than originally reported and multiple insertion loci support the basal position of rheas found in concatenation trees, while none were reported in the original publication. We demonstrate how even remarkable congruence in phylogenomic studies may be driven by long-branch misplacement of a divergent outgroup, highly incongruent gene trees, differential taxon sampling that can result in gene-tree misrooting errors that bias species-tree inference, and gross homology errors. What was previously interpreted as broad, robustly supported corroboration for a single resolution in coalescent analyses may instead indicate a common bias that taints phylogenomic results across multiple genome-scale datasets. The updated retroelement dataset now supports a species tree with branch lengths that suggest an ancient anomaly zone, and both concatenation and coalescent analyses of the huge nucleotide datasets fail to yield coherent, reliable results in this challenging phylogenetic context.
Affiliation(s)
- Mark P Simmons
- Department of Biology, Colorado State University, Fort Collins, CO 80523, USA
- Mark S Springer
- Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, CA 92521, USA
- John Gatesy
- Division of Vertebrate Zoology and Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA
9
Adams RH, Castoe TA, DeGiorgio M. PhyloWGA: chromosome-aware phylogenetic interrogation of whole genome alignments. Bioinformatics 2021; 37:1923-1925. [PMID: 33051672 DOI: 10.1093/bioinformatics/btaa884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2019] [Revised: 09/16/2020] [Accepted: 09/29/2020] [Indexed: 11/13/2022]
Abstract
SUMMARY Here, we present PhyloWGA, an open source R package for conducting phylogenetic analysis and investigation of whole genome data. AVAILABILITY AND IMPLEMENTATION Available at GitHub (https://github.com/radamsRHA/PhyloWGA). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Richard H Adams
- Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Todd A Castoe
- Department of Biology, University of Texas at Arlington, Arlington, TX 76019, USA
- Michael DeGiorgio
- Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
10
Doyle JJ. Defining coalescent genes: Theory meets practice in organelle phylogenomics. Syst Biol 2021; 71:476-489. [PMID: 34191012 DOI: 10.1093/sysbio/syab053] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 03/01/2021] [Revised: 06/24/2021] [Accepted: 06/28/2021] [Indexed: 11/13/2022]
Abstract
The species tree paradigm that dominates current molecular systematic practice infers species trees from collections of sequences under assumptions of the multispecies coalescent (MSC), i.e., that there is free recombination between the sequences and no (or very low) recombination within them. These coalescent genes (c-genes) are thus defined in an historical rather than molecular sense, and can in theory be as large as an entire genome or as small as a single nucleotide. A debate about how to define c-genes centers on the contention that nuclear gene sequences used in many coalescent analyses undergo too much recombination, such that their introns comprise multiple c-genes, violating a key assumption of the MSC. Recently a similar argument has been made for the genes of plastid (e.g., chloroplast) and mitochondrial genomes, which for the last 30 or more years have been considered to represent a single c-gene for the purposes of phylogeny reconstruction because they are non-recombining in a historical sense. Consequently, it has been suggested that these genomes should be analyzed using coalescent methods that treat their genes (over 70 protein-coding genes in the case of most plastid genomes, or plastomes) as independent estimates of species phylogeny, in contrast to the usual practice of concatenation, which is appropriate for generating gene trees. However, although recombination certainly occurs in the plastome, as has been recognized since the 1970s, it is unlikely to be phylogenetically relevant. This is because such historically effective recombination can only occur when plastomes with incongruent histories are brought together in the same plastid. However, plastids sort rapidly into different cell lineages and rarely fuse. Thus, because of plastid biology, the plastome is a more canonical c-gene than is the average multi-intron mammalian nuclear gene. The plastome should thus continue to be treated as a single estimate of the underlying species phylogeny, as should the mitochondrial genome. The implications of this long-held insight of molecular systematics for studies in the phylogenomic era are explored.
Affiliation(s)
- Jeff J Doyle
- Plant Biology Section, Plant Breeding & Genetics Section, and L. H. Bailey Hortorium, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853 USA
11
Mahbub M, Wahab Z, Reaz R, Rahman MS, Bayzid MS. wQFM: Highly Accurate Genome-scale Species Tree Estimation from Weighted Quartets. Bioinformatics 2021; 37:3734-3743. [PMID: 34086858 DOI: 10.1093/bioinformatics/btab428] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/30/2020] [Revised: 05/24/2021] [Accepted: 06/03/2021] [Indexed: 02/01/2023]
Abstract
MOTIVATION Species tree estimation from genes sampled from throughout the whole genome is complicated due to the gene tree-species tree discordance. Incomplete lineage sorting (ILS) is one of the most frequent causes for this discordance, where alleles can coexist in populations for periods that may span several speciation events. Quartet-based summary methods for estimating species trees from a collection of gene trees are becoming popular due to their high accuracy and statistical guarantee under ILS. Generating quartets with appropriate weights, where weights correspond to the relative importance of quartets, and subsequently amalgamating the weighted quartets to infer a single coherent species tree can allow for a statistically consistent way of estimating species trees. However, handling weighted quartets is challenging. RESULTS We propose wQFM, a highly accurate method for species tree estimation from multi-locus data, by extending the quartet FM (QFM) algorithm to a weighted setting. wQFM was assessed on a collection of simulated and real biological datasets, including the avian phylogenomic dataset which is one of the largest phylogenomic datasets to date. We compared wQFM with wQMC, which is the best alternate method for weighted quartet amalgamation, and with ASTRAL, which is one of the most accurate and widely used coalescent-based species tree estimation methods. Our results suggest that wQFM matches or improves upon the accuracy of wQMC and ASTRAL. AVAILABILITY wQFM is available in open source form at https://github.com/Mahim1997/wQFM-2020. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mahim Mahbub
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
- Zahin Wahab
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
- Rezwana Reaz
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
- M Saifur Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
- Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
12
Farah IT, Islam MM, Zinat KT, Rahman AH, Bayzid MS. Species tree estimation from gene trees by minimizing deep coalescence and maximizing quartet consistency: a comparative study and the presence of pseudo species tree terraces. Syst Biol 2021; 70:1213-1231. [PMID: 33844023 DOI: 10.1093/sysbio/syab026] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/31/2020] [Revised: 03/25/2021] [Accepted: 03/29/2021] [Indexed: 11/14/2022]
Abstract
Species tree estimation from multi-locus datasets is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed which estimate gene trees and then combine the gene trees to estimate a species tree by optimizing various optimization scores. In this study, we have extended and adapted the concept of phylogenetic terraces to species tree estimation by "summarizing" a set of gene trees, where multiple species trees with distinct topologies may have exactly the same optimality score (i.e., quartet score, extra lineage score, etc.). We particularly investigated the presence and impacts of equally optimal trees in species tree estimation from multi-locus data using summary methods by taking ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. We present a comprehensive comparative study of these two optimality criteria. Our experiments, on a collection of datasets simulated under ILS, indicate that MDC may result in competitive or identical quartet consistency score as MQC, but could be significantly worse than MQC in terms of tree accuracy - demonstrating the presence and impacts of equally optimal species trees. This is the first known study that provides the conditions for the datasets to have equally optimal trees in the context of phylogenomic inference using summary methods.
Affiliation(s)
- Ishrat Tanzila Farah
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
- Md Muktadirul Islam
- Applied Statistics and Data Science (ASDS), Department of Statistics, Jahangirnagar University, Dhaka-1342, Bangladesh
- Kazi Tasnim Zinat
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
- Department of Computer Science, University of Maryland, College Park, Maryland, USA
- Atif Hasan Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
- Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
13
Bossert S, Murray EA, Pauly A, Chernyshov K, Brady SG, Danforth BN. Gene Tree Estimation Error with Ultraconserved Elements: An Empirical Study on Pseudapis Bees. Syst Biol 2020; 70:803-821. [PMID: 33367855 DOI: 10.1093/sysbio/syaa097] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 11/05/2019] [Revised: 11/18/2020] [Accepted: 12/02/2020] [Indexed: 11/12/2022]
Abstract
Summarizing individual gene trees to species phylogenies using two-step coalescent methods is now a standard strategy in the field of phylogenomics. However, practical implementations of summary methods suffer from gene tree estimation error, which is caused by various biological and analytical factors. Greatly understudied is the choice of gene tree inference method and downstream effects on species tree estimation for empirical data sets. To better understand the impact of this method choice on gene and species tree accuracy, we compare gene trees estimated through four widely used programs under different model-selection criteria: PhyloBayes, MrBayes, IQ-Tree, and RAxML. We study their performance in the phylogenomic framework of >800 ultraconserved elements from the bee subfamily Nomiinae (Halictidae). Our taxon sampling focuses on the genus Pseudapis, a distinct lineage with diverse morphological features, but contentious morphology-based taxonomic classifications and no molecular phylogenetic guidance. We approximate topological accuracy of gene trees by assessing their ability to recover two uncontroversial, monophyletic groups, and compare branch lengths of individual trees using the stemminess metric (the relative length of internal branches). We further examine different strategies of removing uninformative loci and the collapsing of weakly supported nodes into polytomies. We then summarize gene trees with ASTRAL and compare resulting species phylogenies, including comparisons to concatenation-based estimates. Gene trees obtained with the reversible jump model search in MrBayes were most concordant on average and all Bayesian methods yielded gene trees with better stemminess values. The only gene tree estimation approach whose ASTRAL summary trees consistently produced the most likely correct topology, however, was IQ-Tree with automated model designation (ModelFinder program). We discuss these findings and provide practical advice on gene tree estimation for summary methods. Lastly, we establish the first phylogeny-informed classification for Pseudapis s. l. and map the distribution of distinct morphological features of the group. [ASTRAL; Bees; concordance; gene tree estimation error; IQ-Tree; MrBayes; Nomiinae; PhyloBayes; RAxML; phylogenomics; stemminess].
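The stemminess metric mentioned in this abstract is glossed there as the relative length of internal branches. The sketch below implements one simple reading of that gloss: the summed length of internal branches divided by the total tree length. This definition is an assumption, since the abstract does not give the exact formula used in the study, and the tree encoding and function names are hypothetical.

```python
def branch_sums(node, is_root=True):
    """Return (internal_length, total_length) for a subtree.

    A node is a pair (content, branch_length): content is a leaf label (str)
    or a list of child nodes. The root's own branch length is ignored.
    """
    content, length = node
    is_internal = isinstance(content, list)
    internal = total = 0.0
    if not is_root:
        total += length
        if is_internal:
            internal += length
    if is_internal:
        for child in content:
            child_internal, child_total = branch_sums(child, is_root=False)
            internal += child_internal
            total += child_total
    return internal, total


def stemminess(tree):
    """Share of total tree length carried by internal branches (assumed definition)."""
    internal, total = branch_sums(tree)
    return internal / total if total else 0.0


# Toy gene tree ((A:0.1,B:0.1):0.2,C:0.3,D:0.3) in the encoding above.
toy_tree = ([([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.3), ("D", 0.3)], 0.0)
print(round(stemminess(toy_tree), 3))
```

Under this reading, a tree dominated by long terminal branches scores low, which is the kind of contrast the abstract draws between inference methods.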
Collapse
Affiliation(s)
- Silas Bossert
- Department of Entomology, Cornell University, Comstock Hall, Ithaca, NY 14853, USA.,Department of Entomology, National Museum of Natural History, Smithsonian Institution, Washington, DC 20560, USA.,Department of Entomology, Washington State University, Pullman, Washington 99164, USA
| | - Elizabeth A Murray
- Department of Entomology, National Museum of Natural History, Smithsonian Institution, Washington, DC 20560, USA.,Department of Entomology, Washington State University, Pullman, Washington 99164, USA
| | - Alain Pauly
- O.D. Taxonomy and Phylogeny, Royal Belgian Institute of Natural Sciences, Rue Vautier 29, 1000 Brussels, Belgium
| | - Kyrylo Chernyshov
- College of Arts and Sciences, Cornell University, Ithaca, NY 14853, USA
| | - Seán G Brady
- Department of Entomology, National Museum of Natural History, Smithsonian Institution, Washington, DC 20560, USA
| | - Bryan N Danforth
- Department of Entomology, Cornell University, Comstock Hall, Ithaca, NY 14853, USA
|
14
|
Blaimer BB, Gotzek D, Brady SG, Buffington ML. Comprehensive phylogenomic analyses re-write the evolution of parasitism within cynipoid wasps. BMC Evol Biol 2020; 20:155. [PMID: 33228574 PMCID: PMC7686688 DOI: 10.1186/s12862-020-01716-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 08/07/2020] [Accepted: 10/31/2020] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Parasitoidism, a specialized life strategy in which a parasite eventually kills its host, is frequently found within the insect order Hymenoptera (wasps, ants and bees). A parasitoid lifestyle is one of two dominant life strategies within the hymenopteran superfamily Cynipoidea, with the other being an unusual plant-feeding behavior known as galling. Less commonly, cynipoid wasps exhibit inquilinism, a strategy where some species have adapted to usurp other species' galls instead of inducing their own. Using a phylogenomic data set of ultraconserved elements from nearly all lineages of Cynipoidea, we here generate a robust phylogenetic framework and timescale to understand cynipoid systematics and the evolution of these life histories. RESULTS Our reconstructed evolutionary history for Cynipoidea differs considerably from previous hypotheses. Rooting our analyses with non-cynipoid outgroups, the Paraulacini, a group of inquilines, emerged as sister-group to the rest of Cynipoidea, rendering the gall wasp family Cynipidae paraphyletic. The families Ibaliidae and Liopteridae, long considered archaic and early-branching parasitoid lineages, were found nested well within the Cynipoidea as sister-group to the parasitoid Figitidae. Cynipoidea originated in the early Jurassic around 190 Ma. Either inquilinism or parasitoidism is suggested as the ancestral and dominant strategy throughout the early evolution of cynipoids, depending on whether a simple (three states: parasitoidism, inquilinism and galling) or more complex (seven states: parasitoidism, inquilinism and galling split by host use) model is employed. CONCLUSIONS Our study has significant impact on understanding cynipoid evolution and highlights the importance of adequate outgroup sampling. We discuss the evolutionary timescale of the superfamily in relation to their insect hosts and host plants, and outline how phytophagous galling behavior may have evolved from entomophagous, parasitoid cynipoids. 
Our study has established the framework for further physiological and comparative genomic work between gall-making, inquiline and parasitoid lineages, which could also have significant implications for the evolution of diverse life histories in other Hymenoptera.
Affiliation(s)
- Bonnie B Blaimer
- Center for Integrative Biodiversity Discovery, Museum für Naturkunde, Berlin, Germany.
- National Museum of Natural History, Smithsonian Institution, Washington, DC, USA.
- North Carolina State University, Raleigh, NC, USA.
| | - Dietrich Gotzek
- National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
| | - Seán G Brady
- National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
| | - Matthew L Buffington
- Systematic Entomology Laboratory, ARS-USDA, C/O NMNH, Smithsonian Institution, Washington, DC, USA.
|
15
|
Rhodes JA. Topological Metrizations of Trees, and New Quartet Methods of Tree Inference. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:2107-2118. [PMID: 31095496 PMCID: PMC7650847 DOI: 10.1109/tcbb.2019.2917204] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 05/28/2023]
Abstract
Topological phylogenetic trees can be assigned edge weights in several natural ways, highlighting different aspects of the tree. Here, the rooted triple and quartet metrizations are introduced, and applied to formulate novel methods of inferring large trees from rooted triple and quartet data. These methods lead to new statistically consistent procedures for inference of a species tree from gene trees under the multispecies coalescent model.
|
16
|
Layton KKS, Carvajal JI, Wilson NG. Mimicry and mitonuclear discordance in nudibranchs: New insights from exon capture phylogenomics. Ecol Evol 2020; 10:11966-11982. [PMID: 33209263 PMCID: PMC7664011 DOI: 10.1002/ece3.6727] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/13/2020] [Revised: 07/09/2020] [Accepted: 07/10/2020] [Indexed: 11/29/2022] Open
Abstract
Phylogenetic inference and species delimitation can be challenging in taxonomic groups that have recently radiated and where introgression produces conflicting gene trees, especially when species delimitation has traditionally relied on mitochondrial data and color pattern. Chromodoris, a genus of colorful and toxic nudibranch in the Indo-Pacific, has been shown to have extraordinary cryptic diversity and mimicry, and has recently radiated, ultimately complicating species delimitation. In these cases, additional genome-wide data can help improve phylogenetic resolution and provide important insights about evolutionary history. Here, we employ a transcriptome-based exon capture approach to resolve Chromodoris phylogeny with data from 2,925 exons and 1,630 genes, derived from 15 nudibranch transcriptomes. We show that some previously identified mimics instead show mitonuclear discordance, likely deriving from introgression or mitochondrial capture, but we confirm one "pure" mimic in Western Australia. Sister-species relationships and species-level entities were recovered with high support in both concatenated maximum likelihood (ML) and summary coalescent phylogenies, but the ML topologies were highly variable while the coalescent topologies were consistent across datasets. Our work also demonstrates the broad phylogenetic utility of 149 genes that were previously identified from eupulmonate gastropods. This study is one of the first to (a) demonstrate the efficacy of exon capture for recovering relationships among recently radiated invertebrate taxa, (b) employ genome-wide nuclear markers to test mimicry hypotheses in nudibranchs and (c) provide evidence for introgression and mitochondrial capture in nudibranchs.
Affiliation(s)
- Kara K. S. Layton
- Centre for Evolutionary Biology, School of Biological Sciences, University of Western Australia, Crawley, WA, Australia
- Collections & Research, Western Australian Museum, Welshpool, WA, Australia
- School of Biological Sciences, Zoology Building, University of Aberdeen, Aberdeen, UK
| | - Jose I. Carvajal
- Collections & Research, Western Australian Museum, Welshpool, WA, Australia
| | - Nerida G. Wilson
- Centre for Evolutionary Biology, School of Biological Sciences, University of Western Australia, Crawley, WA, Australia
- Collections & Research, Western Australian Museum, Welshpool, WA, Australia
|
17
|
Portik DM, Wiens JJ. Do Alignment and Trimming Methods Matter for Phylogenomic (UCE) Analyses? Syst Biol 2020; 70:440-462. [PMID: 32797207 DOI: 10.1093/sysbio/syaa064] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 05/15/2019] [Revised: 08/02/2020] [Accepted: 08/03/2020] [Indexed: 11/14/2022] Open
Abstract
Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from hundreds or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these data sets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data (~5000 loci) and subsampled data sets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic data sets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several "best practices" for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the phylogenetic results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses.
[Alignment; concatenated analysis; phylogenomics; sequence length heterogeneity; species-tree analysis; trimming].
Affiliation(s)
- Daniel M Portik
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA; California Academy of Sciences, San Francisco, CA 94118, USA
| | - John J Wiens
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA
|
18
|
Yourdkhani S, Rhodes JA. Inferring Metric Trees from Weighted Quartets via an Intertaxon Distance. Bull Math Biol 2020; 82:97. [PMID: 32676801 DOI: 10.1007/s11538-020-00773-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/11/2020] [Accepted: 07/02/2020] [Indexed: 11/24/2022]
Abstract
A metric phylogenetic tree relating a collection of taxa induces weighted rooted triples and weighted quartets for all subsets of three and four taxa, respectively. New intertaxon distances are defined that can be calculated from these weights, and shown to exactly fit the same tree topology, but with edge weights rescaled by certain factors dependent on the associated split size. These distances are analogs for metric trees of similar ones recently introduced for topological trees that are based on induced unweighted rooted triples and quartets. The distances introduced here lead to new statistically consistent methods of inferring a metric species tree from a collection of topological gene trees generated under the multispecies coalescent model of incomplete lineage sorting. Simulations provide insight into their potential.
Affiliation(s)
- Samaneh Yourdkhani
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, 99775, USA
| | - John A Rhodes
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, 99775, USA.
|
19
|
Yang L, Su D, Chang X, Foster CS, Sun L, Huang CH, Zhou X, Zeng L, Ma H, Zhong B. Phylogenomic Insights into Deep Phylogeny of Angiosperms Based on Broad Nuclear Gene Sampling. Plant Communications 2020; 1:100027. [PMID: 33367231 PMCID: PMC7747974 DOI: 10.1016/j.xplc.2020.100027] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Received: 12/12/2019] [Revised: 01/23/2020] [Accepted: 01/25/2020] [Indexed: 05/02/2023]
Abstract
Angiosperms (flowering plants) are the most diverse and species-rich group of plants. The vast majority (∼99.95%) of angiosperms form a clade called Mesangiospermae, which is subdivided into five major groups: eudicots, monocots, magnoliids, Chloranthales, and Ceratophyllales. The relationships among these Mesangiospermae groups have been the subject of long debate. In this study, we assembled a phylogenomic dataset of 1594 genes from 151 angiosperm taxa, including representatives of all five lineages, to investigate the phylogeny of major angiosperm lineages under both coalescent- and concatenation-based methods. We dissected the phylogenetic signal and found that more than half of the genes lack phylogenetic information for the backbone of angiosperm phylogeny. We further removed the genes with weak phylogenetic signal and showed that eudicots, Ceratophyllales, and Chloranthales form a clade, with magnoliids and monocots being the next successive sister lineages. Similar frequencies of gene tree conflict are suggestive of incomplete lineage sorting along the backbone of the angiosperm phylogeny. Our analyses suggest that a fully bifurcating species tree may not be the best way to represent the early radiation of angiosperms. Meanwhile, we inferred that the crown-group angiosperms originated approximately between 255.1 and 222.2 million years ago, and Mesangiospermae diversified into the five extant groups in a short time span (∼27 million years) at the Early to Late Jurassic.
Affiliation(s)
- Lingxiao Yang
- College of Life Sciences, Nanjing Normal University, Nanjing, China
| | - Danyan Su
- College of Life Sciences, Nanjing Normal University, Nanjing, China
| | - Xin Chang
- College of Life Sciences, Nanjing Normal University, Nanjing, China
| | - Charles S.P. Foster
- School of Life and Environmental Sciences, University of Sydney, Sydney, Australia
| | - Linhua Sun
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
| | - Chien-Hsun Huang
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Ministry of Education Key Laboratory of Biodiversity Sciences and Ecological Engineering, School of Life Sciences, Fudan University, Shanghai, China
| | - Xiaofan Zhou
- Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, China
| | - Liping Zeng
- Institute for Integrative Genome Biology and Department of Botany and Plant Sciences, University of California, Riverside, CA, USA
| | - Hong Ma
- Department of Biology, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, USA
| | - Bojian Zhong
- College of Life Sciences, Nanjing Normal University, Nanjing, China
|
20
|
Murphy B, Forest F, Barraclough T, Rosindell J, Bellot S, Cowan R, Golos M, Jebb M, Cheek M. A phylogenomic analysis of Nepenthes (Nepenthaceae). Mol Phylogenet Evol 2020; 144:106668. [DOI: 10.1016/j.ympev.2019.106668] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 06/24/2019] [Revised: 10/28/2019] [Accepted: 10/29/2019] [Indexed: 10/25/2022]
|
21
|
Hu Y, Xing W, Hu Z, Liu G. Phylogenetic Analysis and Substitution Rate Estimation of Colonial Volvocine Algae Based on Mitochondrial Genomes. Genes (Basel) 2020; 11:genes11010115. [PMID: 31968709 PMCID: PMC7016891 DOI: 10.3390/genes11010115] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/11/2019] [Revised: 01/13/2020] [Accepted: 01/15/2020] [Indexed: 01/30/2023] Open
Abstract
We sequenced the mitochondrial genomes of six colonial volvocine algae, namely: Pandorina morum, Pandorina colemaniae, Volvulina compacta, Colemanosphaera angeleri, Colemanosphaera charkowiensis, and Yamagishiella unicocca. Previous studies have typically reconstructed the phylogenetic relationship between colonial volvocine algae based on chloroplast or nuclear genes. Here, we explore the validity of phylogenetic analysis based on mitochondrial protein-coding genes. We found phylogenetic incongruence of the genera Yamagishiella and Colemanosphaera. In Yamagishiella, the stochastic error and linkage group formed by the mitochondrial protein-coding genes prevent phylogenetic analyses from reflecting the true relationship. In Colemanosphaera, a different reconstruction approach revealed a different phylogenetic relationship. This incongruence may be because of the influence of biological factors, such as incomplete lineage sorting or horizontal gene transfer. We also analyzed the substitution rates in the mitochondrial and chloroplast genomes between colonial volvocine algae. Our results showed that all volvocine species showed significantly higher substitution rates for the mitochondrial genome compared with the chloroplast genome. The nonsynonymous substitution (dN)/synonymous substitution (dS) ratio is similar in the genomes of both organelles in most volvocine species, suggesting that the two counterparts are under a similar selection pressure. We also identified a few chloroplast protein-coding genes that showed high dN/dS ratios in some species, resulting in a significant dN/dS ratio difference between the mitochondrial and chloroplast genomes.
Affiliation(s)
- Yuxin Hu
- Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
- School of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Weiyue Xing
- Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
- School of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zhengyu Hu
- State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
| | - Guoxiang Liu
- Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
- Correspondence: ; Tel.: +86-027-6878-0576
|
22
|
Christensen S, Molloy EK, Vachaspati P, Yammanuru A, Warnow T. Non-parametric correction of estimated gene trees using TRACTION. Algorithms Mol Biol 2020; 15:1. [PMID: 31911812 PMCID: PMC6942343 DOI: 10.1186/s13015-019-0161-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/09/2019] [Accepted: 12/18/2019] [Indexed: 11/16/2022] Open
Abstract
MOTIVATION Estimated gene trees are often inaccurate, due to insufficient phylogenetic signal in the single gene alignment, among other causes. Gene tree correction aims to improve the accuracy of an estimated gene tree by using computational techniques along with auxiliary information, such as a reference species tree or sequencing data. However, gene trees and species trees can differ as a result of gene duplication and loss (GDL), incomplete lineage sorting (ILS), and other biological processes. Thus gene tree correction methods need to take estimation error as well as gene tree heterogeneity into account. Many prior gene tree correction methods have been developed for the case where GDL is present. RESULTS Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to ILS and/or HGT. We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-optimal tree refinement and completion (RF-OTRC) Problem, which seeks a refinement and completion of a singly-labeled gene tree with respect to a given singly-labeled species tree so as to minimize the Robinson-Foulds (RF) distance. Our extensive simulation study on 68,000 estimated gene trees shows that TRACTION matches or improves on the accuracy of well-established methods from the GDL literature when HGT and ILS are both present, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. We also show that a naive generalization of the RF-OTRC problem to multi-labeled trees is possible, but can produce misleading results where gene tree heterogeneity is due to GDL.
|
23
|
Springer MS, Molloy EK, Sloan DB, Simmons MP, Gatesy J. ILS-Aware Analysis of Low-Homoplasy Retroelement Insertions: Inference of Species Trees and Introgression Using Quartets. J Hered 2019; 111:147-168. [DOI: 10.1093/jhered/esz076] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/13/2019] [Accepted: 12/12/2019] [Indexed: 12/20/2022] Open
Abstract
DNA sequence alignments have provided the majority of data for inferring phylogenetic relationships with both concatenation and coalescent methods. However, DNA sequences are susceptible to extensive homoplasy, especially for deep divergences in the Tree of Life. Retroelement insertions have emerged as a powerful alternative to sequences for deciphering evolutionary relationships because these data are nearly homoplasy-free. In addition, retroelement insertions satisfy the “no intralocus-recombination” assumption of summary coalescent methods because they are singular events and better approximate neutrality relative to DNA loci commonly sampled in phylogenomic studies. Retroelements have traditionally been analyzed with parsimony, distance, and network methods. Here, we analyze retroelement data sets for vertebrate clades (Placentalia, Laurasiatheria, Balaenopteroidea, Palaeognathae) with 2 ILS-aware methods that operate by extracting, weighting, and then assembling unrooted quartets into a species tree. The first approach constructs a species tree from retroelement bipartitions with ASTRAL, and the second method is based on split-decomposition with parsimony. We also develop a Quartet-Asymmetry test to detect hybridization using retroelements. Both ILS-aware methods recovered the same species-tree topology for each data set. The ASTRAL species trees for Laurasiatheria have consecutive short branch lengths in the anomaly zone whereas Palaeognathae is outside of this zone. For the Balaenopteroidea data set, which includes rorquals (Balaenopteridae) and gray whale (Eschrichtiidae), both ILS-aware methods resolved balaenopterids as paraphyletic. Application of the Quartet-Asymmetry test to this data set detected 19 different quartets of species for which historical introgression may be inferred. Evidence for introgression was not detected in the other data sets.
Affiliation(s)
- Mark S Springer
- Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, CA
| | - Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL
| | - Daniel B Sloan
- Department of Biology, Colorado State University, Fort Collins, CO
| | - Mark P Simmons
- Department of Biology, Colorado State University, Fort Collins, CO
| | - John Gatesy
- Division of Vertebrate Zoology and Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY
|
24
|
Simmons MP, Kessenich J. Divergence and support among slightly suboptimal likelihood gene trees. Cladistics 2019; 36:322-340. [DOI: 10.1111/cla.12404] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Accepted: 09/11/2019] [Indexed: 12/18/2022] Open
Affiliation(s)
- Mark P. Simmons
- Department of Biology, Colorado State University, Fort Collins, CO 80523-1878, USA
| | - John Kessenich
- 305 W. Magnolia Street, PMB 134, Fort Collins, CO 80521, USA
|
25
|
Gatesy J, Sloan DB, Warren JM, Baker RH, Simmons MP, Springer MS. Partitioned coalescence support reveals biases in species-tree methods and detects gene trees that determine phylogenomic conflicts. Mol Phylogenet Evol 2019; 139:106539. [DOI: 10.1016/j.ympev.2019.106539] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 11/03/2018] [Revised: 06/10/2019] [Accepted: 06/17/2019] [Indexed: 12/26/2022]
|
26
|
Abstract
Green plants (Viridiplantae) include around 450,000-500,000 species [1,2] of great diversity and have important roles in terrestrial and aquatic ecosystems. Here, as part of the One Thousand Plant Transcriptomes Initiative, we sequenced the vegetative transcriptomes of 1,124 species that span the diversity of plants in a broad sense (Archaeplastida), including green plants (Viridiplantae), glaucophytes (Glaucophyta) and red algae (Rhodophyta). Our analysis provides a robust phylogenomic framework for examining the evolution of green plants. Most inferred species relationships are well supported across multiple species tree and supermatrix analyses, but discordance among plastid and nuclear gene trees at a few important nodes highlights the complexity of plant genome evolution, including polyploidy, periods of rapid speciation, and extinction. Incomplete sorting of ancestral variation, polyploidization and massive expansions of gene families punctuate the evolutionary history of green plants. Notably, we find that large expansions of gene families preceded the origins of green plants, land plants and vascular plants, whereas whole-genome duplications are inferred to have occurred repeatedly throughout the evolution of flowering plants and ferns. The increasing availability of high-quality plant genome sequences and advances in functional genomics are enabling research on genome evolution across the green tree of life.
|
27
|
Comparative Phylogenomics, a Stepping Stone for Bird Biodiversity Studies. Diversity (Basel) 2019. [DOI: 10.3390/d11070115] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Indexed: 01/02/2023]
Abstract
Birds are a group with immense availability of genomic resources, and hundreds of forthcoming genomes at the doorstep. We review recent developments in whole genome sequencing, phylogenomics, and comparative genomics of birds. Short read based genome assemblies are common, largely due to efforts of the Bird 10K genome project (B10K). Chromosome-level assemblies are expected to increase due to improved long-read sequencing. The available genomic data has enabled the reconstruction of the bird tree of life with increasing confidence and resolution, but challenges remain in the early splits of Neoaves due to their explosive diversification after the Cretaceous-Paleogene (K-Pg) event. Continued genomic sampling of the bird tree of life will not just better reflect their evolutionary history but also shine new light onto the organization of phylogenetic signal and conflict across the genome. The comparatively simple architecture of avian genomes makes them a powerful system to study the molecular foundation of bird specific traits. Birds are on the verge of becoming an extremely resourceful system to study biodiversity from the nucleotide up.
|
28
|
Statistical binning leads to profound model violation due to gene tree error incurred by trying to avoid gene tree error. Mol Phylogenet Evol 2019; 134:164-171. [DOI: 10.1016/j.ympev.2019.02.012] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 08/10/2018] [Revised: 11/30/2018] [Accepted: 02/14/2019] [Indexed: 11/19/2022]
|
29
|
Roch S, Nute M, Warnow T. Long-Branch Attraction in Species Tree Estimation: Inconsistency of Partitioned Likelihood and Topology-Based Summary Methods. Syst Biol 2019; 68:281-297. [PMID: 30247732 DOI: 10.1093/sysbio/syy061] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 03/16/2018] [Accepted: 09/12/2018] [Indexed: 11/13/2022] Open
Abstract
With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust Tree of Life can be constructed. However, "gene tree heterogeneity", which is when different genomic regions can evolve differently, is a common phenomenon in multi-locus data sets, and reduces the accuracy of standard methods for species tree estimation that do not take this heterogeneity into account. New methods have been developed for species tree estimation that specifically address gene tree heterogeneity, and that have been proven to converge to the true species tree when the number of loci and number of sites per locus both increase (i.e., the methods are said to be "statistically consistent"). Yet, little is known about the biologically realistic condition where the number of sites per locus is bounded. We show that when the sequence length of each locus is bounded (by any arbitrarily chosen value), the most common approaches to species tree estimation that take heterogeneity into account (i.e., traditional fully partitioned concatenated maximum likelihood and newer approaches, called summary methods, that estimate the species tree by combining estimated gene trees) are not statistically consistent, even when the heterogeneity is extremely constrained. The main challenge is the presence of conditions such as long branch attraction that create biased tree estimation when the number of sites is restricted. Hence, our study uncovers a fundamental challenge to species tree estimation using both traditional and new methods.
Affiliation(s)
- Sebastien Roch
- Department of Mathematics, University of Wisconsin-Madison, 480 Lincoln Dr, Madison, WI 53706, USA
| | - Michael Nute
- Department of Statistics, The University of Illinois at Urbana-Champaign, 725 S Wright St #101, Champaign, IL 61820, USA
| | - Tandy Warnow
- Department of Computer Science, The University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801-2302, USA
|
30
|
Allman ES, Long C, Rhodes JA. Species Tree Inference from Genomic Sequences Using the Log-Det Distance. SIAM Journal on Applied Algebra and Geometry 2019; 3:107-127. [PMID: 33163826 PMCID: PMC7643864 DOI: 10.1137/18m1194134] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 05/28/2023]
Abstract
The log-det distance between two aligned DNA sequences was introduced as a tool for statistically consistent inference of a gene tree under simple non-mixture models of sequence evolution. Here we prove that the log-det distance, coupled with a distance-based tree construction method, also permits consistent inference of species trees under mixture models appropriate to aligned genomic-scale sequences data. Data may include sites from many genetic loci, which evolved on different gene trees due to incomplete lineage sorting on an ultrametric species tree, with different time-reversible substitution processes. The simplicity and speed of distance-based inference suggests log-det based methods should serve as benchmarks for judging more elaborate and computationally-intensive species trees inference methods.
Affiliation(s)
- Elizabeth S Allman
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK
| | - Colby Long
- Mathematical Biosciences Institute, The Ohio State University, Columbus, OH
| | - John A Rhodes
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK
| |
|
31
|
Shin S, Clarke DJ, Lemmon AR, Moriarty Lemmon E, Aitken AL, Haddad S, Farrell BD, Marvaldi AE, Oberprieler RG, McKenna DD. Phylogenomic Data Yield New and Robust Insights into the Phylogeny and Evolution of Weevils. Mol Biol Evol 2019; 35:823-836. [PMID: 29294021 DOI: 10.1093/molbev/msx324] [Citation(s) in RCA: 59] [Impact Index Per Article: 11.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Abstract
The phylogeny and evolution of weevils (the beetle superfamily Curculionoidea) have been extensively studied, but many relationships, especially in the large family Curculionidae (true weevils; > 50,000 species), remain uncertain. We used phylogenomic methods to obtain DNA sequences from 522 protein-coding genes for representatives of all families of weevils and all subfamilies of Curculionidae. Most of our phylogenomic results had strong statistical support, and the inferred relationships were generally congruent with those reported in previous studies, but with some interesting exceptions. Notably, the backbone relationships of the weevil phylogeny were consistently strongly supported, and the former Nemonychidae (pine flower snout beetles) were polyphyletic, with the subfamily Cimberidinae (here elevated to Cimberididae) placed as sister group of all other weevils. The clade comprising the sister families Brentidae (straight-snouted weevils) and Curculionidae was maximally supported and the composition of both families was firmly established. The contributions of substitution modeling, codon usage and/or mutational bias to differences between trees reconstructed from amino acid and nucleotide sequences were explored. A reconstructed timetree for weevils is consistent with a Mesozoic radiation of gymnosperm-associated taxa to form most extant families and diversification of Curculionidae alongside flowering plants (first monocots, then other groups) beginning in the Cretaceous.
Affiliation(s)
- Seunggwan Shin
- Department of Biological Sciences, University of Memphis, Memphis, TN
| | - Dave J Clarke
- Department of Biological Sciences, University of Memphis, Memphis, TN
| | - Alan R Lemmon
- Department of Scientific Computing, Florida State University, Tallahassee, FL
| | | | | | - Stephanie Haddad
- Department of Biological Sciences, University of Memphis, Memphis, TN
| | - Brian D Farrell
- Museum of Comparative Zoology, Harvard University, Cambridge, MA
| | - Adriana E Marvaldi
- CONICET, División Entomología, Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina
| | | | - Duane D McKenna
- Department of Biological Sciences, University of Memphis, Memphis, TN
| |
|
32
|
Simmons MP, Sloan DB, Springer MS, Gatesy J. Gene-wise resampling outperforms site-wise resampling in phylogenetic coalescence analyses. Mol Phylogenet Evol 2019; 131:80-92. [DOI: 10.1016/j.ympev.2018.10.001] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2018] [Accepted: 10/01/2018] [Indexed: 01/15/2023]
|
33
|
Borowiec ML. Convergent Evolution of the Army Ant Syndrome and Congruence in Big-Data Phylogenetics. Syst Biol 2019; 68:642-656. [DOI: 10.1093/sysbio/syy088] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2018] [Revised: 11/09/2018] [Accepted: 12/15/2018] [Indexed: 11/12/2022] Open
Affiliation(s)
- Marek L Borowiec
- Department of Entomology, Plant Pathology and Nematology, 875 Perimeter Drive, University of Idaho, Moscow, ID 83844, USA
- School of Life Sciences, Social Insect Research Group, Arizona State University, Tempe, AZ 85287, USA
- Department of Entomology and Nematology, One Shields Avenue, University of California at Davis, Davis, CA 95616, USA
| |
|
34
|
Adams RH, Castoe TA. Supergene validation: A model-based protocol for assessing the accuracy of non-model-based supergene methods. MethodsX 2019; 6:2181-2188. [PMID: 31667118 PMCID: PMC6812401 DOI: 10.1016/j.mex.2019.09.025] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2019] [Accepted: 09/19/2019] [Indexed: 11/16/2022] Open
Abstract
Genome-scale species tree inference is largely restricted to heuristic approaches that use estimated gene trees to reconstruct species-level relationships. Central to these heuristic species tree methods is the assumption that the gene trees are estimated without error. To increase the accuracy of input gene trees used to infer species trees, several techniques have recently been developed for constructing longer “supergenes” that represent sets of loci inferred to share the same genealogical history. While these supergene methods are designed to increase the amount of data for gene tree estimation by concatenating several loci into “supergenes” to increase gene tree accuracy, no formal protocols have been proposed to validate this key “supergene” concatenation step. In a recent study, we developed several supergene validation strategies for assessing the accuracy of a popular supergene method: the so-called “statistical binning” pipeline. In this article, we describe a more generalizable and model-based “supergene validation” protocol for assessing the accuracy of supergenes and supergene methods using model-based tests of phylogenetic congruency. Supergenes are validated by adopting model-based tests of topological congruence. These model-based procedures outperform non-model-based methods for supergene construction. The results of this protocol can be used to assess the overall performance of a supergene method across a phylogenomic dataset.
|
35
|
Mclean BS, Bell KC, Allen JM, Helgen KM, Cook JA. Impacts of Inference Method and Data set Filtering on Phylogenomic Resolution in a Rapid Radiation of Ground Squirrels (Xerinae: Marmotini). Syst Biol 2018; 68:298-316. [DOI: 10.1093/sysbio/syy064] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2017] [Accepted: 09/12/2018] [Indexed: 12/20/2022] Open
Affiliation(s)
- Bryan S Mclean
- Department of Biology and Museum of Southwestern Biology, 1 University of New Mexico, MSC03-2020, Albuquerque, NM 87131, USA
- Florida Museum of Natural History, University of Florida, 1659 Museum Road, Gainesville, FL 32611, USA
| | - Kayce C Bell
- Department of Biology and Museum of Southwestern Biology, 1 University of New Mexico, MSC03-2020, Albuquerque, NM 87131, USA
- Department of Invertebrate Zoology, Smithsonian Institution National Museum of Natural History, P.O. Box 37012, MRC 163, Washington, DC 20013-7012, USA
| | - Julie M Allen
- Department of Biology, University of Nevada, 1664 N. Virginia Street, Reno, NV 89557, USA
| | - Kristofer M Helgen
- Department of Ecology and Evolutionary Biology, School of Biological Sciences, University of Adelaide, North Terrace, Adelaide SA 5005, Australia
| | - Joseph A Cook
- Department of Biology and Museum of Southwestern Biology, 1 University of New Mexico, MSC03-2020, Albuquerque, NM 87131, USA
| |
|
36
|
Blaimer BB, Mawdsley JR, Brady SG. Multiple origins of sexual dichromatism and aposematism within large carpenter bees. Evolution 2018; 72:1874-1889. [PMID: 30039868 DOI: 10.1111/evo.13558] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2017] [Revised: 07/05/2018] [Accepted: 07/15/2018] [Indexed: 12/24/2022]
Abstract
The evolution of reversed sexual dichromatism and aposematic coloration has long been of interest to both theoreticians and empiricists. Yet despite the potential connections between these phenomena, they have seldom been jointly studied. Large carpenter bees (genus Xylocopa) are a promising group for such comparative investigations as they are a diverse clade in which both aposematism and reversed sexual dichromatism can occur either together or separately. We investigated the evolutionary history of dichromatism and aposematism and a potential correlation of these traits with diversification rates within Xylocopa, using a newly generated phylogeny for 179 Xylocopa species based on ultraconserved elements (UCEs). A monochromatic, inconspicuous ancestor is indicated for the genus, with subsequent convergent evolution of sexual dichromatism and aposematism in multiple lineages. Aposematism is found to covary with reversed sexual dichromatism in many species; however, reversed dichromatism also evolved in non-aposematic species. Bayesian Analysis of Macroevolutionary Models (BAMM) did not show increased diversification in any specific clade in Xylocopa, whereas support from Hidden State Speciation and Extinction (HiSSE) models remained inconclusive regarding an association of increased diversification rates with dichromatism or aposematism. We discuss the evolution of color patterns and diversification in Xylocopa by considering potential drivers of dichromatism and aposematism.
Affiliation(s)
- Bonnie B Blaimer
- Department of Entomology, National Museum of Natural History, Smithsonian Institution, Washington, District of Columbia, 20560
- Department of Entomology and Plant Pathology, North Carolina State University, Raleigh, North Carolina, 27695
| | - Jonathan R Mawdsley
- Department of Entomology, National Museum of Natural History, Smithsonian Institution, Washington, District of Columbia, 20560
| | - Seán G Brady
- Department of Entomology, National Museum of Natural History, Smithsonian Institution, Washington, District of Columbia, 20560
| |
|
37
|
Herrando-Moraira S. Exploring data processing strategies in NGS target enrichment to disentangle radiations in the tribe Cardueae (Compositae). Mol Phylogenet Evol 2018; 128:69-87. [PMID: 30036700 DOI: 10.1016/j.ympev.2018.07.012] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2018] [Revised: 07/13/2018] [Accepted: 07/14/2018] [Indexed: 12/17/2022]
Abstract
Target enrichment is a cost-effective sequencing technique that holds promise for elucidating evolutionary relationships in fast-evolving lineages. However, potential biases and impact of bioinformatic sequence treatments in phylogenetic inference have not been thoroughly explored yet. Here, we investigate this issue with an ultimate goal to shed light into a highly diversified group of Compositae (Asteraceae) constituted by four main genera: Arctium, Cousinia, Saussurea, and Jurinea. Specifically, we compared sequence data extraction methods implemented in two easy-to-use workflows, PHYLUCE and HybPiper, and assessed the impact of two filtering practices intended to reduce phylogenetic noise. In addition, we compared two phylogenetic inference methods: (1) the concatenation approach, in which all loci were concatenated in a supermatrix; and (2) the coalescence approach, in which gene trees were produced independently and then used to construct a species tree under coalescence assumptions. Here we confirm the usefulness of the set of 1061 COS targets (a nuclear conserved orthology loci set developed for the Compositae) across a variety of taxonomic levels. Intergeneric relationships were completely resolved: there are two sister groups, Arctium-Cousinia and Saussurea-Jurinea, which are in agreement with a morphological hypothesis. Intrageneric relationships among species of Arctium, Cousinia, and Saussurea are also well defined. Conversely, conflicting species relationships remain for Jurinea. Methodological choices significantly affected phylogenies in terms of topology, branch length, and support. Across all analyses, the phylogeny obtained using HybPiper and the strictest scheme of removing fast-evolving sites was estimated as the optimal. 
Regarding methodological choices, we conclude that: (1) trees obtained under the coalescence approach are topologically more congruent between them than those inferred using the concatenation approach; (2) refining treatments only improved support values under the concatenation approach; and (3) branch support values are maximized when fast-evolving sites are removed in the concatenation approach, and when a higher number of loci is analyzed in the coalescence approach.
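The concatenation approach described in this abstract can be sketched in a few lines: per-locus alignments are joined end to end into a supermatrix, and taxa missing from a locus are padded with a missing-data character. The sketch below is only an illustration under stated assumptions. The function name, the `?` padding character, and the `locusN` partition labels are hypothetical and are not taken from PHYLUCE, HybPiper, or the cited study.

```python
def concatenate_loci(loci, missing="?"):
    """Join per-locus alignments (dicts of taxon -> aligned sequence) into one supermatrix.

    Returns (supermatrix, partitions), where partitions lists
    (label, first_column, last_column) with 1-based inclusive coordinates.
    """
    taxa = sorted({taxon for locus in loci for taxon in locus})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for index, locus in enumerate(loci, start=1):
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {index} is empty or not aligned to a single length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this locus are padded with missing data.
            pieces[taxon].append(locus.get(taxon, missing * length))
        partitions.append((f"locus{index}", start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions
```

The partition coordinates are kept so that a partitioned analysis can still assign a separate substitution model to each locus. The coalescence approach would instead estimate one tree per locus and pass those trees to a summary method.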
Affiliation(s)
- Sonia Herrando-Moraira
- Botanic Institute of Barcelona (IBB, CSIC-ICUB), Pg. del Migdia, s.n., 08038 Barcelona, Spain.
| | | |
|
38
|
Sayyari E, Whitfield JB, Mirarab S. Fragmentary Gene Sequences Negatively Impact Gene Tree and Species Tree Reconstruction. Mol Biol Evol 2018; 34:3279-3291. [PMID: 29029241 DOI: 10.1093/molbev/msx261] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open
Abstract
Species tree reconstruction from genome-wide data is increasingly being attempted, in most cases using a two-step approach of first estimating individual gene trees and then summarizing them to obtain a species tree. The accuracy of this approach, which promises to account for gene tree discordance, depends on the quality of the inferred gene trees. At the same time, phylogenomic and phylotranscriptomic analyses typically use involved bioinformatics pipelines for data preparation. Errors and shortcomings resulting from these preprocessing steps may impact the species tree analyses at the other end of the pipeline. In this article, we first show that the presence of fragmentary data for some species in a gene alignment, as often seen on real data, can result in substantial deterioration of gene trees, and as a result, the species tree. We then investigate a simple filtering strategy where individual fragmentary sequences are removed from individual genes but the rest of the gene is retained. Both in simulations and by reanalyzing a large insect phylotranscriptomic data set, we show the effectiveness of this simple filtering strategy.
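The filtering strategy mentioned in this abstract (dropping individual fragmentary sequences while keeping the rest of the gene alignment) might look like the minimal sketch below. The occupancy threshold of 0.5, the gap characters, and the function name are illustrative assumptions. The abstract does not specify how "fragmentary" is defined, so the criterion used here (fraction of non-gap columns) is a stand-in.

```python
def filter_fragmentary(alignment, min_occupancy=0.5, gap_chars="-?"):
    """Return a copy of one gene alignment without its fragmentary sequences.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    A sequence is kept when its fraction of non-gap characters is at least
    min_occupancy (a hypothetical cutoff chosen for illustration).
    """
    kept = {}
    for taxon, seq in alignment.items():
        if not seq:
            continue  # an empty row carries no data
        non_gap = sum(1 for ch in seq if ch not in gap_chars)
        if non_gap / len(seq) >= min_occupancy:
            kept[taxon] = seq
    return kept
```

The filter is applied gene by gene, so a species removed from one alignment can still contribute to every other gene in which its sequence is reasonably complete.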
Affiliation(s)
- Erfan Sayyari
- Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA
| | | | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA
| |
|
39
|
Abdelkrim J, Aznar-Cormano L, Fedosov AE, Kantor YI, Lozouet P, Phuong MA, Zaharias P, Puillandre N. Exon-Capture-Based Phylogeny and Diversification of the Venomous Gastropods (Neogastropoda, Conoidea). Mol Biol Evol 2018; 35:2355-2374. [DOI: 10.1093/molbev/msy144] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Affiliation(s)
- Jawad Abdelkrim
- Outils et Méthodes de la Systématique Intégrative (OMSI) UMS 2700, Muséum National d’Histoire Naturelle, Paris, France
- Institut Systématique Evolution Biodiversité (ISYEB), Muséum national d’Histoire naturelle, CNRS, Sorbonne Université, EPHE, 57 rue Cuvier, CP 26, 75005 Paris, France
| | - Laetitia Aznar-Cormano
- Outils et Méthodes de la Systématique Intégrative (OMSI) UMS 2700, Muséum National d’Histoire Naturelle, Paris, France
- Institut Systématique Evolution Biodiversité (ISYEB), Muséum national d’Histoire naturelle, CNRS, Sorbonne Université, EPHE, 57 rue Cuvier, CP 26, 75005 Paris, France
| | - Alexander E Fedosov
- A.N. Severtzov Institute of Ecology and Evolution, Russian Academy of Sciences, Leninski prospect 33, 119071 Moscow, Russian Federation
| | - Yuri I Kantor
- A.N. Severtzov Institute of Ecology and Evolution, Russian Academy of Sciences, Leninski prospect 33, 119071 Moscow, Russian Federation
| | - Pierre Lozouet
- Muséum National d’Histoire Naturelle, Direction des Collections, 55, rue Buffon, 75005 Paris, France
| | - Mark A Phuong
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095, USA
| | - Paul Zaharias
- Institut Systématique Evolution Biodiversité (ISYEB), Muséum national d’Histoire naturelle, CNRS, Sorbonne Université, EPHE, 57 rue Cuvier, CP 26, 75005 Paris, France
| | - Nicolas Puillandre
- Institut Systématique Evolution Biodiversité (ISYEB), Muséum national d’Histoire naturelle, CNRS, Sorbonne Université, EPHE, 57 rue Cuvier, CP 26, 75005 Paris, France
| |
|
40
|
SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space. Mol Phylogenet Evol 2018. [DOI: 10.1016/j.ympev.2018.03.006] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
41
|
Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics 2018; 19:153. [PMID: 29745866 PMCID: PMC5998893 DOI: 10.1186/s12859-018-2129-y] [Citation(s) in RCA: 1036] [Impact Index Per Article: 172.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open
Abstract
Background Evolutionary histories can be discordant across the genome, and such discordances need to be considered in reconstructing the species phylogeny. ASTRAL is one of the leading methods for inferring species trees from gene trees while accounting for gene tree discordance. ASTRAL uses dynamic programming to search for the tree that shares the maximum number of quartet topologies with input gene trees, restricting itself to a predefined set of bipartitions. Results We introduce ASTRAL-III, which substantially improves the running time of ASTRAL-II and guarantees polynomial running time as a function of both the number of species (n) and the number of genes (k). ASTRAL-III limits the bipartition constraint set (X) to grow at most linearly with n and k. Moreover, it handles polytomies more efficiently than ASTRAL-II, exploits similarities between gene trees better, and uses several techniques to avoid searching parts of the search space that are mathematically guaranteed not to include the optimal tree. The asymptotic running time of ASTRAL-III in the presence of polytomies is $O\left((nk)^{1.726} D\right)$, where $D = O(nk)$ is the sum of degrees of all unique nodes in input trees. The running time improvements enable us to test whether contracting low support branches in gene trees improves the accuracy by reducing noise. In extensive simulations, we show that removing branches with very low support (e.g., below 10%) improves accuracy while overly aggressive filtering is harmful. We observe on a biological avian phylogenomic dataset of 14K genes that contracting low support branches greatly improves results. Conclusions ASTRAL-III is a faster version of the ASTRAL method for phylogenetic reconstruction and can scale up to 10,000 species. With ASTRAL-III, low support branches can be removed, resulting in improved accuracy. Electronic supplementary material The online version of this article (10.1186/s12859-018-2129-y) contains supplementary material, which is available to authorized users.
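The optimization criterion named in this abstract, the number of quartet topologies a candidate species tree shares with the input gene trees, can be illustrated with a brute-force sketch. This is not ASTRAL's algorithm: ASTRAL searches with dynamic programming over a constrained bipartition set, whereas the code below simply enumerates every four-taxon subset and is practical only for very small trees. Trees are represented as nested tuples of leaf names, which is an assumption made for brevity, and all function names are hypothetical.

```python
from itertools import combinations


def clusters(tree):
    """Return (leaf_set, internal_clusters) for a tree written as nested tuples of leaf names."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found.extend(child_clusters)
    found.append(leaves)  # the leaf set below this internal node
    return leaves, found


def quartet_topology(tree_clusters, four):
    """Return the split of four taxa induced by the tree, or None if the quartet is unresolved."""
    four = frozenset(four)
    for cluster in tree_clusters:
        side = cluster & four
        if len(side) == 2:  # this cluster separates two of the taxa from the other two
            return frozenset([side, four - side])
    return None


def quartet_score(species_tree, gene_trees):
    """Count resolved gene-tree quartets that the species tree also displays."""
    sp_leaves, sp_clusters = clusters(species_tree)
    score = 0
    for gene_tree in gene_trees:
        g_leaves, g_clusters = clusters(gene_tree)
        shared = sorted(sp_leaves & g_leaves)
        for four in combinations(shared, 4):
            topology = quartet_topology(g_clusters, four)
            if topology is not None and topology == quartet_topology(sp_clusters, four):
                score += 1
    return score
```

Because unresolved quartets return `None` and are never counted, a gene tree whose low-support branches have been contracted into polytomies contributes fewer quartets to the score, which is the sense in which contraction removes noisy signal.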
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, University of California at San Diego, 9500 Gilman Drive, La Jolla, 92093-0021, CA, USA
| | - Maryam Rabiee
- Department of Computer Science and Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, 92093-0021, CA, USA
| | - Erfan Sayyari
- Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, 92093-0021, CA, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, 92093-0021, CA, USA.
| |
|
42
|
Affiliation(s)
- David Posada
- Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain
| |
|
43
|
Platt RN, Faircloth BC, Sullivan KAM, Kieran TJ, Glenn TC, Vandewege MW, Lee TE, Baker RJ, Stevens RD, Ray DA. Conflicting Evolutionary Histories of the Mitochondrial and Nuclear Genomes in New World Myotis Bats. Syst Biol 2018; 67:236-249. [PMID: 28945862 PMCID: PMC5837689 DOI: 10.1093/sysbio/syx070] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2017] [Revised: 07/31/2017] [Accepted: 08/15/2017] [Indexed: 01/05/2023] Open
Abstract
The rapid diversification of Myotis bats into more than 100 species is one of the most extensive mammalian radiations available for study. Efforts to understand relationships within Myotis have primarily utilized mitochondrial markers and trees inferred from nuclear markers lacked resolution. Our current understanding of relationships within Myotis is therefore biased towards a set of phylogenetic markers that may not reflect the history of the nuclear genome. To resolve this, we sequenced the full mitochondrial genomes of 37 representative Myotis, primarily from the New World, in conjunction with targeted sequencing of 3648 ultraconserved elements (UCEs). We inferred the phylogeny and explored the effects of concatenation and summary phylogenetic methods, as well as combinations of markers based on informativeness or levels of missing data, on our results. Of the 294 phylogenies generated from the nuclear UCE data, all are significantly different from phylogenies inferred using mitochondrial genomes. Even within the nuclear data, quartet frequencies indicate that around half of all UCE loci conflict with the estimated species tree. Several factors can drive such conflict, including incomplete lineage sorting, introgressive hybridization, or even phylogenetic error. Despite the degree of discordance between nuclear UCE loci and the mitochondrial genome and among UCE loci themselves, the most common nuclear topology is recovered in one quarter of all analyses with strong nodal support. Based on these results, we re-examine the evolutionary history of Myotis to better understand the phenomena driving their unique nuclear, mitochondrial, and biogeographic histories.
Affiliation(s)
- Roy N Platt
- Department of Biological Sciences, Texas Tech University, 2901 Main St, Lubbock, TX, USA
| | - Brant C Faircloth
- Department of Biological Sciences and Museum of Natural Science, Louisiana State University, 202 Life Science Building, Baton Rouge, LA, USA
| | - Kevin A M Sullivan
- Department of Biological Sciences, Texas Tech University, 2901 Main St, Lubbock, TX, USA
| | - Troy J Kieran
- Department of Environmental Health Science, University of Georgia, 206 Environmental Health Sciences Building, Athens, GA, USA
| | - Travis C Glenn
- Department of Environmental Health Science, University of Georgia, 206 Environmental Health Sciences Building, Athens, GA, USA
| | - Michael W Vandewege
- Department of Biological Sciences, Texas Tech University, 2901 Main St, Lubbock, TX, USA
| | - Thomas E Lee
- Department of Biology, Abilene Christian University, 1600 Campus Ct. Abilene, TX, USA
| | - Robert J Baker
- Department of Biological Sciences, Texas Tech University, 2901 Main St, Lubbock, TX, USA
| | - Richard D Stevens
- Natural Resource Management, Texas Tech University, 2901 Main St, Lubbock, TX, USA
| | - David A Ray
- Department of Biological Sciences, Texas Tech University, 2901 Main St, Lubbock, TX, USA
| |
|
44
|
Kates HR, Johnson MG, Gardner EM, Zerega NJC, Wickett NJ. Allele phasing has minimal impact on phylogenetic reconstruction from targeted nuclear gene sequences in a case study of Artocarpus. AMERICAN JOURNAL OF BOTANY 2018; 105:404-416. [PMID: 29729187 DOI: 10.1002/ajb2.1068] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/05/2017] [Accepted: 01/29/2018] [Indexed: 05/12/2023]
Abstract
PREMISE OF THE STUDY Untapped information about allele diversity within populations and individuals (i.e., heterozygosity) could improve phylogenetic resolution and accuracy. Many phylogenetic reconstructions ignore heterozygosity because it is difficult to assemble allele sequences and combine allele data across unlinked loci, and it is unclear how reconstruction methods accommodate variable sequences. We review the common methods of including heterozygosity in phylogenetic studies and present a novel method for assembling allele sequences from target-enriched Illumina sequencing libraries. METHODS We performed supermatrix phylogeny reconstruction and species tree estimation of Artocarpus based on three methods of accounting for heterozygous sequences: a consensus method based on de novo sequence assembly, the use of ambiguity characters, and a novel method for incorporating read information to phase alleles. We characterize the extent to which highly heterozygous sequences impeded phylogeny reconstruction and determine whether the use of allele sequences improves phylogenetic resolution or decreases topological uncertainty. KEY RESULTS We show here that it is possible to infer phased alleles from target-enriched Illumina libraries. We find that highly heterozygous sequences do not contribute disproportionately to poor phylogenetic resolution and that the use of allele sequences for phylogeny reconstruction does not have a clear effect on phylogenetic resolution or topological consistency. CONCLUSIONS We provide a framework for inferring phased alleles from target enrichment data and for assessing the contribution of allelic diversity to phylogenetic reconstruction. In our data set, the impact of allele phasing on phylogeny is minimal compared to the impact of using phylogenetic reconstruction methods that account for gene tree incongruence.
Affiliation(s)
- Heather Rose Kates
- Genetics Institute, University of Florida, P.O. Box 103610, Gainesville, FL, 32611, USA
- Florida Museum of Natural History, University of Florida, Gainesville, FL, 32611, USA
| | - Matthew G Johnson
- Department of Plant Sciences, Chicago Botanic Garden, 1000 Lake Cook Road, Glencoe, IL, 60022, USA
- Department of Biological Sciences, Texas Tech University, 2401 Main Street, Lubbock, TX, 79414, USA
| | - Elliot M Gardner
- Department of Plant Sciences, Chicago Botanic Garden, 1000 Lake Cook Road, Glencoe, IL, 60022, USA
- Plant Biology and Conservation, Northwestern University, 2205 Tech Drive Hogan 2-144, Evanston, IL, 60208, USA
| | - Nyree J C Zerega
- Department of Plant Sciences, Chicago Botanic Garden, 1000 Lake Cook Road, Glencoe, IL, 60022, USA
- Plant Biology and Conservation, Northwestern University, 2205 Tech Drive Hogan 2-144, Evanston, IL, 60208, USA
| | - Norman J Wickett
- Department of Plant Sciences, Chicago Botanic Garden, 1000 Lake Cook Road, Glencoe, IL, 60022, USA
- Plant Biology and Conservation, Northwestern University, 2205 Tech Drive Hogan 2-144, Evanston, IL, 60208, USA
| |
|
45
|
Springer MS, Gatesy J. Delimiting Coalescence Genes (C-Genes) in Phylogenomic Data Sets. Genes (Basel) 2018; 9:genes9030123. [PMID: 29495400 PMCID: PMC5867844 DOI: 10.3390/genes9030123] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2017] [Revised: 02/02/2018] [Accepted: 02/19/2018] [Indexed: 02/07/2023] Open
Abstract
Summary coalescence methods have emerged as a popular alternative for inferring species trees with large genomic datasets, because these methods explicitly account for incomplete lineage sorting. However, statistical consistency of summary coalescence methods is not guaranteed unless several model assumptions are true, including the critical assumption that recombination occurs freely among but not within coalescence genes (c-genes), which are the fundamental units of analysis for these methods. Each c-gene has a single branching history, and large sets of these independent gene histories should be the input for genome-scale coalescence estimates of phylogeny. By contrast, numerous studies have reported the results of coalescence analyses in which complete protein-coding sequences are treated as c-genes even though exons for these loci can span more than a megabase of DNA. Empirical estimates of recombination breakpoints suggest that c-genes may be much shorter, especially when large clades with many species are the focus of analysis. Although this idea has been challenged recently in the literature, the inverse relationship between c-gene size and increased taxon sampling in a dataset (the 'recombination ratchet') is a fundamental property of c-genes. For taxonomic groups characterized by genes with long intron sequences, complete protein-coding sequences are likely not valid c-genes and are inappropriate units of analysis for summary coalescence methods unless they occur in recombination deserts that are devoid of incomplete lineage sorting (ILS). Finally, it has been argued that coalescence methods are robust when the no-recombination within loci assumption is violated, but recombination must matter at some scale because ILS, a by-product of recombination, is the raison d'etre for coalescence methods. That is, extensive recombination is required to yield the large number of independently segregating c-genes used to infer a species tree. 
If coalescent methods are powerful enough to infer the correct species tree for difficult phylogenetic problems in the anomaly zone, where concatenation is expected to fail because of ILS, then there should be a decreasing probability of inferring the correct species tree using longer loci with many intralocus recombination breakpoints (i.e., increased levels of concatenation).
Affiliation(s)
- Mark S Springer
- Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, CA 92521, USA.
| | - John Gatesy
- Division of Vertebrate Zoology and Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA.
| |
|
46
|
Scornavacca C, Galtier N. Incomplete Lineage Sorting in Mammalian Phylogenomics. Syst Biol 2018; 66:112-120. [PMID: 28173480 DOI: 10.1093/sysbio/syw082] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2016] [Revised: 03/25/2016] [Accepted: 09/04/2016] [Indexed: 01/05/2023] Open
Abstract
The impact of incomplete lineage sorting (ILS) on phylogenetic conflicts among genes, and the related issue of whether to account for ILS in species tree reconstruction, are matters of intense controversy. Here, focusing on full-genome data in placental mammals, we empirically test two assumptions underlying current usage of tree-building methods that account for ILS. We show that in this data set (i) distinct exons from a common gene do not share a common genealogy, and (ii) ILS is only a minor determinant of the existing phylogenetic conflict. These results shed new light on the relevance and conditions of applicability of ILS-aware methods in phylogenomic analyses of protein coding sequences.
Affiliation(s)
- Celine Scornavacca
- UMR 5554-Institute of Evolutionary Sciences, University Montpellier, CNRS, IRD, EPHE, Place E. Bataillon-CC64, Montpellier, France
| | - Nicolas Galtier
- UMR 5554-Institute of Evolutionary Sciences, University Montpellier, CNRS, IRD, EPHE, Place E. Bataillon-CC64, Montpellier, France
| |
|
47
|
Streicher JW, Miller EC, Guerrero PC, Correa C, Ortiz JC, Crawford AJ, Pie MR, Wiens JJ. Evaluating methods for phylogenomic analyses, and a new phylogeny for a major frog clade (Hyloidea) based on 2214 loci. Mol Phylogenet Evol 2018; 119:128-143. [DOI: 10.1016/j.ympev.2017.10.013] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 05/29/2017] [Revised: 10/21/2017] [Accepted: 10/22/2017] [Indexed: 01/28/2023]
|
48
|
Abstract
Phylogenomics aims at reconstructing the evolutionary histories of organisms taking into account whole genomes or large fractions of genomes. The abundance of genomic data for an enormous variety of organisms has enabled phylogenomic inference of many groups, and this has motivated the development of many computer programs implementing the associated methods. This chapter surveys phylogenetic concepts and methods aimed at both gene tree and species tree reconstruction while also addressing common pitfalls, providing references to relevant computer programs. A practical phylogenomic analysis example including bacterial genomes is presented at the end of the chapter.
Affiliation(s)
- José S L Patané
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, Av. Prof. Lineu Prestes 748, São Paulo, SP, 05508-000, Brazil
| | - Joaquim Martins
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, Av. Prof. Lineu Prestes 748, São Paulo, SP, 05508-000, Brazil
| | - João C Setubal
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, Av. Prof. Lineu Prestes 748, São Paulo, SP, 05508-000, Brazil.
| |
|
49
|
Mallo D, Posada D. Multilocus inference of species trees and DNA barcoding. Philos Trans R Soc Lond B Biol Sci 2017; 371:rstb.2015.0335. [PMID: 27481787 PMCID: PMC4971187 DOI: 10.1098/rstb.2015.0335] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Accepted: 04/10/2016] [Indexed: 11/30/2022] Open
Abstract
The unprecedented amount of data resulting from next-generation sequencing has opened a new era in phylogenetic estimation. Although large datasets should, in theory, increase phylogenetic resolution, massive, multilocus datasets have uncovered a great deal of phylogenetic incongruence among different genomic regions, due both to stochastic error and to the action of different evolutionary processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer. This incongruence violates one of the fundamental assumptions of the DNA barcoding approach, which assumes that gene history and species history are identical. In this review, we explain some of the most important challenges we will have to face to reconstruct the history of species, and the advantages and disadvantages of different strategies for the phylogenetic analysis of multilocus data. In particular, we describe the evolutionary events that can generate species tree/gene tree discordance, compare the most popular methods for species tree reconstruction, highlight the challenges we need to face when using them and discuss their potential utility in barcoding. Current barcoding methods sacrifice a great amount of statistical power by only considering one locus, and a transition to multilocus barcodes would not only improve current barcoding methods, but also facilitate an eventual transition to species-tree-based barcoding strategies, which could better accommodate scenarios where the barcode gap is too small or nonexistent. This article is part of the themed issue ‘From DNA barcodes to biomes’.
Affiliation(s)
- Diego Mallo
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain
| | - David Posada
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain
| |
|
50
|
Molloy EK, Warnow T. To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods. Syst Biol 2017; 67:285-303. [DOI: 10.1093/sysbio/syx077] [Citation(s) in RCA: 138] [Impact Index Per Article: 19.7] [Received: 12/08/2016] [Accepted: 09/13/2017] [Indexed: 01/27/2023] Open
Affiliation(s)
- Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| |
|