1
Gallone B, Kuyper TW, Nuytinck J. The genus Cortinarius should not (yet) be split. IMA Fungus 2024; 15:24. [PMID: 39138570 PMCID: PMC11321212 DOI: 10.1186/s43008-024-00159-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Accepted: 07/25/2024] [Indexed: 08/15/2024] Open
Abstract
The genus Cortinarius (Agaricales, Basidiomycota) is one of the most species-rich fungal genera, with thousands of species reported. Cortinarius species are important ectomycorrhizal fungi and form associations with many vascular plants globally. Until recently, Cortinarius was the single genus of the family Cortinariaceae, despite several attempts to provide a workable, lower-rank hierarchical structure based on subgenera and sections. The first phylogenomic study for this group elevated the old genus Cortinarius to family level and the family was split into ten genera, of which seven were described as new. Here, by careful re-examination of the recently published phylogenomic dataset, we detected extensive gene-tree/species-tree conflicts using both concatenation and multispecies coalescent approaches. Our analyses demonstrate that the Cortinarius phylogeny remains unresolved and the resulting phylogenomic hypotheses suffer from very short and unsupported branches in the backbone. We can confirm monophyly of only four out of ten suggested new genera, leaving uncertain the relationships between each other and the general branching order. Thorough exploration of the tree space demonstrated that the topology on which the revised Cortinarius classification relies does not represent the best phylogenetic hypothesis and should not be used as a constrained topology to include additional species. For this reason, we argue that based on available evidence the genus Cortinarius should not (yet) be split. Moreover, considering that phylogenetic uncertainty translates to taxonomic uncertainty, we advise careful evaluation of phylogenomic datasets before proposing radical taxonomic and nomenclatural changes.
Affiliation(s)
- Brigida Gallone
  - Naturalis Biodiversity Center, Darwinweg 2, 2333 CR, Leiden, The Netherlands
- Thomas W Kuyper
  - Naturalis Biodiversity Center, Darwinweg 2, 2333 CR, Leiden, The Netherlands
  - Soil Biology Group, Wageningen University, 6700 AA, Wageningen, The Netherlands
- Jorinde Nuytinck
  - Naturalis Biodiversity Center, Darwinweg 2, 2333 CR, Leiden, The Netherlands
  - Research Group Mycology, Department of Biology, Ghent University, K.L. Ledeganckstraat 35, 9000, Ghent, Belgium
2
McKibben MTW, Finch G, Barker MS. Species-tree topology impacts the inference of ancient whole-genome duplications across the angiosperm phylogeny. American Journal of Botany 2024; 111:e16378. [PMID: 39039654 DOI: 10.1002/ajb2.16378] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 06/11/2024] [Accepted: 06/12/2024] [Indexed: 07/24/2024]
Abstract
PREMISE: The history of angiosperms is marked by repeated rounds of ancient whole-genome duplications (WGDs). Here we used state-of-the-art methods to provide an up-to-date view of the distribution of WGDs in the history of angiosperms that considers both uncertainty introduced by different WGD inference methods and different underlying species-tree hypotheses. METHODS: We used the distribution of synonymous divergences (Ks) of paralogs and orthologs from transcriptomic and genomic data to infer and place WGDs across two hypothesized angiosperm phylogenies. We further tested these WGD hypotheses with syntenic inferences and Bayesian models of duplicate gene gain and loss. RESULTS: The predicted number of WGDs in the history of angiosperms (~170) based on the current taxon sampling is largely similar across different inference methods, but varies in the precise placement of WGDs on the phylogeny. Ks-based methods often yield alternative hypothesized WGD placements due to variation in substitution rates among lineages. Phylogenetic models of duplicate gene gain and loss are more robust to topological variation. However, errors in species-tree inference can still produce spurious WGD hypotheses, regardless of the method used. CONCLUSIONS: Here we showed that different WGD inference methods largely agree on an average of 3.5 WGDs in the history of individual angiosperm species. However, the precise placement of WGDs on the phylogeny is subject to the WGD inference method and tree topology. As researchers continue to test hypotheses regarding the impacts ancient WGDs have on angiosperm evolution, it is important to consider the uncertainty of the phylogeny as well as WGD inference methods.
Affiliation(s)
- Michael T W McKibben
  - Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ, USA
- Geoffrey Finch
  - Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ, USA
- Michael S Barker
  - Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ, USA
3
Mirarab S, Rivas-González I, Feng S, Stiller J, Fang Q, Mai U, Hickey G, Chen G, Brajuka N, Fedrigo O, Formenti G, Wolf JBW, Howe K, Antunes A, Schierup MH, Paten B, Jarvis ED, Zhang G, Braun EL. A region of suppressed recombination misleads neoavian phylogenomics. Proc Natl Acad Sci U S A 2024; 121:e2319506121. [PMID: 38557186 PMCID: PMC11009670 DOI: 10.1073/pnas.2319506121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 02/07/2024] [Indexed: 04/04/2024] Open
Abstract
Genomes are typically mosaics of regions with different evolutionary histories. When speciation events are closely spaced in time, recombination makes the regions sharing the same history small, and the evolutionary history changes rapidly as we move along the genome. When examining rapid radiations such as the early diversification of Neoaves 66 Mya, typically no consistent history is observed across segments exceeding kilobases of the genome. Here, we report an exception. We found that a 21-Mb region in avian genomes, mapped to chicken chromosome 4, shows an extremely strong and discordance-free signal for a history different from that of the inferred species tree. Such a strong discordance-free signal, indicative of suppressed recombination across many millions of base pairs, is not observed elsewhere in the genome for any deep avian relationships. Although long regions with suppressed recombination have been documented in recently diverged species, our results pertain to relationships dating circa 65 Mya. We provide evidence that this strong signal may be due to an ancient rearrangement that blocked recombination and remained polymorphic for several million years prior to fixation. We show that the presence of this region has misled previous phylogenomic efforts with lower taxon sampling, showing the interplay between taxon and locus sampling. We predict that similar ancient rearrangements may confound phylogenetic analyses in other clades, pointing to a need for new analytical models that incorporate the possibility of such events.
Affiliation(s)
- Siavash Mirarab
  - Electrical and Computer Engineering Department, University of California, San Diego, CA 95032
- Shaohong Feng
  - Center for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, Hangzhou 310058, China
  - Liangzhu Laboratory, Zhejiang University, Hangzhou 311121, China
- Josefin Stiller
  - Section for Ecology & Evolution, Department of Biology, University of Copenhagen, København 2100, Denmark
- Qi Fang
  - BGI-Research, Shenzhen 518083, China
- Uyen Mai
  - Electrical and Computer Engineering Department, University of California, San Diego, CA 95032
- Glenn Hickey
  - Genomics Institute, University of California, Santa Cruz, CA 96064
- Guangji Chen
  - Center for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, Hangzhou 310058, China
  - Liangzhu Laboratory, Zhejiang University, Hangzhou 311121, China
- Nadolina Brajuka
  - Vertebrate Genome Lab, Rockefeller University, New York, NY 10065
- Olivier Fedrigo
  - Vertebrate Genome Lab, Rockefeller University, New York, NY 10065
- Giulio Formenti
  - Vertebrate Genome Lab, Rockefeller University, New York, NY 10065
- Jochen B. W. Wolf
  - Division of Evolutionary Biology, Faculty of Biology, Ludwig-Maximilians-Universität, Munich 82152, Germany
- Kerstin Howe
  - Tree of Life Division, Wellcome Sanger Institute, Cambridge CB10 1RQ, United Kingdom
- Agostinho Antunes
  - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Porto 4099-002, Portugal
  - Department of Biology, Faculty of Sciences, University of Porto, Porto 4099-002, Portugal
- Benedict Paten
  - Genomics Institute, University of California, Santa Cruz, CA 96064
- Erich D. Jarvis
  - Vertebrate Genome Lab, Rockefeller University, New York, NY 10065
- Guojie Zhang
  - Center for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, Hangzhou 310058, China
- Edward L. Braun
  - Department of Biology, University of Florida, Gainesville, FL 32611
4
Tabatabaee Y, Roch S, Warnow T. QR-STAR: A Polynomial-Time Statistically Consistent Method for Rooting Species Trees Under the Coalescent. J Comput Biol 2023; 30:1146-1181. [PMID: 37902986 DOI: 10.1089/cmb.2023.0185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023] Open
Abstract
We address the problem of rooting an unrooted species tree given a set of unrooted gene trees, under the assumption that gene trees evolve within the model species tree under the multispecies coalescent (MSC) model. Quintet Rooting (QR) is a polynomial-time algorithm that was recently proposed for this problem, which is based on the theory developed by Allman, Degnan, and Rhodes that proves the identifiability of rooted 5-taxon trees from unrooted gene trees under the MSC. However, although QR had good accuracy in simulations, its statistical consistency was left as an open problem. We present QR-STAR, a variant of QR with an additional step and a different cost function, and prove that it is statistically consistent under the MSC. Moreover, we derive sample complexity bounds for QR-STAR and show that a particular variant of it based on "short quintets" has polynomial sample complexity. Finally, our simulation study under a variety of model conditions shows that QR-STAR matches or improves on the accuracy of QR. QR-STAR is available in open-source form on GitHub.
Affiliation(s)
- Yasamin Tabatabaee
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Sebastien Roch
  - Department of Mathematics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
5
DeSalle R, Narechania A, Tessler M. Multiple Outgroups Can Cause Random Rooting in Phylogenomics. Mol Phylogenet Evol 2023; 184:107806. [PMID: 37172862 DOI: 10.1016/j.ympev.2023.107806] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/08/2022] [Revised: 02/06/2023] [Accepted: 04/26/2023] [Indexed: 05/15/2023]
Abstract
Outgroup selection has been a major challenge since the rise of phylogenetics, and it has remained so in the phylogenomic era. Our goal here is to use large phylogenomic animal datasets to examine the impact of outgroup selection on the final topology. The results of our analyses further solidify the fact that distant outgroups can cause random rooting, and that this holds for concatenated and coalescent-based methods. The results also indicate that the standard practice of using multiple outgroups often causes random rooting. Most researchers go out of their way to get multiple outgroups, as this has been standard practice for decades. Based on our findings, this practice should stop. Instead, our results suggest that a single (most closely) related relative should be selected as the outgroup, unless all outgroups are roughly equally closely related to the ingroup.
Affiliation(s)
- Rob DeSalle
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA; Division of Invertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA
- Apurva Narechania
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA
- Michael Tessler
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA; Division of Invertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA; St. Francis College, Department of Biology, Brooklyn, NY 11201, USA
6
Hill M, Legried B, Roch S. Species tree estimation under joint modeling of coalescence and duplication: Sample complexity of quartet methods. Ann Appl Probab 2022. [DOI: 10.1214/22-aap1799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Affiliation(s)
- Max Hill
  - Department of Mathematics, University of Wisconsin–Madison
- Sebastien Roch
  - Department of Mathematics, University of Wisconsin–Madison
7
Zhang C, Mirarab S. Weighting by Gene Tree Uncertainty Improves Accuracy of Quartet-based Species Trees. Mol Biol Evol 2022; 39:6750035. [PMID: 36201617 PMCID: PMC9750496 DOI: 10.1093/molbev/msac215] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 02/08/2022] [Revised: 09/20/2022] [Accepted: 10/03/2022] [Indexed: 01/07/2023] Open
Abstract
Phylogenomic analyses routinely estimate species trees using methods that account for gene tree discordance. However, the most scalable species tree inference methods, which summarize independently inferred gene trees to obtain a species tree, are sensitive to hard-to-avoid errors introduced in the gene tree estimation step. This dilemma has created much debate on the merits of concatenation versus summary methods and practical obstacles to using summary methods more widely and to the exclusion of concatenation. The most successful attempt at making summary methods resilient to noisy gene trees has been contracting low support branches from the gene trees. Unfortunately, this approach requires arbitrary thresholds and poses new challenges. Here, we introduce threshold-free weighting schemes for the quartet-based species tree inference, the metric used in the popular method ASTRAL. By reducing the impact of quartets with low support or long terminal branches (or both), weighting provides stronger theoretical guarantees and better empirical performance than the unweighted ASTRAL. Our simulations show that weighting improves accuracy across many conditions and reduces the gap with concatenation in conditions with low gene tree discordance and high noise. On empirical data, weighting improves congruence with concatenation and increases support. Together, our results show that weighting, enabled by a new optimization algorithm we introduce, improves the utility of summary methods and can reduce the incongruence often observed across analytical pipelines.
Affiliation(s)
- Chao Zhang
  - Bioinformatics and Systems Biology, UC San Diego, La Jolla, CA, USA
8
Chan YB, Li Q, Scornavacca C. The large-sample asymptotic behaviour of quartet-based summary methods for species tree inference. J Math Biol 2022; 85:22. [PMID: 35976512 PMCID: PMC9385842 DOI: 10.1007/s00285-022-01786-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2022] [Revised: 06/08/2022] [Accepted: 07/14/2022] [Indexed: 12/03/2022]
Abstract
Summary methods seek to infer a species tree from a set of gene trees. A desirable property of such methods is that of statistical consistency; that is, the probability of inferring the wrong species tree (the error probability) tends to 0 as the number of input gene trees becomes large. A popular paradigm is to infer a species tree that agrees with the maximum number of quartets from the input set of gene trees; this has been proved to be statistically consistent under several models of gene evolution. In this paper, we study the asymptotic behaviour of the error probability of such methods in this limit, and show that it decays exponentially. For a 4-taxon species tree, we derive a closed form for the asymptotic behaviour in terms of the probability that the gene evolution process produces the correct topology. We also derive bounds for the sample complexity (the number of gene trees required to infer the true species tree with a given probability), which outperform existing bounds. We then extend our results to bounds for the asymptotic behaviour of the error probability for any species tree, and compare these to the true error probability for some model species trees using simulations.
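Note: the 4-taxon case described in this abstract can be illustrated numerically. The sketch below is not the authors' derivation. It assumes the standard multispecies coalescent result that an unrooted 4-taxon gene tree matches the species tree with probability 1 - (2/3)e^(-t), where t is the internal branch length in coalescent units, and it assumes a simple plurality vote over k gene trees with ties counted as errors. The function names are hypothetical.

```python
from math import comb, exp


def p_correct_quartet(t: float) -> float:
    """Probability that an unrooted 4-taxon gene tree matches the species tree
    under the multispecies coalescent (t = internal branch, coalescent units)."""
    return 1.0 - (2.0 / 3.0) * exp(-t)


def plurality_error(k: int, t: float) -> float:
    """Exact probability that the matching topology is NOT the strict plurality
    winner among k independent gene trees (assumption: ties count as errors)."""
    p = p_correct_quartet(t)
    q = (1.0 - p) / 2.0  # each of the two alternative topologies
    error = 0.0
    for a in range(k + 1):              # gene trees with the matching topology
        for b in range(k - a + 1):      # gene trees with the first alternative
            c = k - a - b               # gene trees with the second alternative
            if a > b and a > c:
                continue                # matching topology wins outright
            # Multinomial probability of the count vector (a, b, c).
            error += comb(k, a) * comb(k - a, b) * p**a * q**b * q**c
    return error


if __name__ == "__main__":
    # Print the error for a doubling sequence of k to inspect how fast it shrinks.
    for k in (10, 20, 40, 80):
        print(k, plurality_error(k, t=0.5))
```

Printing the error for a doubling sequence of k, as in the `__main__` block, is one way to inspect the exponential decay the abstract refers to. The exact enumeration is quadratic in k, which is adequate for a sketch of this size.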
Affiliation(s)
- Yao-Ban Chan
  - School of Mathematics and Statistics / Melbourne Integrative Genomics, The University of Melbourne, Melbourne, 3010, VIC, Australia
- Qiuyi Li
  - School of Mathematics and Statistics / Melbourne Integrative Genomics, The University of Melbourne, Melbourne, 3010, VIC, Australia
- Celine Scornavacca
  - Institut des Sciences de l'Evolution, Université Montpellier, CNRS, EPHE, IRD, Montpellier, 34095, France
9
Kim A, Degnan JH. Heuristics for unrooted, unranked, and ranked anomaly zones under birth-death models. Mol Phylogenet Evol 2021; 161:107162. [PMID: 33831548 DOI: 10.1016/j.ympev.2021.107162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2019] [Revised: 10/21/2020] [Accepted: 03/23/2021] [Indexed: 10/21/2022]
Abstract
Species trees that can generate a nonmatching gene tree topology that is more probable than the topology matching the species tree are said to be in an anomaly zone. We introduce some heuristic approaches to infer whether species trees are in anomaly zones when it is difficult or impossible to compute the entire distribution of gene tree topologies. Here, probabilities of unrooted, unranked, and ranked gene tree topologies under the multispecies coalescent are used. A ranked tree can be viewed as an unranked tree with a temporal ordering of its internal nodes. Overall, considering probabilities of unrooted or unranked gene tree topologies within one nearest neighbor interchange from the species tree topology is a reasonable heuristic to infer the existence of anomalous unrooted or unranked gene trees, respectively. We investigated a test proposed by Linkem et al. (2016) which classifies a species tree as being in an unranked anomaly zone if there is a subset of four taxa in an unranked anomaly zone. We find this test to have high true positive rates, but it can also have high false positive rates. For ranked trees, because at least one of the most probable ranked gene tree topologies must have the same unranked topology as the species tree, we propose to use only those ranked gene trees that have topologies that match the unranked species tree topology. We find that the probability that the species tree is in unrooted and unranked anomaly zones tends to increase with the speciation rate, and the probability of all three types of anomaly zones increases rapidly with the number of taxa. We find that probabilities that species trees are in an anomaly zone can be quite high for moderately high speciation rates.
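Note: the four-taxon check underlying the Linkem et al. (2016) test mentioned above can be sketched as follows. The abstract does not give the formula. The boundary function a(x) used here is the one usually attributed to Degnan and Rosenberg (2006) for a rooted four-taxon caterpillar tree, and is an assumption of this sketch, as are the function names. Here x is the length of the deeper internal branch and y the length of its descendant internal branch, both in coalescent units.

```python
from math import exp, log


def anomaly_zone_limit(x: float) -> float:
    """Boundary a(x) of the unranked anomaly zone for a 4-taxon caterpillar tree
    (formula assumed from Degnan & Rosenberg 2006, not from this abstract)."""
    if x <= 0.0:
        raise ValueError("branch length x must be positive")
    return log(2.0 / 3.0
               + (3.0 * exp(2.0 * x) - 2.0) / (18.0 * (exp(3.0 * x) - exp(2.0 * x))))


def in_anomaly_zone(x: float, y: float) -> bool:
    """True if the pair of consecutive internal branches (x parent, y child)
    falls in the anomaly zone, i.e. y < a(x)."""
    return y < anomaly_zone_limit(x)


if __name__ == "__main__":
    # Two short consecutive branches versus a longer parent branch.
    print(in_anomaly_zone(0.1, 0.01))
    print(in_anomaly_zone(0.5, 0.1))
```

Under this formula a(x) becomes negative once x exceeds roughly 0.27 coalescent units, so only pairs of short consecutive internal branches can satisfy the condition.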
Affiliation(s)
- Anastasiia Kim
  - Department of Mathematics and Statistics, University of New Mexico, United States
- James H Degnan
  - Department of Mathematics and Statistics, University of New Mexico, United States
10
Dibaeinia P, Tabe-Bordbar S, Warnow T. FASTRAL: Improving scalability of phylogenomic analysis. Bioinformatics 2021; 37:2317-2324. [PMID: 33576396 PMCID: PMC8388037 DOI: 10.1093/bioinformatics/btab093] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/28/2020] [Revised: 02/02/2021] [Accepted: 02/04/2021] [Indexed: 01/22/2023] Open
Abstract
MOTIVATION: ASTRAL is the current leading method for species tree estimation from phylogenomic datasets (i.e., hundreds to thousands of genes) that addresses gene tree discord resulting from incomplete lineage sorting (ILS). ASTRAL is statistically consistent under the multispecies coalescent (MSC) model, runs in polynomial time, and is able to run on large datasets. Key to ASTRAL's algorithm is the use of dynamic programming to find an optimal solution to the MQSST (maximum quartet support supertree) within a constraint space that it computes from the input. Yet, ASTRAL can fail to complete within reasonable timeframes on large datasets with many genes and species, because in these cases the constraint space it computes is too large. RESULTS: Here we introduce FASTRAL, a phylogenomic estimation method. FASTRAL is based on ASTRAL, but uses a different technique for constructing the constraint space. The technique we use to define the constraint space maintains statistical consistency and is polynomial time; thus we prove that FASTRAL is a polynomial time algorithm that is statistically consistent under the MSC. Our performance study on both biological and simulated data sets demonstrates that FASTRAL matches or improves on ASTRAL with respect to species tree topology accuracy (and under high ILS conditions it is statistically significantly more accurate), while being dramatically faster, especially on datasets with large numbers of genes and high ILS, due to using a significantly smaller constraint space. AVAILABILITY: FASTRAL is available in open-source form at https://github.com/PayamDiba/FASTRAL. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Payam Dibaeinia
  - Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
- Shayan Tabe-Bordbar
  - Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois, Urbana, IL 61801, USA (to whom correspondence should be addressed)
11
Le T, Sy A, Molloy EK, Zhang Q, Rao S, Warnow T. Using Constrained-INC for Large-Scale Gene Tree and Species Tree Estimation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2-15. [PMID: 32750844 DOI: 10.1109/tcbb.2020.2990867] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
Incremental tree building (INC) is a new phylogeny estimation method that has been proven to be absolute fast converging under standard sequence evolution models. A variant of INC, called Constrained-INC, is designed for use in divide-and-conquer pipelines for phylogeny estimation where a set of species is divided into disjoint subsets, trees are computed on the subsets using a selected base method, and then the subset trees are combined together. We evaluate the accuracy of INC and Constrained-INC for gene tree and species tree estimation on simulated datasets, and compare them to similar pipelines using NJMerge (another method that merges disjoint trees). For gene tree estimation, we find that INC has very poor accuracy in comparison to standard methods, and even Constrained-INC (using maximum likelihood methods to compute constraint trees) does not match the accuracy of the better maximum likelihood methods. Results for species trees are somewhat different, with Constrained-INC coming close to the accuracy of the best species tree estimation methods, while being much faster; furthermore, using Constrained-INC allows species tree estimation methods to scale to large datasets within limited computational resources. Overall, this study exposes the benefits and limitations of divide-and-conquer strategies for large-scale phylogenetic tree estimation.
12
Zhang C, Scornavacca C, Molloy EK, Mirarab S. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Mol Biol Evol 2020; 37:3292-3307. [PMID: 32886770 PMCID: PMC7751180 DOI: 10.1093/molbev/msaa139] [Citation(s) in RCA: 80] [Impact Index Per Article: 20.0] [Indexed: 12/18/2022] Open
Abstract
Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programming. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.
Affiliation(s)
- Chao Zhang
  - Bioinformatics and Systems Biology, University of California San Diego, San Diego, CA
- Erin K Molloy
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL
- Siavash Mirarab
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA
13
Alda F, Tagliacollo VA, Bernt MJ, Waltz BT, Ludt WB, Faircloth BC, Alfaro ME, Albert JS, Chakrabarty P. Resolving Deep Nodes in an Ancient Radiation of Neotropical Fishes in the Presence of Conflicting Signals from Incomplete Lineage Sorting. Syst Biol 2018; 68:573-593. [DOI: 10.1093/sysbio/syy085] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 06/12/2018] [Revised: 11/30/2018] [Accepted: 12/03/2018] [Indexed: 12/13/2022] Open
Affiliation(s)
- Fernando Alda
  - Museum of Natural Science, Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
  - Department of Biology, Geology and Environmental Science, University of Tennessee at Chattanooga, Chattanooga, TN 37403, USA
- Victor A Tagliacollo
  - Museu de Zoologia da Universidade de São Paulo (MZUSP), Ipiranga, 04263-000, São Paulo, São Paulo, Brazil
- Maxwell J Bernt
  - Department of Biology, University of Louisiana at Lafayette, Lafayette, LA 70503, USA
- Brandon T Waltz
  - Department of Biology, University of Louisiana at Lafayette, Lafayette, LA 70503, USA
- William B Ludt
  - Museum of Natural Science, Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
- Brant C Faircloth
  - Museum of Natural Science, Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
- Michael E Alfaro
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095, USA
- James S Albert
  - Department of Biology, University of Louisiana at Lafayette, Lafayette, LA 70503, USA
- Prosanta Chakrabarty
  - Museum of Natural Science, Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
14
Nute M, Chou J, Molloy EK, Warnow T. The performance of coalescent-based species tree estimation methods under models of missing data. BMC Genomics 2018; 19:286. [PMID: 29745854 PMCID: PMC5998899 DOI: 10.1186/s12864-018-4619-8] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND: Estimation of species trees from multiple genes is complicated by processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer, that result in gene trees that differ from each other and from the species phylogeny. Methods to estimate species trees in the presence of gene tree discord due to incomplete lineage sorting have been developed and proved to be statistically consistent when gene tree discord is due only to incomplete lineage sorting and every gene tree includes the full set of species. RESULTS: We establish statistical consistency of certain coalescent-based species tree estimation methods under some models of taxon deletion from genes. We also evaluate the impact of missing data on four species tree estimation methods (ASTRAL-II, ASTRID, MP-EST, and SVDquartets) using simulated datasets with varying levels of incomplete lineage sorting, gene tree estimation error, and degrees/patterns of missing data. CONCLUSIONS: All the species tree estimation methods improved in accuracy as the number of genes increased and often produced highly accurate species trees even when the amount of missing data was large. These results together indicate that accurate species tree estimation is possible under a variety of conditions, even when there are substantial amounts of missing data.
Affiliation(s)
- Michael Nute
  - Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright St., Champaign, IL, 61820 USA
- Jed Chou
  - Department of Mathematics, University of Illinois at Urbana-Champaign, 1409 W. Green St., Urbana, IL, 61801 USA
- Erin K. Molloy
  - Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, 61801 USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, 61801 USA
15
Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics 2018; 19:153. [PMID: 29745866 PMCID: PMC5998893 DOI: 10.1186/s12859-018-2129-y] [Citation(s) in RCA: 1034] [Impact Index Per Article: 172.3] [Indexed: 12/16/2022] Open
Abstract
Background Evolutionary histories can be discordant across the genome, and such discordances need to be considered in reconstructing the species phylogeny. ASTRAL is one of the leading methods for inferring species trees from gene trees while accounting for gene tree discordance. ASTRAL uses dynamic programming to search for the tree that shares the maximum number of quartet topologies with input gene trees, restricting itself to a predefined set of bipartitions. Results We introduce ASTRAL-III, which substantially improves the running time of ASTRAL-II and guarantees polynomial running time as a function of both the number of species (n) and the number of genes (k). ASTRAL-III limits the bipartition constraint set (X) to grow at most linearly with n and k. Moreover, it handles polytomies more efficiently than ASTRAL-II, exploits similarities between gene trees better, and uses several techniques to avoid searching parts of the search space that are mathematically guaranteed not to include the optimal tree. The asymptotic running time of ASTRAL-III in the presence of polytomies is $O\left((nk)^{1.726} D\right)$, where $D = O(nk)$ is the sum of degrees of all unique nodes in input trees. The running time improvements enable us to test whether contracting low support branches in gene trees improves the accuracy by reducing noise. In extensive simulations, we show that removing branches with very low support (e.g., below 10%) improves accuracy while overly aggressive filtering is harmful. We observe on a biological avian phylogenomic dataset of 14K genes that contracting low support branches greatly improves results. Conclusions ASTRAL-III is a faster version of the ASTRAL method for phylogenetic reconstruction and can scale up to 10,000 species. With ASTRAL-III, low support branches can be removed, resulting in improved accuracy. Electronic supplementary material The online version of this article (10.1186/s12859-018-2129-y) contains supplementary material, which is available to authorized users.
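The quartet criterion named in this abstract can be illustrated with a brute-force sketch. This is not ASTRAL's dynamic programming and does not use its bipartition constraint set; it only computes the quantity being maximized, namely the number of quartet topologies a candidate species tree shares with the input gene trees. The nested-tuple tree encoding and the function names `clades`, `quartet_topologies`, and `quartet_score` are assumptions made for this illustration.

```python
from itertools import combinations

# Assumption: a tree is a nested tuple of leaf-name strings,
# e.g. ((("A", "B"), "C"), ("D", "E")). This is an illustrative
# encoding, not an ASTRAL input format.

def clades(tree):
    """Return (leaf set of the tree, leaf sets of all internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def quartet_topologies(tree):
    """Map each resolved 4-taxon subset to its induced split, e.g. AB|CD."""
    leaves, internal = clades(tree)
    topologies = {}
    for quartet in combinations(sorted(leaves), 4):
        q = frozenset(quartet)
        for clade in internal:
            side = clade & q
            if len(side) == 2:  # this clade separates the quartet 2 vs 2
                topologies[q] = frozenset([side, q - side])
                break
        # Quartets left unresolved by a polytomy get no entry.
    return topologies

def quartet_score(species_tree, gene_trees):
    """Count gene-tree quartets that agree with the species tree."""
    species = quartet_topologies(species_tree)
    score = 0
    for gene_tree in gene_trees:
        for q, topology in quartet_topologies(gene_tree).items():
            if species.get(q) == topology:
                score += 1
    return score

species = ((("A", "B"), "C"), ("D", "E"))
genes = [
    ((("A", "B"), "C"), ("D", "E")),  # identical to the species tree
    ((("A", "C"), "B"), ("D", "E")),  # A and C swapped into a cherry
    ("A", "B", "C", "D", "E"),        # fully contracted (star) tree
]
print(quartet_score(species, genes))
```

A fully contracted gene tree contributes nothing to the score in this sketch, which is one way to see why contracting only poorly supported branches removes noisy quartets while aggressive contraction discards signal.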
Affiliation(s)
- Chao Zhang: Bioinformatics and Systems Biology, University of California at San Diego, 9500 Gilman Drive, La Jolla, 92093-0021, CA, USA
- Maryam Rabiee: Department of Computer Science and Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, 92093-0021, CA, USA
- Erfan Sayyari: Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, 92093-0021, CA, USA
- Siavash Mirarab: Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, 92093-0021, CA, USA
|
16
|
Sayyari E, Mirarab S. Testing for Polytomies in Phylogenetic Species Trees Using Quartet Frequencies. Genes (Basel) 2018; 9:E132. [PMID: 29495636 PMCID: PMC5867853 DOI: 10.3390/genes9030132] [Citation(s) in RCA: 78] [Impact Index Per Article: 13.0] [Received: 12/01/2017] [Revised: 01/30/2018] [Accepted: 02/16/2018] [Indexed: 12/23/2022]
Abstract
Phylogenetic species trees typically represent the speciation history as a bifurcating tree. Speciation events that simultaneously create more than two descendants, thereby creating polytomies in the phylogeny, are possible. Moreover, the inability to resolve relationships is often shown as a (soft) polytomy. Both types of polytomies have been traditionally studied in the context of gene tree reconstruction from sequence data. However, polytomies in the species tree cannot be detected or ruled out without considering gene tree discordance. In this paper, we describe a statistical test based on properties of the multi-species coalescent model to test the null hypothesis that a branch in an estimated species tree should be replaced by a polytomy. On both simulated and biological datasets, we show that the null hypothesis is rejected for all but the shortest branches, and in most cases, it is retained for true polytomies. The test, available as part of the Accurate Species TRee ALgorithm (ASTRAL) package, can help systematists decide whether their datasets are sufficient to resolve specific relationships of interest.
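The idea behind such a test can be sketched as follows. Under the multi-species coalescent, the three possible quartet topologies around a true polytomy are expected to appear in gene trees with equal frequency (1/3 each), so observed counts can be compared against that expectation. The sketch below assumes a Pearson chi-square goodness-of-fit statistic with 2 degrees of freedom; the abstract does not specify the statistic, and the function name `polytomy_p_value` and the example counts are hypothetical. It is not the implementation shipped with ASTRAL.

```python
import math

def polytomy_p_value(n1, n2, n3):
    """P-value for the null hypothesis that a branch is a polytomy.

    n1, n2, n3 are the numbers of gene-tree quartets supporting each
    of the three alternative topologies around the branch.
    """
    total = n1 + n2 + n3
    if total == 0:
        raise ValueError("no quartets observed around this branch")
    expected = total / 3  # equal frequencies under the null
    chi2 = sum((n - expected) ** 2 / expected for n in (n1, n2, n3))
    # For 2 degrees of freedom the chi-square survival function
    # has the closed form exp(-x / 2).
    return math.exp(-chi2 / 2)

# Hypothetical counts: one topology dominates, so the null is rejected.
print(polytomy_p_value(100, 20, 20) < 0.05)
# Hypothetical counts: nearly equal, so the polytomy is not rejected.
print(polytomy_p_value(35, 33, 32) > 0.05)
```

A small p-value suggests the data support a resolved branch; a large one means the dataset cannot distinguish the branch from a polytomy, which matches the abstract's point that the null is retained mainly for the shortest branches.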
Affiliation(s)
- Erfan Sayyari: Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Siavash Mirarab: Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
|
17
|
ASTRAL-III: Increased Scalability and Impacts of Contracting Low Support Branches. Comparative Genomics 2017. [DOI: 10.1007/978-3-319-67979-2_4] [Citation(s) in RCA: 97] [Impact Index Per Article: 13.9] [Indexed: 12/05/2022]
|