1
|
Le T, Sy A, Molloy EK, Zhang Q, Rao S, Warnow T. Using Constrained-INC for Large-Scale Gene Tree and Species Tree Estimation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2-15. [PMID: 32750844 DOI: 10.1109/tcbb.2020.2990867] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Incremental tree building (INC) is a new phylogeny estimation method that has been proven to be absolute fast converging under standard sequence evolution models. A variant of INC, called Constrained-INC, is designed for use in divide-and-conquer pipelines for phylogeny estimation where a set of species is divided into disjoint subsets, trees are computed on the subsets using a selected base method, and then the subset trees are combined together. We evaluate the accuracy of INC and Constrained-INC for gene tree and species tree estimation on simulated datasets, and compare it to similar pipelines using NJMerge (another method that merges disjoint trees). For gene tree estimation, we find that INC has very poor accuracy in comparison to standard methods, and even Constrained-INC(using maximum likelihood methods to compute constraint trees) does not match the accuracy of the better maximum likelihood methods. Results for species trees are somewhat different, with Constrained-INC coming close to the accuracy of the best species tree estimation methods, while being much faster; furthermore, using Constrained-INC allows species tree estimation methods to scale to large datasets within limited computational resources. Overall, this study exposes the benefits and limitations of divide-and-conquer strategies for large-scale phylogenetic tree estimation.
Collapse
|
2
|
Bhattacharjee A, Bayzid MS. Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices. BMC Genomics 2020; 21:497. [PMID: 32689946 PMCID: PMC7370488 DOI: 10.1186/s12864-020-06892-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2020] [Accepted: 07/07/2020] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean) require complete distance matrices without any missing data. RESULTS We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and autoencoder based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternate distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data. CONCLUSIONS This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances .
Collapse
Affiliation(s)
- Ananya Bhattacharjee
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205 Bangladesh
- Department of Computer Science and Engineering, Eastern University, Dhaka, Bangladesh
| | - Md. Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205 Bangladesh
| |
Collapse
|
3
|
Abstract
Background Phylogeny estimation is an important part of much biological research, but large-scale tree estimation is infeasible using standard methods due to computational issues. Recently, an approach to large-scale phylogeny has been proposed that divides a set of species into disjoint subsets, computes trees on the subsets, and then merges the trees together using a computed matrix of pairwise distances between the species. The novel component of these approaches is the last step: Disjoint Tree Merger (DTM) methods. Results We present GTM (Guide Tree Merger), a polynomial time DTM method that adds edges to connect the subset trees, so as to provably minimize the topological distance to a computed guide tree. Thus, GTM performs unblended mergers, unlike the previous DTM methods. Yet, despite the potential limitation, our study shows that GTM has excellent accuracy, generally matching or improving on two previous DTMs, and is much faster than both. Conclusions The proposed GTM approach to the DTM problem is a useful new tool for large-scale phylogenomic analysis, and shows the surprising potential for unblended DTM methods.
Collapse
Affiliation(s)
- Vladimir Smirnov
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N Goodwin Ave, Urbana, 61801, IL, US
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N Goodwin Ave, Urbana, 61801, IL, US.
| |
Collapse
|
4
|
Islam M, Sarker K, Das T, Reaz R, Bayzid MS. STELAR: a statistically consistent coalescent-based species tree estimation method by maximizing triplet consistency. BMC Genomics 2020; 21:136. [PMID: 32039704 PMCID: PMC7011378 DOI: 10.1186/s12864-020-6519-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2019] [Accepted: 01/20/2020] [Indexed: 12/14/2022] Open
Abstract
Background Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, estimating a species tree from a collection of gene trees can be complicated due to the presence of gene tree incongruence resulting from incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent process. Maximum likelihood and Bayesian MCMC methods can potentially result in accurate trees, but they do not scale well to large datasets. Results We present STELAR (Species Tree Estimation by maximizing tripLet AgReement), a new fast and highly accurate statistically consistent coalescent-based method for estimating species trees from a collection of gene trees. We formalized the constrained triplet consensus (CTC) problem and showed that the solution to the CTC problem is a statistically consistent estimate of the species tree under the multi-species coalescent (MSC) model. STELAR is an efficient dynamic programming based solution to the CTC problem which is highly accurate and scalable. We evaluated the accuracy of STELAR in comparison with SuperTriplets, which is an alternate fast and highly accurate triplet-based supertree method, and with MP-EST and ASTRAL – two of the most popular and accurate coalescent-based methods. Experimental results suggest that STELAR matches the accuracy of ASTRAL and improves on MP-EST and SuperTriplets. Conclusions Theoretical and empirical results (on both simulated and real biological datasets) suggest that STELAR is a valuable technique for species tree estimation from gene tree distributions.
Collapse
Affiliation(s)
- Mazharul Islam
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Kowshika Sarker
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Trisha Das
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Rezwana Reaz
- Department of Computer Science, The University of Texas at Austin, Texas, 78712, USA
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh.
| |
Collapse
|
5
|
Data, time and money: evaluating the best compromise for inferring molecular phylogenies of non-model animal taxa. Mol Phylogenet Evol 2020; 142:106660. [DOI: 10.1016/j.ympev.2019.106660] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2019] [Revised: 10/17/2019] [Accepted: 10/17/2019] [Indexed: 12/15/2022]
|
6
|
Molloy EK, Warnow T. Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge. Algorithms Mol Biol 2019; 14:14. [PMID: 31360216 PMCID: PMC6642500 DOI: 10.1186/s13015-019-0151-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2019] [Accepted: 06/13/2019] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Divide-and-conquer methods, which divide the species set into overlapping subsets, construct a tree on each subset, and then combine the subset trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of such approaches. RESULTS In this paper, we introduce a divide-and-conquer approach that does not require supertree estimation: we divide the species set into pairwise disjoint subsets, construct a tree on each subset using a base method, and then combine the subset trees using a distance matrix. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of Neighbor Joining (NJ); thus, NJMerge can be viewed either as a method for improving traditional NJ or as a method for scaling the base method to larger datasets. We prove that NJMerge can be used to create divide-and-conquer pipelines that are statistically consistent under some models of evolution. We also report the results of an extensive simulation study evaluating NJMerge on multi-locus datasets with up to 1000 species. We found that NJMerge sometimes improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRAL-III, SVDquartets, and "concatenation" using RAxML) without sacrificing accuracy. Finally, although NJMerge can fail to return a tree, in our experiments, NJMerge failed on only 11 out of 2560 test cases. CONCLUSIONS Theoretical and empirical results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on Github (http://github.com/ekmolloy/njmerge).
Collapse
Affiliation(s)
- Erin K. Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801 USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801 USA
| |
Collapse
|
7
|
Eiserhardt WL, Antonelli A, Bennett DJ, Botigué LR, Burleigh JG, Dodsworth S, Enquist BJ, Forest F, Kim JT, Kozlov AM, Leitch IJ, Maitner BS, Mirarab S, Piel WH, Pérez-Escobar OA, Pokorny L, Rahbek C, Sandel B, Smith SA, Stamatakis A, Vos RA, Warnow T, Baker WJ. A roadmap for global synthesis of the plant tree of life. AMERICAN JOURNAL OF BOTANY 2018; 105:614-622. [PMID: 29603138 DOI: 10.1002/ajb2.1041] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/13/2017] [Accepted: 11/08/2017] [Indexed: 06/08/2023]
Abstract
Providing science and society with an integrated, up-to-date, high quality, open, reproducible and sustainable plant tree of life would be a huge service that is now coming within reach. However, synthesizing the growing body of DNA sequence data in the public domain and disseminating the trees to a diverse audience are often not straightforward due to numerous informatics barriers. While big synthetic plant phylogenies are being built, they remain static and become quickly outdated as new data are published and tree-building methods improve. Moreover, the body of existing phylogenetic evidence is hard to navigate and access for non-experts. We propose that our community of botanists, tree builders, and informaticians should converge on a modular framework for data integration and phylogenetic analysis, allowing easy collaboration, updating, data sourcing and flexible analyses. With support from major institutions, this pipeline should be re-run at regular intervals, storing trees and their metadata long-term. Providing the trees to a diverse global audience through user-friendly front ends and application development interfaces should also be a priority. Interactive interfaces could be used to solicit user feedback and thus improve data quality and to coordinate the generation of new data. We conclude by outlining a number of steps that we suggest the scientific community should take to achieve global phylogenetic synthesis.
Collapse
Affiliation(s)
- Wolf L Eiserhardt
- Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
- Department of Bioscience, Aarhus University, Ny Munkegade 116, 8000, Aarhus C, Denmark
| | - Alexandre Antonelli
- Gothenburg Global Biodiversity Centre, Box 461, 405 30, Gothenburg, Sweden
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, 405 30, Gothenburg, Sweden
- Gothenburg Botanical Garden, Carl Skottsbergs Gata 22B, SE-413 19, Gothenburg, Sweden
| | - Dominic J Bennett
- Gothenburg Global Biodiversity Centre, Box 461, 405 30, Gothenburg, Sweden
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, 405 30, Gothenburg, Sweden
- Gothenburg Botanical Garden, Carl Skottsbergs Gata 22B, SE-413 19, Gothenburg, Sweden
| | | | | | | | - Brian J Enquist
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, 85721, USA
- The Santa Fe Institute, Santa Fe, NM, 87501, USA
| | - Félix Forest
- Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
| | - Jan T Kim
- Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
| | - Alexey M Kozlov
- Scientific Computing Group, Heidelberg Institute for Theoretical Studies, 69118, Heidelberg, Germany
| | - Ilia J Leitch
- Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
| | - Brian S Maitner
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, 85721, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California, San Diego, San Diego, CA, 92093, USA
| | - William H Piel
- Yale-NUS College, 16 College Avenue West, Singapore, 138527, Republic of Singapore
| | | | - Lisa Pokorny
- Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
| | - Carsten Rahbek
- Center for Macroecology, Evolution and Climate, University of Copenhagen, Universitetsparken 15, DK-2100, Copenhagen O, Denmark
- Imperial College London, Silwood Park, Buckhurst Road, Ascot, Berkshire, SL5 7PY, UK
| | - Brody Sandel
- Department of Biology, Santa Clara University, Santa Clara, CA, 95053, USA
| | - Stephen A Smith
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Alexandros Stamatakis
- Scientific Computing Group, Heidelberg Institute for Theoretical Studies, 69118, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, 76128, Karlsruhe, Germany
| | - Rutger A Vos
- Naturalis Biodiversity Center, P.O. Box 9517, 2300RA, Leiden, The Netherlands
- Institute of Biology Leiden, P.O. Box 9505, 2300RA, Leiden, The Netherlands
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | | |
Collapse
|
8
|
Abstract
BACKGROUND Building the evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing evolutionary tree for ultra-large sequences is hard. Massive multiple sequence alignment is also challenging and time/space consuming. Hadoop and Spark are developed recently, which bring spring light for the classical computational biology problems. In this paper, we tried to solve the multiple sequence alignment and evolutionary reconstruction in parallel. RESULTS HPTree, which is developed in this paper, can deal with big DNA sequence files quickly. It works well on the >1GB files, and gets better performance than other evolutionary reconstruction tools. Users could use HPTree for reonstructing evolutioanry trees on the computer clusters or cloud platform (eg. Amazon Cloud). HPTree could help on population evolution research and metagenomics analysis. CONCLUSIONS In this paper, we employ the Hadoop and Spark platform and design an evolutionary tree reconstruction software tool for unaligned massive DNA sequences. Clustering and multiple sequence alignment are done in parallel. Neighbour-joining model was employed for the evolutionary tree building. We opened our software together with source codes via http://lab.malab.cn/soft/HPtree/ .
Collapse
Affiliation(s)
- Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, People's Republic of China
- Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
| | - Shixiang Wan
- School of Computer Science and Technology, Tianjin University, Tianjin, People's Republic of China
| | - Xiangxiang Zeng
- Department of Computer Science, Xiamen University, Xiamen, China.
| | - Zhanshan Sam Ma
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China.
| |
Collapse
|
9
|
Vachaspati P, Warnow T. FastRFS: fast and accurate Robinson-Foulds Supertrees using constrained exact optimization. Bioinformatics 2017; 33:631-639. [PMID: 27663499 PMCID: PMC5870905 DOI: 10.1093/bioinformatics/btw600] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2016] [Accepted: 09/12/2016] [Indexed: 01/22/2023] Open
Abstract
Motivation The estimation of phylogenetic trees is a major part of many biological dataset analyses, but maximum likelihood approaches are NP-hard and Bayesian MCMC methods do not scale well to even moderate-sized datasets. Supertree methods, which are used to construct trees from trees computed on subsets, are critically important tools for enabling the statistical estimation of phylogenies for large and potentially heterogeneous datasets. Supertree estimation is itself NP-hard, and no current supertree method has sufficient accuracy and scalability to provide good accuracy on the large datasets that supertree methods were designed for, containing thousands of species and many subset trees. Results We present FastRFS, a new method based on a dynamic programming method we have developed to find an exact solution to the Robinson-Foulds Supertree problem within a constrained search space. FastRFS has excellent accuracy in terms of criterion scores and topological accuracy of the resultant trees, substantially improving on competing methods on a large collection of biological and simulated data. In addition, FastRFS is extremely fast, finishing in minutes on even very large datasets, and in under an hour on a biological dataset with 2228 species. Availability and Implementation FastRFS is available on github at https://github.com/pranjalv123/FastRFS. Contact warnow@illinois.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Tandy Warnow
- Department of Computer Science.,National Center for Supercomputing Applications.,Department of Bioengineering, University of Illinois, Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
10
|
Molloy EK, Warnow T. To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods. Syst Biol 2017; 67:285-303. [DOI: 10.1093/sysbio/syx077] [Citation(s) in RCA: 138] [Impact Index Per Article: 19.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2016] [Accepted: 09/13/2017] [Indexed: 01/27/2023] Open
Affiliation(s)
- Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| |
Collapse
|
11
|
Bhattacharyya S, Mukherjee J. IDXL: Species Tree Inference Using Internode Distance and Excess Gene Leaf Count. J Mol Evol 2017; 85:57-78. [PMID: 28835989 DOI: 10.1007/s00239-017-9807-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2016] [Accepted: 08/09/2017] [Indexed: 11/28/2022]
Abstract
We propose an extension of the distance matrix methods NJst and ASTRID to infer species trees from incongruent gene trees having Incomplete Lineage Sorting. Both approaches consider the average internode distance (ID) between individual taxa pairs as the distance measure. The measure ID does not use the root of a tree, and thus may not always infer the relative position of a taxon with respect to the root. We define a novel distance measure excess gene leaf count (XL) between individual couplets. The XL measure is computed using the root of a tree. It is proved to be additive, and is shown to infer the relative order of divergence among individual couplets better. We propose a novel method IDXL which uses both the XL and ID measures for species tree construction. IDXL is shown to perform better than NJst and other distance matrix approaches for most of the biological and simulated datasets. Having the same computational complexity as NJst, IDXL can be applied for species tree inference on large-scale biological datasets.
Collapse
Affiliation(s)
- Sourya Bhattacharyya
- Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, WB, 721302, India.
| | - Jayanta Mukherjee
- Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, WB, 721302, India
| |
Collapse
|
12
|
Mirarab S, Warnow T. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 2015; 31:i44-52. [PMID: 26072508 PMCID: PMC4765870 DOI: 10.1093/bioinformatics/btv234] [Citation(s) in RCA: 578] [Impact Index Per Article: 64.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022] Open
Abstract
Motivation: The estimation of species phylogenies requires multiple loci, since different loci can have different trees due to incomplete lineage sorting, modeled by the multi-species coalescent model. We recently developed a coalescent-based method, ASTRAL, which is statistically consistent under the multi-species coalescent model and which is more accurate than other coalescent-based methods on the datasets we examined. ASTRAL runs in polynomial time, by constraining the search space using a set of allowed ‘bipartitions’. Despite the limitation to allowed bipartitions, ASTRAL is statistically consistent. Results: We present a new version of ASTRAL, which we call ASTRAL-II. We show that ASTRAL-II has substantial advantages over ASTRAL: it is faster, can analyze much larger datasets (up to 1000 species and 1000 genes) and has substantially better accuracy under some conditions. ASTRAL’s running time is O(n2k|X|2), and ASTRAL-II’s running time is O(nk|X|2), where n is the number of species, k is the number of loci and X is the set of allowed bipartitions for the search space. Availability and implementation: ASTRAL-II is available in open source at https://github.com/smirarab/ASTRAL and datasets used are available at http://www.cs.utexas.edu/~phylo/datasets/astral2/. Contact:smirarab@gmail.com Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Siavash Mirarab
- Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, USA and Departments of Computer Science and Bioengineering, The University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, USA and Departments of Computer Science and Bioengineering, The University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA
| |
Collapse
|
13
|
Chou J, Gupta A, Yaduvanshi S, Davidson R, Nute M, Mirarab S, Warnow T. A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics 2015; 16 Suppl 10:S2. [PMID: 26449249 PMCID: PMC4602346 DOI: 10.1186/1471-2164-16-s10-s2] [Citation(s) in RCA: 95] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Species tree estimation is challenging in the presence of incomplete lineage sorting (ILS), which can make gene trees different from the species tree. Because ILS is expected to occur and the standard concatenation approach can return incorrect trees with high support in the presence of ILS, "coalescent-based" summary methods (which first estimate gene trees and then combine gene trees into a species tree) have been developed that have theoretical guarantees of robustness to arbitrarily high amounts of ILS. Some studies have suggested that summary methods should only be used on "c-genes" (i.e., recombination-free loci) that can be extremely short (sometimes fewer than 100 sites). However, gene trees estimated on short alignments can have high estimation error, and summary methods tend to have high error on short c-genes. To address this problem, Chifman and Kubatko introduced SVDquartets, a new coalescent-based method. SVDquartets takes multi-locus unlinked single-site data, infers the quartet trees for all subsets of four species, and then combines the set of quartet trees into a species tree using a quartet amalgamation heuristic. Yet, the relative accuracy of SVDquartets to leading coalescent-based methods has not been assessed. RESULTS We compared SVDquartets to two leading coalescent-based methods (ASTRAL-2 and NJst), and to concatenation using maximum likelihood. We used a collection of simulated datasets, varying ILS levels, numbers of taxa, and number of sites per locus. Although SVDquartets was sometimes more accurate than ASTRAL-2 and NJst, most often the best results were obtained using ASTRAL-2, even on the shortest gene sequence alignments we explored (with only 10 sites per locus). Finally, concatenation was the most accurate of all methods under low ILS conditions. CONCLUSIONS ASTRAL-2 generally had the best accuracy under higher ILS conditions, and concatenation had the best accuracy under the lowest ILS conditions. However, SVDquartets was competitive with the best methods under conditions with low ILS and small numbers of sites per locus. The good performance under many conditions of ASTRAL-2 in comparison to SVDquartets is surprising given the known vulnerability of ASTRAL-2 and similar methods to short gene sequences.
Collapse
Affiliation(s)
- Jed Chou
- Department of Mathematics, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Ashu Gupta
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Shashank Yaduvanshi
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Ruth Davidson
- Department of Mathematics, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Mike Nute
- Department of Statistics, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093, USA
- Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA
| |
Collapse
|
14
|
Abstract
BACKGROUND Incomplete lineage sorting (ILS), modelled by the multi-species coalescent (MSC), is known to create discordance between gene trees and species trees, and lead to inaccurate species tree estimations unless appropriate methods are used to estimate the species tree. While many statistically consistent methods have been developed to estimate the species tree in the presence of ILS, only ASTRAL-2 and NJst have been shown to have good accuracy on large datasets. Yet, NJst is generally slower and less accurate than ASTRAL-2, and cannot run on some datasets. RESULTS We have redesigned NJst to enable it to run on all datasets, and we have expanded its design space so that it can be used with different distance-based tree estimation methods. The resultant method, ASTRID, is statistically consistent under the MSC model, and has accuracy that is competitive with ASTRAL-2. Furthermore, ASTRID is much faster than ASTRAL-2, completing in minutes on some datasets for which ASTRAL-2 used hours. CONCLUSIONS ASTRID is a new coalescent-based method for species tree estimation that is competitive with the best current method in terms of accuracy, while being much faster. ASTRID is available in open source form on github.
Collapse
Affiliation(s)
- Pranjal Vachaspati
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Avenue, Urbana, IL, 61801 USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Avenue, Urbana, IL, 61801 USA
| |
Collapse
|
15
|
Roch S, Warnow T. On the Robustness to Gene Tree Estimation Error (or lack thereof) of Coalescent-Based Species Tree Methods. Syst Biol 2015; 64:663-76. [PMID: 25813358 DOI: 10.1093/sysbio/syv016] [Citation(s) in RCA: 95] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2014] [Accepted: 03/20/2015] [Indexed: 11/13/2022] Open
Abstract
The estimation of species trees using multiple loci has become increasingly common. Because different loci can have different phylogenetic histories (reflected in different gene tree topologies) for multiple biological causes, new approaches to species tree estimation have been developed that take gene tree heterogeneity into account. Among these multiple causes, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is potentially the most common cause of gene tree heterogeneity, and much of the focus of the recent literature has been on how to estimate species trees in the presence of ILS. Despite progress in developing statistically consistent techniques for estimating species trees when gene trees can differ due to ILS, there is substantial controversy in the systematics community as to whether to use the new coalescent-based methods or the traditional concatenation methods. One of the key issues that has been raised is understanding the impact of gene tree estimation error on coalescent-based methods that operate by combining gene trees. Here we explore the mathematical guarantees of coalescent-based methods when analyzing estimated rather than true gene trees. Our results provide some insight into the differences between promise of coalescent-based methods in theory and their performance in practice.
Collapse
Affiliation(s)
- Sebastien Roch
- Department of Mathematics, University of Wisconsin at Madison, 480 Lincoln Dr., Madison, Wisconsin, 53706, USA and Departments of Bioengineering and Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Tandy Warnow
- Department of Mathematics, University of Wisconsin at Madison, 480 Lincoln Dr., Madison, Wisconsin, 53706, USA and Departments of Bioengineering and Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| |
Collapse
|
16
|
Zimmermann T, Mirarab S, Warnow T. BBCA: Improving the scalability of *BEAST using random binning. BMC Genomics 2014; 15 Suppl 6:S11. [PMID: 25572469 PMCID: PMC4239591 DOI: 10.1186/1471-2164-15-s6-s11] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
Background Species tree estimation can be challenging in the presence of gene tree conflict due to incomplete lineage sorting (ILS), which can occur when the time between speciation events is short relative to the population size. Of the many methods that have been developed to estimate species trees in the presence of ILS, *BEAST, a Bayesian method that co-estimates the species tree and gene trees given sequence alignments on multiple loci, has generally been shown to have the best accuracy. However, *BEAST is extremely computationally intensive so that it cannot be used with large numbers of loci; hence, *BEAST is not suitable for genome-scale analyses. Results We present BBCA (boosted binned coalescent-based analysis), a method that can be used with *BEAST (and other such co-estimation methods) to improve scalability. BBCA partitions the loci randomly into subsets, uses *BEAST on each subset to co-estimate the gene trees and species tree for the subset, and then combines the newly estimated gene trees together using MP-EST, a popular coalescent-based summary method. We compare time-restricted versions of BBCA and *BEAST on simulated datasets, and show that BBCA is at least as accurate as *BEAST, and achieves better convergence rates for large numbers of loci. Conclusions Phylogenomic analysis using *BEAST is currently limited to datasets with a small number of loci, and analyses with even just 100 loci can be computationally challenging. BBCA uses a very simple divide-and-conquer approach that makes it possible to use *BEAST on datasets containing hundreds of loci. This study shows that BBCA provides excellent accuracy and is highly scalable.
Collapse
|