1
|
Górecki P, Markin A, Eulenstein O. Exact median-tree inference for unrooted reconciliation costs. BMC Evol Biol 2020; 20:136. [PMID: 33115401 PMCID: PMC7593691 DOI: 10.1186/s12862-020-01700-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background Solving median tree problems under tree reconciliation costs is a classic and well-studied approach for inferring species trees from collections of discordant gene trees. These problems are NP-hard, and therefore are, in practice, typically addressed by local search heuristics. So far, however, such heuristics lack any provable correctness or precision. Further, even for small phylogenetic studies, it has been demonstrated that local search heuristics may only provide sub-optimal solutions. Obviating such heuristic uncertainties are exact dynamic programming solutions that allow solving tree reconciliation problems for smaller phylogenetic studies. Despite these promises, such exact solutions are only suitable for credibly rooted input gene trees, which constitute only a tiny fraction of the readily available gene trees. Standard gene tree inference approaches provide only unrooted gene trees and accurately rooting such trees is often difficult, if not impossible. Results Here, we describe complex dynamic programming solutions that represent the first nonnaïve exact solutions for solving the tree reconciliation problems for unrooted input gene trees. Further, we show that the asymptotic runtime of the proposed solutions does not increase when compared to the most time-efficient dynamic programming solutions for rooted input trees. Conclusions In an experimental evaluation, we demonstrate that the described solutions for unrooted gene trees are, like the solutions for rooted input gene trees, suitable for smaller phylogenetic studies. Finally, for the first time, we study the accuracy of classic local search heuristics for unrooted tree reconciliation problems.
Collapse
Affiliation(s)
- Paweł Górecki
- University of Warsaw, Faculty of Mathematics, Informatics and Mechanics, Banacha 2, Warsaw, 02-097, Poland.
| | - Alexey Markin
- Department of Computer Science, Iowa State University, Atanasoff Hall 212, Ames, 50011, USA
| | - Oliver Eulenstein
- Department of Computer Science, Iowa State University, Atanasoff Hall 212, Ames, 50011, USA
| |
Collapse
|
2
|
Markin A, Eulenstein O. Cophenetic Median Trees. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1459-1470. [PMID: 30222583 DOI: 10.1109/tcbb.2018.2870173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Median tree inference under path-difference metrics has shown great promise for large-scale phylogeny estimation. Similar to these metrics is the family of cophenetic metrics that originates from a classic dendrogram comparison method introduced more than 50 years ago. Despite the appeal of this family of metrics, the problem of computing median trees under cophenetic metrics has not been analyzed. Like other standard median tree problems relevant in practice, as we show here, this problem is also NP-hard. NP-hard median tree problems have been successfully addressed by local search heuristics that are solving thousands of instances of a corresponding (local neighborhood) search problem. For the local neighborhood search problem under a cophenetic metric, the best known (naïve) algorithm has a time complexity that is typically prohibitive for effective heuristic searches. Building on the pioneering work on path-difference median trees, we develop efficient algorithms for Manhattan and Euclidean cophenetic search problems that improve on the naïve solution by a linear and a quadratic factor, respectively. We demonstrate the performance and effectiveness of the resulting heuristic methods in a comparative study using benchmark empirical datasets.
Collapse
|
3
|
Markin A, Eulenstein O. Efficient Local Search for Euclidean Path-Difference Median Trees. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1374-1385. [PMID: 29035224 DOI: 10.1109/tcbb.2017.2763137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Synthesizing large-scale phylogenetic trees is a fundamental problem in evolutionary biology. Median tree problems have evolved as a powerful tool to reconstruct such trees. Such problems seek a median tree for a given collection of input trees under some problem-specific tree distance. There has been an increased interest in the median tree problem for the classical path-difference distance between trees. While this problem is NP-hard, standard local search heuristics have been described that are based on solving a local search problem exactly. For a more effective heuristic we devise a time efficient algorithm for the local search problem that improves on the best-known solution by a factor of $n$n, where $n$n is the size of the input trees. Furthermore, we introduce a novel hybrid version of the standard local search that is exploiting our new algorithm for a more refined heuristic search. Finally, we demonstrate the performance of our hybrid heuristic in a comparative study with other commonly used methods that synthesize species trees using published empirical data sets.
Collapse
|
4
|
Whidden C, Zeh N, Beiko RG. Supertrees Based on the Subtree Prune-and-Regraft Distance. Syst Biol 2014; 63:566-81. [PMID: 24695589 PMCID: PMC4055872 DOI: 10.1093/sysbio/syu023] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2013] [Accepted: 03/18/2014] [Indexed: 11/14/2022] Open
Abstract
Supertree methods reconcile a set of phylogenetic trees into a single structure that is often interpreted as a branching history of species. A key challenge is combining conflicting evolutionary histories that are due to artifacts of phylogenetic reconstruction and phenomena such as lateral gene transfer (LGT). Many supertree approaches use optimality criteria that do not reflect underlying processes, have known biases, and may be unduly influenced by LGT. We present the first method to construct supertrees by using the subtree prune-and-regraft (SPR) distance as an optimality criterion. Although calculating the rooted SPR distance between a pair of trees is NP-hard, our new maximum agreement forest-based methods can reconcile trees with hundreds of taxa and>50 transfers in fractions of a second, which enables repeated calculations during the course of an iterative search. Our approach can accommodate trees in which uncertain relationships have been collapsed to multifurcating nodes. Using a series of benchmark datasets simulated under plausible rates of LGT, we show that SPR supertrees are more similar to correct species histories than supertrees based on parsimony or Robinson-Foulds distance criteria. We successfully constructed an SPR supertree from a phylogenomic dataset of 40,631 gene trees that covered 244 genomes representing several major bacterial phyla. Our SPR-based approach also allowed direct inference of highways of gene transfer between bacterial classes and genera. A Small number of these highways connect genera in different phyla and can highlight specific genes implicated in long-distance LGT. [Lateral gene transfer; matrix representation with parsimony; phylogenomics; prokaryotic phylogeny; Robinson-Foulds; subtree prune-and-regraft; supertrees.].
Collapse
Affiliation(s)
- Christopher Whidden
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, PO Box 15000, Halifax, Nova Scotia, Canada B3H 4R2
| | - Norbert Zeh
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, PO Box 15000, Halifax, Nova Scotia, Canada B3H 4R2
| | - Robert G Beiko
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, PO Box 15000, Halifax, Nova Scotia, Canada B3H 4R2
| |
Collapse
|
5
|
Bansal MS, Eulenstein O. Algorithms for genome-scale phylogenetics using gene tree parsimony. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:939-956. [PMID: 24334388 DOI: 10.1109/tcbb.2013.103] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
The use of genomic data sets for phylogenetics is complicated by the fact that evolutionary processes such as gene duplication and loss, or incomplete lineage sorting (deep coalescence) cause incongruence among gene trees. One well-known approach that deals with this complication is gene tree parsimony, which, given a collection of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, a lack of efficient algorithms has limited the use of this approach. Here, we present efficient algorithms for SPR and TBR-based local search heuristics for gene tree parsimony under the 1) duplication, 2) loss, 3) duplication-loss, and 4) deep coalescence reconciliation costs. These novel algorithms improve upon the time complexities of previous algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. Our algorithms provide a substantial improvement in runtime and scalability compared to previous implementations and enable large-scale gene tree parsimony analyses using any of the four reconciliation costs. Our algorithms have been implemented in the software packages DupTree and iGTP, and have already been used to perform several compelling phylogenetic studies.
Collapse
|
6
|
Górecki P, Eulenstein O, Tiuryn J. Unrooted tree reconciliation: a unified approach. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:522-536. [PMID: 23929875 DOI: 10.1109/tcbb.2013.22] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
Tree comparison functions are widely used in phylogenetics for comparing evolutionary trees. Unrooted trees can be compared with rooted trees by identifying all rootings of the unrooted tree that minimize some provided comparison function between two rooted trees. The plateau property is satisfied by the provided function, if all optimal rootings form a subtree, or plateau, in the unrooted tree, from which the rootings along every path toward a leaf have monotonically increasing costs. This property is sufficient for the linear-time identification of all optimal rootings and rooting costs. However, the plateau property has only been proven for a few rooted comparison functions, requiring individual proofs for each function without benefitting from inherent structural features of such functions. Here, we introduce the consistency condition that is sufficient for a general function to satisfy the plateau property. For consistent functions, we introduce general linear-time solutions that identify optimal rootings and all rooting costs. Further, we identify novel relationships between consistent functions in terms of plateaus, especially the plateau of the well-studied duplication-loss function is part of a plateau of every other consistent function. We introduce a novel approach for identifying consistent cost functions by defining a formal language of Boolean costs. Formulas in this language can be interpreted as cost functions. Finally, we demonstrate the performance of our general linear-time solutions in practice using empirical and simulation studies.
Collapse
Affiliation(s)
- Pawel Górecki
- Department of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Mazowieckie 02-097, Poland
| | | | | |
Collapse
|
7
|
Chaudhary R, Burleigh JG, Fernández-Baca D. Fast local search for unrooted Robinson-Foulds supertrees. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:1004-1013. [PMID: 22431553 DOI: 10.1109/tcbb.2012.47] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
A Robinson-Foulds (RF) supertree for a collection of input trees is a tree containing all the species in the input trees that is at minimum total RF distance to the input trees. Thus, an RF supertree is consistent with the maximum number of splits in the input trees. Constructing RF supertrees for rooted and unrooted data is NP-hard. Nevertheless, effective local search heuristics have been developed for the restricted case where the input trees and the supertree are rooted. We describe new heuristics, based on the Edge Contract and Refine (ECR) operation, that remove this restriction, thereby expanding the utility of RF supertrees. Our experimental results on simulated and empirical data sets show that our unrooted local search algorithms yield better supertrees than those obtained from MRP and rooted RF heuristics in terms of total RF distance to the input trees and, for simulated data, in terms of RF distance to the true tree.
Collapse
Affiliation(s)
- Ruchi Chaudhary
- Department of Computer Science, Iowa State University, Atanasoff Hall, Ames, IA 50011-1041, USA.
| | | | | |
Collapse
|
8
|
Davis RB, Nicholson DB, Saunders ELR, Mayhew PJ. Fossil gaps inferred from phylogenies alter the apparent nature of diversification in dragonflies and their relatives. BMC Evol Biol 2011; 11:252. [PMID: 21917167 PMCID: PMC3179963 DOI: 10.1186/1471-2148-11-252] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2011] [Accepted: 09/14/2011] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND The fossil record has suggested that clade growth may differ in marine and terrestrial taxa, supporting equilibrial models in the former and expansionist models in the latter. However, incomplete sampling may bias findings based on fossil data alone. To attempt to correct for such bias, we assemble phylogenetic supertrees on one of the oldest clades of insects, the Odonatoidea (dragonflies, damselflies and their extinct relatives), using MRP and MRC. We use the trees to determine when, and in what clades, changes in taxonomic richness have occurred. We then test whether equilibrial or expansionist models are supported by fossil data alone, and whether findings differ when phylogenetic information is used to infer gaps in the fossil record. RESULTS There is broad agreement in family-level relationships between both supertrees, though with some uncertainty along the backbone of the tree regarding dragonflies (Anisoptera). "Anisozygoptera" are shown to be paraphyletic when fossil information is taken into account. In both trees, decreases in net diversification are associated with species-poor extant families (Neopetaliidae, Hemiphlebiidae), and an upshift is associated with Calopterygidae + Polythoridae. When ghost ranges are inferred from the fossil record, many families are shown to have much earlier origination dates. In a phylogenetic context, the number of family-level lineages is shown to be up to twice as high as the fossil record alone suggests through the Cretaceous and Cenozoic, and a logistic increase in richness is detected in contrast to an exponential increase indicated by fossils alone. CONCLUSIONS Our analysis supports the notion that taxa, which appear to have diversified exponentially using fossil data, may in fact have diversified more logistically. This in turn suggests that one of the major apparent differences between the marine and terrestrial fossil record may simply be an artifact of incomplete sampling. Our results also support previous notions that adult colouration plays an important role in odonate radiation, and that Anisozygoptera should be grouped in a single inclusive taxon with Anisoptera, separate from Zygoptera.
Collapse
Affiliation(s)
- Robert B Davis
- Department of Biology, University of York, York, YO10 5YW, UK
- Department of Zoology, University of Tartu, Vanemuise 46, EE-51014 Tartu, Estonia
| | - David B Nicholson
- Department of Biology, University of York, York, YO10 5YW, UK
- Department of Palaeontology, The Natural History Museum, Cromwell Road, London, SW7 5BD, UK
- National Museums of Scotland, Department of Natural Sciences, Edinburgh, Midlothian, EH1 1JF, UK
| | | | - Peter J Mayhew
- Department of Biology, University of York, York, YO10 5YW, UK
| |
Collapse
|
9
|
Swenson MS, Suri R, Linder CR, Warnow T. An experimental study of Quartets MaxCut and other supertree methods. Algorithms Mol Biol 2011; 6:7. [PMID: 21504600 PMCID: PMC3101644 DOI: 10.1186/1748-7188-6-7] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2010] [Accepted: 04/19/2011] [Indexed: 12/01/2022] Open
Abstract
BACKGROUND Supertree methods represent one of the major ways by which the Tree of Life can be estimated, but despite many recent algorithmic innovations, matrix representation with parsimony (MRP) remains the main algorithmic supertree method. RESULTS We evaluated the performance of several supertree methods based upon the Quartets MaxCut (QMC) method of Snir and Rao and showed that two of these methods usually outperform MRP and five other supertree methods that we studied, under many realistic model conditions. However, the QMC-based methods have scalability issues that may limit their utility on large datasets. We also observed that taxon sampling impacted supertree accuracy, with poor results obtained when all of the source trees were only sparsely sampled. Finally, we showed that the popular optimality criterion of minimizing the total topological distance of the supertree to the source trees is only weakly correlated with supertree topological accuracy. Therefore evaluating supertree methods on biological datasets is problematic. CONCLUSIONS Our results show that supertree methods that improve upon MRP are possible, and that an effort should be made to produce scalable and robust implementations of the most accurate supertree methods. Also, because topological accuracy depends upon taxon sampling strategies, attempts to construct very large phylogenetic trees using supertree methods should consider the selection of source tree datasets, as well as supertree methods. Finally, since supertree topological error is only weakly correlated with the supertree's topological distance to its source trees, development and testing of supertree methods presents methodological challenges.
Collapse
Affiliation(s)
- M Shel Swenson
- Department of Computer Science, The University of Texas at Austin, Austin TX, USA
| | - Rahul Suri
- Department of Computer Science, The University of Texas at Austin, Austin TX, USA
| | - C Randal Linder
- Section of Integrative Biology, The University of Texas at Austin, Austin TX, USA
| | - Tandy Warnow
- Department of Computer Science, The University of Texas at Austin, Austin TX, USA
| |
Collapse
|
10
|
Bansal MS, Burleigh JG, Eulenstein O. Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models. BMC Bioinformatics 2010; 11 Suppl 1:S42. [PMID: 20122216 PMCID: PMC3009515 DOI: 10.1186/1471-2105-11-s1-s42] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022] Open
Abstract
Background Genomic data provide a wealth of new information for phylogenetic analysis. Yet making use of this data requires phylogenetic methods that can efficiently analyze extremely large data sets and account for processes of gene evolution, such as gene duplication and loss, incomplete lineage sorting (deep coalescence), or horizontal gene transfer, that cause incongruence among gene trees. One such approach is gene tree parsimony, which, given a set of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, the only existing algorithms for gene tree parsimony under the duplication-loss or deep coalescence reconciliation cost are prohibitively slow for large datasets. Results We describe novel algorithms for SPR and TBR based local search heuristics under the duplication-loss cost, and we show how they can be adapted for the deep coalescence cost. These algorithms improve upon the best existing algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. We implemented our new SPR based local search algorithm for the duplication-loss cost and demonstrate the tremendous improvement in runtime and scalability it provides compared to existing implementations. We also evaluate the performance of our algorithm on three large-scale genomic data sets. Conclusion Our new algorithms enable, for the first time, gene tree parsimony analyses of thousands of genes from hundreds of taxa using the duplication-loss and deep coalescence reconciliation costs. Thus, this work expands both the size of data sets and the range of evolutionary models that can be incorporated into genome-scale phylogenetic analyses.
Collapse
Affiliation(s)
- Mukul S Bansal
- School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel.
| | | | | |
Collapse
|
11
|
Swenson MS, Barbançon F, Warnow T, Linder CR. A simulation study comparing supertree and combined analysis methods using SMIDGen. Algorithms Mol Biol 2010; 5:8. [PMID: 20047664 PMCID: PMC2837663 DOI: 10.1186/1748-7188-5-8] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2009] [Accepted: 01/04/2010] [Indexed: 11/16/2022] Open
Abstract
Background Supertree methods comprise one approach to reconstructing large molecular phylogenies given multi-marker datasets: trees are estimated on each marker and then combined into a tree (the "supertree") on the entire set of taxa. Supertrees can be constructed using various algorithmic techniques, with the most common being matrix representation with parsimony (MRP). When the data allow, the competing approach is a combined analysis (also known as a "supermatrix" or "total evidence" approach) whereby the different sequence data matrices for each of the different subsets of taxa are concatenated into a single supermatrix, and a tree is estimated on that supermatrix. Results In this paper, we describe an extensive simulation study we performed comparing two supertree methods, MRP and weighted MRP, to combined analysis methods on large model trees. A key contribution of this study is our novel simulation methodology (Super-Method Input Data Generator, or SMIDGen) that better reflects biological processes and the practices of systematists than earlier simulations. We show that combined analysis based upon maximum likelihood outperforms MRP and weighted MRP, giving especially big improvements when the largest subtree does not contain most of the taxa. Conclusions This study demonstrates that MRP and weighted MRP produce distinctly less accurate trees than combined analyses for a given base method (maximum parsimony or maximum likelihood). Since there are situations in which combined analyses are not feasible, there is a clear need for better supertree methods. The source tree and combined datasets used in this study can be used to test other supertree and combined analysis methods.
Collapse
|
12
|
Swenson MS, Suri R, Linder CR, Warnow T. An Experimental Study of Quartets MaxCut and Other Supertree Methods. LECTURE NOTES IN COMPUTER SCIENCE 2010. [DOI: 10.1007/978-3-642-15294-8_24] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
|
13
|
Bansal MS, Eulenstein O, Wehe A. The gene-duplication problem: near-linear time algorithms for NNI-based local searches. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2009; 6:221-231. [PMID: 19407347 DOI: 10.1109/tcbb.2009.7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/27/2023]
Abstract
The gene-duplication problem is to infer a species supertree from a collection of gene trees that are confounded by complex histories of gene-duplication events. This problem is NP-complete and thus requires efficient and effective heuristics. Existing heuristics perform a stepwise search of the tree space, where each step is guided by an exact solution to an instance of a local search problem. A classical local search problem is the {\tt NNI} search problem, which is based on the nearest neighbor interchange operation. In this work, we 1) provide a novel near-linear time algorithm for the {\tt NNI} search problem, 2) introduce extensions that significantly enlarge the search space of the {\tt NNI} search problem, and 3) present algorithms for these extended versions that are asymptotically just as efficient as our algorithm for the {\tt NNI} search problem. The exceptional speedup achieved in the extended {\tt NNI} search problems makes the gene-duplication problem more tractable for large-scale phylogenetic analyses. We verify the performance of our algorithms in a comparison study using sets of large randomly generated gene trees.
Collapse
Affiliation(s)
- Mukul S Bansal
- Department of Computer Science, Iowa StateUniversity, Ames, IA 50011, USA.
| | | | | |
Collapse
|
14
|
Abstract
BACKGROUND There is much interest in developing fast and accurate supertree methods to infer the tree of life. Supertree methods combine smaller input trees with overlapping sets of taxa to make a comprehensive phylogenetic tree that contains all of the taxa in the input trees. The intrinsically hard triplet supertree problem takes a collection of input species trees and seeks a species tree (supertree) that maximizes the number of triplet subtrees that it shares with the input trees. However, the utility of this supertree problem has been limited by a lack of efficient and effective heuristics. RESULTS We introduce fast hill-climbing heuristics for the triplet supertree problem that perform a step-wise search of the tree space, where each step is guided by an exact solution to an instance of a local search problem. To realize time efficient heuristics we designed the first nontrivial algorithms for two standard search problems, which greatly improve on the time complexity to the best known (naïve) solutions by a factor of n and n2 (the number of taxa in the supertree). These algorithms enable large-scale supertree analyses based on the triplet supertree problem that were previously not possible. We implemented hill-climbing heuristics that are based on our new algorithms, and in analyses of two published supertree data sets, we demonstrate that our new heuristics outperform other standard supertree methods in maximizing the number of triplets shared with the input trees. CONCLUSION With our new heuristics, the triplet supertree problem is now computationally more tractable for large-scale supertree analyses, and it provides a potentially more accurate alternative to existing supertree methods.
Collapse
Affiliation(s)
- Harris T Lin
- Department of Computer Science, Iowa State University, Ames, IA, USA.
| | | | | |
Collapse
|
15
|
Scornavacca C, Berry V, Lefort V, Douzery EJP, Ranwez V. PhySIC_IST: cleaning source trees to infer more informative supertrees. BMC Bioinformatics 2008; 9:413. [PMID: 18834542 PMCID: PMC2576265 DOI: 10.1186/1471-2105-9-413] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2008] [Accepted: 10/04/2008] [Indexed: 11/23/2022] Open
Abstract
Background Supertree methods combine phylogenies with overlapping sets of taxa into a larger one. Topological conflicts frequently arise among source trees for methodological or biological reasons, such as long branch attraction, lateral gene transfers, gene duplication/loss or deep gene coalescence. When topological conflicts occur among source trees, liberal methods infer supertrees containing the most frequent alternative, while veto methods infer supertrees not contradicting any source tree, i.e. discard all conflicting resolutions. When the source trees host a significant number of topological conflicts or have a small taxon overlap, supertree methods of both kinds can propose poorly resolved, hence uninformative, supertrees. Results To overcome this problem, we propose to infer non-plenary supertrees, i.e. supertrees that do not necessarily contain all the taxa present in the source trees, discarding those whose position greatly differs among source trees or for which insufficient information is provided. We detail a variant of the PhySIC veto method called PhySIC_IST that can infer non-plenary supertrees. PhySIC_IST aims at inferring supertrees that satisfy the same appealing theoretical properties as with PhySIC, while being as informative as possible under this constraint. The informativeness of a supertree is estimated using a variation of the CIC (Cladistic Information Content) criterion, that takes into account both the presence of multifurcations and the absence of some taxa. Additionally, we propose a statistical preprocessing step called STC (Source Trees Correction) to correct the source trees prior to the supertree inference. STC is a liberal step that removes the parts of each source tree that significantly conflict with other source trees. Combining STC with a veto method allows an explicit trade-off between veto and liberal approaches, tuned by a single parameter. Performing large-scale simulations, we observe that STC+PhySIC_IST infers much more informative supertrees than PhySIC, while preserving low type I error compared to the well-known MRP method. Two biological case studies on animals confirm that the STC preprocess successfully detects anomalies in the source trees while STC+PhySIC_IST provides well-resolved supertrees agreeing with current knowledge in systematics. Conclusion The paper introduces and tests two new methodologies, PhySIC_IST and STC, that demonstrate the interest in inferring non-plenary supertrees as well as preprocessing the source trees. An implementation of the methods is available at: .
Collapse
Affiliation(s)
- Celine Scornavacca
- Institut des Sciences de l'Evolution (ISEM, UMR 5554 CNRS), Université Montpellier II, Place E, Bataillon - CC 064 - 34095 Montpellier Cedex 5, France.
| | | | | | | | | |
Collapse
|
16
|
Bansal MS, Eulenstein O. An Omega(n2/ log n) speed-up of TBR heuristics for the gene-duplication problem. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2008; 5:514-524. [PMID: 18989039 DOI: 10.1109/tcbb.2008.69] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/27/2023]
Abstract
The gene-duplication problem is to infer a species supertree from gene trees that are confounded by complex histories of gene duplications. This problem is NP-hard and thus requires efficient and effective heuristics. Existing heuristics perform a stepwise search of the tree space, where each step is guided by an exact solution to an instance of a local search problem. We improve on the time complexity of the local search problem by a factor of n2= log n, where n is the size of the resulting species supertree. Typically, several thousand instances of the local search problem are solved throughout a stepwise heuristic search. Hence, our improvement makes the gene-duplication problem much more tractable for large-scale phylogenetic analyses.
Collapse
Affiliation(s)
- Mukul S Bansal
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
| | | |
Collapse
|
17
|
Bansal MS, Eulenstein O. An Ω(n 2/logn) Speed-Up of Heuristics for the Gene-Duplication Problem. LECTURE NOTES IN COMPUTER SCIENCE 2007. [DOI: 10.1007/978-3-540-74126-8_12] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/07/2023]
|