1
Li W, Koshkarov A, Tahiri N. Comparison of phylogenetic trees defined on different but mutually overlapping sets of taxa: A review. Ecol Evol 2024; 14:e70054. [PMID: 39119174 PMCID: PMC11307105 DOI: 10.1002/ece3.70054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2023] [Revised: 07/03/2024] [Accepted: 07/10/2024] [Indexed: 08/10/2024] Open
Abstract
Phylogenetic trees represent the evolutionary relationships and ancestry of various species or groups of organisms. Comparing these trees by measuring the distance between them is essential for applications such as tree clustering and the Tree of Life project. Many distance metrics for phylogenetic trees focus on trees defined on the same set of taxa. However, some problems require calculating distances between trees with different but overlapping sets of taxa. This study reviews state-of-the-art distance measures for such trees, covering six major approaches, including the constraint-based Robinson-Foulds (RF) distance RF(-), the completion-based RF(+), the generalized RF (GRF), the dissimilarity measure, the vectorial tree distance, and the geodesic distance in the extended Billera-Holmes-Vogtmann tree space. Among these, three RF-based methods, RF(-), RF(+), and GRF, were examined in detail on generated clusters of phylogenetic trees defined on different but mutually overlapping sets of taxa. Additionally, we reviewed nine related techniques, including leaf imputation methods, the tree edit distance, and visual comparison. A comparison of the related distance measures, highlighting their principal advantages and shortcomings, is provided. This review offers valuable insights into their applicability and performance, guiding the appropriate use of these metrics based on tree type (rooted or unrooted) and information type (topological or branch lengths).
Affiliation(s)
- Wanlin Li
- Department of Computer Science, University of Sherbrooke, Sherbrooke, Quebec, Canada
- Aleksandr Koshkarov
- Department of Computer Science, University of Sherbrooke, Sherbrooke, Quebec, Canada
- Nadia Tahiri
- Department of Computer Science, University of Sherbrooke, Sherbrooke, Quebec, Canada
2
Zou Y, Zhang Z, Zeng Y, Hu H, Hao Y, Huang S, Li B. Common Methods for Phylogenetic Tree Construction and Their Implementation in R. Bioengineering (Basel) 2024; 11:480. [PMID: 38790347 PMCID: PMC11117635 DOI: 10.3390/bioengineering11050480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Revised: 05/04/2024] [Accepted: 05/07/2024] [Indexed: 05/26/2024] Open
Abstract
Phylogenetic trees reflect the evolutionary relationships between species or gene families and play a critical role in modern biological research. In this review, we summarize common methods for constructing phylogenetic trees, including distance methods, maximum parsimony, maximum likelihood, Bayesian inference, and tree-integration methods (supermatrix and supertree). We discuss the advantages, shortcomings, and applications of each method and offer relevant code to construct phylogenetic trees from molecular data using packages and algorithms in R. This review aims to provide comprehensive guidance and reference for researchers seeking to construct phylogenetic trees while also promoting further development and innovation in this field. By offering a clear and concise overview of the different methods available, we hope to enable researchers to select the most appropriate approach for their specific research questions and datasets.
Affiliation(s)
- Yue Zou
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
- Zixuan Zhang
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
- Yujie Zeng
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
- Hanyue Hu
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
- Youjin Hao
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
- Sheng Huang
- Animal Nutrition Institute, Chongqing Academy of Animal Science, Chongqing 402460, China
- Bo Li
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
3
Markin A, Wagle S, Anderson TK, Eulenstein O. RF-Net 2: fast inference of virus reassortment and hybridization networks. Bioinformatics 2022; 38:2144-2152. [PMID: 35150239 PMCID: PMC9004648 DOI: 10.1093/bioinformatics/btac075] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/05/2021] [Revised: 01/26/2022] [Accepted: 02/07/2022] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION A phylogenetic network is a powerful model to represent entangled evolutionary histories with both divergent (speciation) and convergent (e.g. hybridization, reassortment, recombination) evolution. The standard approach to inference of hybridization networks is to (i) reconstruct rooted gene trees and (ii) leverage gene tree discordance for network inference. Recently, we introduced a method called RF-Net for accurate inference of virus reassortment and hybridization networks from input gene trees in the presence of errors commonly found in phylogenetic trees. While RF-Net demonstrated the ability to accurately infer networks with up to four reticulations from erroneous input gene trees, its application was limited by the number of reticulations it could handle in a reasonable amount of time. This limitation is particularly restrictive in the inference of the evolutionary history of segmented RNA viruses such as influenza A virus (IAV), where reassortment is one of the major mechanisms shaping the evolution of these pathogens. RESULTS Here, we expand the functionality of RF-Net that makes it significantly more applicable in practice. Crucially, we introduce a fast extension to RF-Net, called Fast-RF-Net, that can handle large numbers of reticulations without sacrificing accuracy. In addition, we develop automatic stopping criteria to select the appropriate number of reticulations heuristically and implement a feature for RF-Net to output error-corrected input gene trees. We then conduct a comprehensive study of the original method and its novel extensions and confirm their efficacy in practice using extensive simulation and empirical IAV evolutionary analyses. AVAILABILITY AND IMPLEMENTATION RF-Net 2 is available at https://github.com/flu-crew/rf-net-2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alexey Markin
- Virus and Prion Research Unit, National Animal Disease Center, USDA-ARS, Ames, IA 50010, USA
- Sanket Wagle
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
- Tavis K Anderson
- Virus and Prion Research Unit, National Animal Disease Center, USDA-ARS, Ames, IA 50010, USA
- Oliver Eulenstein
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
4
Silva AS, Wilkinson M. On Defining and Finding Islands of Trees and Mitigating Large Island Bias. Syst Biol 2021; 70:1282-1294. [PMID: 33749752 PMCID: PMC8513764 DOI: 10.1093/sysbio/syab015] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/03/2020] [Accepted: 02/24/2021] [Indexed: 11/12/2022] Open
Abstract
How best can we summarize sets of phylogenetic trees? Systematists have relied heavily on consensus methods, but if tree distributions can be partitioned into distinct subsets, it may be helpful to provide separate summaries of these rather than relying entirely upon a single consensus tree. How sets of trees can most helpfully be partitioned and represented leads to many open questions, but one natural partitioning is provided by the islands of trees found during tree searches. Islands that are of dissimilar size have been shown to yield majority-rule consensus trees dominated by the largest sets. We illustrate this large island bias and approaches that mitigate its impact by revisiting a recent analysis of phylogenetic relationships of living and fossil amphibians. We introduce a revised definition of tree islands based on any tree-to-tree pairwise distance metric that usefully extends the notion to any set or multiset of trees, as might be produced by, for example, Bayesian or bootstrap methods, and that facilitates finding tree islands a posteriori. We extract islands from a tree distribution obtained in a Bayesian analysis of the amphibian data to investigate their impact in that context, and we compare the partitioning produced by tree islands with those resulting from some alternative approaches. Distinct subsets of trees, such as tree islands, should be of interest because of what they may reveal about evolution and/or our attempts to understand it, and are an important, sometimes overlooked, consideration when building and interpreting consensus trees. [Amphibia; Bayesian inference; consensus; parsimony; partitions; phylogeny; Chinlestegophis.]
Affiliation(s)
- Ana Serra Silva
- Department of Life Sciences, The Natural History Museum, London SW7 5BD, UK
- School of Earth Sciences, University of Bristol, Bristol BS8 1RL, UK
- Mark Wilkinson
- Department of Life Sciences, The Natural History Museum, London SW7 5BD, UK
5
Yu X, Le T, Christensen SA, Molloy EK, Warnow T. Using Robinson-Foulds supertrees in divide-and-conquer phylogeny estimation. Algorithms Mol Biol 2021; 16:12. [PMID: 34183037 PMCID: PMC8240396 DOI: 10.1186/s13015-021-00189-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2021] [Accepted: 06/05/2021] [Indexed: 12/01/2022] Open
Abstract
One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a "supertree method". Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. Exact-RFS-2 is available in open source form on GitHub at https://github.com/yuxilin51/GreedyRFS.
Affiliation(s)
- Thien Le
- Department of EECS, Massachusetts Institute of Technology, Cambridge, USA
- Sarah A. Christensen
- Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, USA
- Erin K. Molloy
- Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, USA
- Tandy Warnow
- Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, USA
6
Hughes L, Rosenblatt B, Haddad F, Gissane C, McCarthy D, Clarke T, Ferris G, Dawes J, Paton B, Patterson SD. Comparing the Effectiveness of Blood Flow Restriction and Traditional Heavy Load Resistance Training in the Post-Surgery Rehabilitation of Anterior Cruciate Ligament Reconstruction Patients: A UK National Health Service Randomised Controlled Trial. Sports Med 2020; 49:1787-1805. [PMID: 31301034 DOI: 10.1007/s40279-019-01137-2] [Citation(s) in RCA: 122] [Impact Index Per Article: 30.5] [Indexed: 01/03/2023]
Abstract
BACKGROUND We implemented a blood flow restriction resistance training (BFR-RT) intervention during an 8-week rehabilitation programme in anterior cruciate ligament reconstruction (ACLR) patients within a National Health Service setting. OBJECTIVE To compare the effectiveness of BFR-RT and standard-care traditional heavy-load resistance training (HL-RT) at improving skeletal muscle hypertrophy and strength, physical function, pain and effusion in ACLR patients following surgery. METHODS 28 patients scheduled for unilateral ACLR surgery with hamstring autograft were recruited for this parallel-group, two-arm, single-assessor blinded, randomised clinical trial following appropriate power analysis. Following surgery, a criteria-driven approach to rehabilitation was utilised and participants were block randomised to either HL-RT at 70% repetition maximum (1RM) (n = 14) or BFR-RT (n = 14) at 30% 1RM. Participants completed 8 weeks of biweekly unilateral leg press training on both limbs, totalling 16 sessions, alongside standard hospital rehabilitation. Resistance exercise protocols were designed consistent with standard recommended protocols for each type of exercise. Scaled maximal isotonic strength (10RM), muscle morphology of the vastus lateralis of the injured limb, self-reported function, Y-balance test performance and knee joint pain, effusion and range of motion (ROM) were assessed at pre-surgery, post-surgery, mid-training and post-training. Knee joint laxity and scaled maximal isokinetic knee extension and flexion strength at 60°/s, 150°/s and 300°/s were measured at pre-surgery and post-training. RESULTS Four participants were lost, with 24 participants completing the study (12 per group). There were no adverse events or differences between groups for any baseline anthropometric variable or pre- to post-surgery change in any outcome measure. 
Scaled 10RM strength significantly increased in the injured limb (104 ± 30% and 106 ± 43%) and non-injured limb (33 ± 13% and 39 ± 17%) with BFR-RT and HL-RT, respectively, with no group differences. Significant increases in knee extension and flexion peak torque were observed at all speeds in the non-injured limb with no group differences. Significantly greater attenuation of knee extensor peak torque loss at 150°/s and 300°/s and knee flexor torque loss at all speeds was observed with BFR-RT. No group differences in knee extensor peak torque loss were found at 60°/s. Significant and comparable increases in muscle thickness (5.8 ± 0.2% and 6.7 ± 0.3%) and pennation angle (4.1 ± 0.3% and 3.4 ± 0.1%) were observed with BFR-RT and HL-RT, respectively, with no group differences. No significant changes in fascicle length were observed. Significantly greater and clinically important increases in several measures of self-reported function (50-218 ± 48% vs. 35-152 ± 56%), Y-balance performance (18-59 ± 22% vs. 18-33 ± 19%), ROM (78 ± 22% vs. 48 ± 13%) and reductions in knee joint pain (67 ± 15% vs. 39 ± 12%) and effusion (6 ± 2% vs. 2 ± 2%) were observed with BFR-RT compared to HL-RT, respectively. CONCLUSION BFR-RT can improve skeletal muscle hypertrophy and strength to a similar extent to HL-RT with a greater reduction in knee joint pain and effusion, leading to greater overall improvements in physical function. Therefore, BFR-RT may be more appropriate for early rehabilitation in ACLR patient populations within the National Health Service.
Affiliation(s)
- Luke Hughes
- School of Sport, Health and Applied Science, St Mary's University, London, TW1 4SX, UK; Institute of Sport, Exercise and Health, 170 Tottenham Court Road, London, UK
- Fares Haddad
- Institute of Sport, Exercise and Health, 170 Tottenham Court Road, London, UK
- Conor Gissane
- School of Sport, Health and Applied Science, St Mary's University, London, TW1 4SX, UK
- Joanna Dawes
- University College London, Bloomsbury, London, UK
- Bruce Paton
- Institute of Sport, Exercise and Health, 170 Tottenham Court Road, London, UK
7
Bansal MS. Linear-time algorithms for phylogenetic tree completion under Robinson-Foulds distance. Algorithms Mol Biol 2020; 15:6. [PMID: 32313549 PMCID: PMC7155338 DOI: 10.1186/s13015-020-00166-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2020] [Accepted: 04/04/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We consider two fundamental computational problems that arise when comparing phylogenetic trees, rooted or unrooted, with non-identical leaf sets. The first problem arises when comparing two trees where the leaf set of one tree is a proper subset of the other. The second problem arises when the two trees to be compared have only partially overlapping leaf sets. The traditional approach to handling these problems is to first restrict the two trees to their common leaf set. An alternative approach that has shown promise is to first complete the trees by adding missing leaves, so that the resulting trees have identical leaf sets. This requires the computation of an optimal completion that minimizes the distance between the two resulting trees over all possible completions. RESULTS We provide optimal linear-time algorithms for both completion problems under the widely-used Robinson-Foulds (RF) distance measure. Our algorithm for the first problem improves the time complexity of the current fastest algorithm from quadratic (in the size of the two trees) to linear. No algorithms have yet been proposed for the more general second problem where both trees have missing leaves. We advance the study of this general problem by proposing a useful restricted version of the general problem and providing optimal linear-time algorithms for the restricted version. Our experimental results on biological data sets suggest that completion-based RF distances can be very different compared to traditional RF distances.
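The "traditional approach" mentioned in this abstract, restricting both trees to their common leaf set before computing the RF distance, can be sketched as follows. This is a minimal illustration, not the paper's linear-time completion algorithm. It assumes rooted trees written as nested Python tuples with string leaf labels, and the function names (`restrict`, `clusters`, `rf_restricted`) are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Prune leaves outside `keep` and suppress nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(kids)


def clusters(tree):
    """Collect the leaf sets of internal nodes, excluding the root."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(c) for c in node))
        found.add(below)
        return below

    root = visit(tree)
    found.discard(root)
    return found


def rf_restricted(t1, t2):
    """Rooted RF distance after restricting both trees to shared leaves."""
    common = leaves(t1) & leaves(t2)
    if len(common) < 2:
        raise ValueError("trees must share at least two leaves")
    c1 = clusters(restrict(t1, common))
    c2 = clusters(restrict(t2, common))
    return len(c1 ^ c2)  # size of the symmetric difference


t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "c"), "b"), ("d", "f"))
print(rf_restricted(t1, t2))
```

Here leaves `e` and `f` are dropped before comparison, so any disagreement involving them is ignored; that loss of information is what motivates the completion-based alternative described in the abstract.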
8
Markin A, Eulenstein O. Cophenetic Median Trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1459-1470. [PMID: 30222583 DOI: 10.1109/tcbb.2018.2870173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
Median tree inference under path-difference metrics has shown great promise for large-scale phylogeny estimation. Similar to these metrics is the family of cophenetic metrics that originates from a classic dendrogram comparison method introduced more than 50 years ago. Despite the appeal of this family of metrics, the problem of computing median trees under cophenetic metrics has not been analyzed. Like other standard median tree problems relevant in practice, as we show here, this problem is also NP-hard. NP-hard median tree problems have been successfully addressed by local search heuristics that are solving thousands of instances of a corresponding (local neighborhood) search problem. For the local neighborhood search problem under a cophenetic metric, the best known (naïve) algorithm has a time complexity that is typically prohibitive for effective heuristic searches. Building on the pioneering work on path-difference median trees, we develop efficient algorithms for Manhattan and Euclidean cophenetic search problems that improve on the naïve solution by a linear and a quadratic factor, respectively. We demonstrate the performance and effectiveness of the resulting heuristic methods in a comparative study using benchmark empirical datasets.
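For orientation, a cophenetic distance between two rooted trees on the same leaf set can be sketched as below. The sketch assumes the common definition in which the cophenetic vector records, for each pair of leaves, the depth of their lowest common ancestor, plus the depth of each leaf itself, and it assumes unit edge lengths. Trees are nested tuples, and the function names are hypothetical. It computes a single pairwise distance only, not the median-tree search the paper addresses.

```python
def cophenetic_vector(tree):
    """Map each leaf pair to the depth of its lowest common ancestor.

    The key (x, x) stores the depth of leaf x itself. Edges have length 1.
    """
    values = {}

    def visit(node, depth):
        if isinstance(node, str):
            values[(node, node)] = depth
            return [node]
        groups = [visit(child, depth + 1) for child in node]
        # Leaves from different child subtrees meet at this node.
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                for a in groups[i]:
                    for b in groups[j]:
                        values[tuple(sorted((a, b)))] = depth
        return [leaf for group in groups for leaf in group]

    visit(tree, 0)
    return values


def cophenetic_distance(t1, t2, p=1):
    """L^p norm of the difference of the two cophenetic vectors."""
    v1, v2 = cophenetic_vector(t1), cophenetic_vector(t2)
    if v1.keys() != v2.keys():
        raise ValueError("trees must have identical leaf sets")
    total = sum(abs(v1[key] - v2[key]) ** p for key in v1)
    return total ** (1 / p)


t1 = (("a", "b"), "c")
t2 = (("a", "c"), "b")
print(cophenetic_distance(t1, t2, p=1))  # Manhattan variant
print(cophenetic_distance(t1, t2, p=2))  # Euclidean variant
```

Setting `p=1` or `p=2` gives the Manhattan and Euclidean variants named in the abstract.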
9
Markin A, Eulenstein O. Efficient Local Search for Euclidean Path-Difference Median Trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1374-1385. [PMID: 29035224 DOI: 10.1109/tcbb.2017.2763137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Synthesizing large-scale phylogenetic trees is a fundamental problem in evolutionary biology. Median tree problems have evolved as a powerful tool to reconstruct such trees. Such problems seek a median tree for a given collection of input trees under some problem-specific tree distance. There has been an increased interest in the median tree problem for the classical path-difference distance between trees. While this problem is NP-hard, standard local search heuristics have been described that are based on solving a local search problem exactly. For a more effective heuristic we devise a time efficient algorithm for the local search problem that improves on the best-known solution by a factor of $n$, where $n$ is the size of the input trees. Furthermore, we introduce a novel hybrid version of the standard local search that is exploiting our new algorithm for a more refined heuristic search. Finally, we demonstrate the performance of our hybrid heuristic in a comparative study with other commonly used methods that synthesize species trees using published empirical data sets.
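The path-difference distance referred to here compares, for every pair of leaves, the number of edges on the path between them in each tree. A minimal sketch of the Euclidean variant follows. It assumes nested-tuple trees on identical leaf sets and counts edges through the root as written, whereas an unrooted treatment would suppress a degree-two root. The function names are hypothetical, and this computes one pairwise distance rather than the local search described in the abstract.

```python
from itertools import combinations
from math import sqrt


def leaf_paths(tree):
    """Return {leaf: tuple of child indices on the path from the root}."""
    paths = {}

    def visit(node, path):
        if isinstance(node, str):
            paths[node] = path
        else:
            for index, child in enumerate(node):
                visit(child, path + (index,))

    visit(tree, ())
    return paths


def path_lengths(tree):
    """Number of edges between every pair of leaves."""
    paths = leaf_paths(tree)
    lengths = {}
    for a, b in combinations(sorted(paths), 2):
        pa, pb = paths[a], paths[b]
        shared = 0  # length of the common prefix, i.e. depth of the LCA
        for x, y in zip(pa, pb):
            if x != y:
                break
            shared += 1
        lengths[(a, b)] = len(pa) + len(pb) - 2 * shared
    return lengths


def path_difference(t1, t2):
    """Euclidean norm of the difference of the two path-length vectors."""
    d1, d2 = path_lengths(t1), path_lengths(t2)
    if d1.keys() != d2.keys():
        raise ValueError("trees must have identical leaf sets")
    return sqrt(sum((d1[key] - d2[key]) ** 2 for key in d1))


t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
print(path_difference(t1, t2))
```

This naive version spends quadratic time in the number of leaves on each tree pair, which illustrates why repeated evaluation inside a local search benefits from the factor-$n$ improvement the abstract reports.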
10
Fleischauer M, Böcker S. BCD Beam Search: considering suboptimal partial solutions in Bad Clade Deletion supertrees. PeerJ 2018; 6:e4987. [PMID: 29900080 PMCID: PMC5995099 DOI: 10.7717/peerj.4987] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2018] [Accepted: 05/26/2018] [Indexed: 11/20/2022] Open
Abstract
Supertree methods enable the reconstruction of large phylogenies. The supertree problem can be formalized in different ways in order to cope with contradictory information in the input. Some supertree methods are based on encoding the input trees in a matrix; other methods try to find minimum cuts in some graph. Recently, we introduced Bad Clade Deletion (BCD) supertrees which combines the graph-based computation of minimum cuts with optimizing a global objective function on the matrix representation of the input trees. The BCD supertree method has guaranteed polynomial running time and is very swift in practice. The quality of reconstructed supertrees was superior to matrix representation with parsimony (MRP) and usually on par with SuperFine for simulated data; but particularly for biological data, quality of BCD supertrees could not keep up with SuperFine supertrees. Here, we present a beam search extension for the BCD algorithm that keeps alive a constant number of partial solutions in each top-down iteration phase. The guaranteed worst-case running time of the new algorithm is still polynomial in the size of the input. We present an exact and a randomized subroutine to generate suboptimal partial solutions. Both beam search approaches consistently improve supertree quality on all evaluated datasets when keeping 25 suboptimal solutions alive. Supertree quality of the BCD Beam Search algorithm is on par with MRP and SuperFine even for biological data. This is the best performance of a polynomial-time supertree algorithm reported so far.
Affiliation(s)
- Sebastian Böcker
- Chair for Bioinformatics, Friedrich-Schiller-University, Jena, Germany
11
Jansson J, Rajaby R, Shen C, Sung WK. Algorithms for the Majority Rule (+) Consensus Tree and the Frequency Difference Consensus Tree. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:15-26. [PMID: 27662679 DOI: 10.1109/tcbb.2016.2609923] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/06/2023]
Abstract
This article presents two new deterministic algorithms for constructing consensus trees. Given an input of $k$ phylogenetic trees with identical leaf label sets and $n$ leaves each, the first algorithm constructs the majority rule (+) consensus tree in $O(kn)$ time, which is optimal since the input size is $\Omega(kn)$, and the second one constructs the frequency difference consensus tree in $\min\{O(kn^2), O(kn(k + \log^2 n))\}$ time.
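As a point of reference, the cluster set of a majority rule (+) consensus tree can be computed naively as below. The sketch assumes the usual definition: a cluster is kept when the number of input trees containing it exceeds the number of trees containing some cluster incompatible with it. Trees are rooted nested tuples on the same leaf set, and the function names are hypothetical. This brute-force version is far slower than the algorithm the abstract describes and is meant only to make the selection rule concrete.

```python
def clusters(tree):
    """Collect the leaf sets of internal nodes, excluding the root."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(c) for c in node))
        found.add(below)
        return below

    root = visit(tree)
    found.discard(root)
    return found


def compatible(a, b):
    """Two clusters are compatible if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def majority_rule_plus(trees):
    """Return clusters supported by more trees than contradict them."""
    cluster_sets = [clusters(t) for t in trees]
    candidates = set().union(*cluster_sets)
    kept = set()
    for c in candidates:
        support = sum(c in cs for cs in cluster_sets)
        conflict = sum(
            any(not compatible(c, d) for d in cs) for cs in cluster_sets
        )
        if support > conflict:
            kept.add(c)
    return kept


trees = [
    (("a", "b"), "c", "d"),
    ("a", "b", "c", "d"),      # unresolved tree: neither supports nor conflicts
    ("a", "b", "c", "d"),
    (("a", "c"), "b", "d"),
    (("a", "b"), "c", "d"),
]
print(majority_rule_plus(trees))
```

In this example the cluster `{a, b}` appears in only two of five trees, so a plain majority rule would drop it, but it is contradicted by just one tree and is therefore kept under the (+) rule.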
12
Abstract
Supertree methods merge a set of overlapping phylogenetic trees into a supertree containing all taxa of the input trees. The challenge in supertree reconstruction is the way of dealing with conflicting information in the input trees. Many different algorithms for different objective functions have been suggested to resolve these conflicts. In particular, there exist methods based on encoding the source trees in a matrix, where the supertree is constructed applying a local search heuristic to optimize the respective objective function. We present a novel heuristic supertree algorithm called Bad Clade Deletion (BCD) supertrees. It uses minimum cuts to delete a locally minimal number of columns from such a matrix representation so that it is compatible. This is the complement problem to Matrix Representation with Compatibility (Maximum Split Fit). Our algorithm has guaranteed polynomial worst-case running time and performs swiftly in practice. Different from local search heuristics, it guarantees to return the directed perfect phylogeny for the input matrix, corresponding to the parent tree of the input trees, if one exists. Comparing supertrees to model trees for simulated data, BCD shows a better accuracy (F1 score) than the state-of-the-art algorithms SuperFine (up to 3%) and Matrix Representation with Parsimony (up to 7%); at the same time, BCD is up to 7 times faster than SuperFine, and up to 600 times faster than Matrix Representation with Parsimony. Finally, using the BCD supertree as a starting tree for a combined Maximum Likelihood analysis using RAxML, we reach significantly improved accuracy (1% higher F1 score) and running time (1.7-fold speedup).
Affiliation(s)
- Markus Fleischauer
- Chair for Bioinformatics, Institute for Computer Science, Friedrich-Schiller-University Jena, Jena, Germany
- Sebastian Böcker
- Chair for Bioinformatics, Institute for Computer Science, Friedrich-Schiller-University Jena, Jena, Germany
13
McMorris FR, Powers RC. Some axiomatic limitations for consensus and supertree functions on hierarchies. J Theor Biol 2016; 404:342-347. [PMID: 27320681 DOI: 10.1016/j.jtbi.2016.06.016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/14/2016] [Revised: 06/01/2016] [Accepted: 06/13/2016] [Indexed: 10/21/2022]
Abstract
Consensus trees and supertrees are regularly used in systematic biology in order to obtain a summary for the common agreement of the evolutionary relationships among a collection of phylogenetic trees (hierarchies). When every tree is defined on the same set of taxa then consensus functions are used, while if the trees are defined on different sets then supertree functions are used. For both of these situations we will consider some of the limitations that might arise from the placing of singularly reasonable and apparently innocuous conditions on the functions. Previous work is reviewed together with new material. In particular, we consider the impact of axioms requiring that the removal or addition of a tree that contains no, or no new, branching information should not affect the outcome.
Affiliation(s)
- F R McMorris
- Department of Applied Mathematics, Illinois Institute of Technology, Chicago, IL 60616, United States; Department of Mathematics, University of Louisville, Louisville, KY 40292, United States
- Robert C Powers
- Department of Mathematics, University of Louisville, Louisville, KY 40292, United States
14
Goloboff PA, Szumik CA. Problems with supertrees based on the subtree prune-and-regraft distance, with comments on majority rule supertrees. Cladistics 2016; 32:82-89. [PMID: 34732022 DOI: 10.1111/cla.12111] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Accepted: 12/12/2014] [Indexed: 11/26/2022] Open
Abstract
This paper examines a recent proposal to calculate supertrees by minimizing the sum of subtree prune-and-regraft distances to the input trees. The supertrees thus calculated may display groups present in a minority of the input trees but contradicted by the majority, or groups that are not supported by any input tree or combination of input trees. The proponents of the method themselves stated that these are serious problems of "matrix representation with parsimony", but they can in fact occur in their own method. The majority rule supertrees, being explicitly clade-based, cannot have these problems, and seem much more suited to retrieving common clades from a set of trees with different taxon sets. However, it is dubious that so-called majority rule supertrees can always be interpreted as displaying those clades present in (or compatible with) a majority of the trees. The majority rule consensus is always a median tree, in terms of the Robinson-Foulds distances (i.e. it minimizes the sum of Robinson-Foulds distances to the input trees). In contrast, majority rule supertrees may not be median: different, contradictory trees may minimize Robinson-Foulds distances, while their strict consensus does not. If being "majority" results from being median in Robinson-Foulds distances, this means that in the supertree setting a "majority" is ambiguously defined, sometimes achievable only by mutually contradictory trees.
Affiliation(s)
- Pablo A Goloboff
- Unidad Ejecutora Lillo, Consejo Nacional de Investigaciones Científicas y Técnicas, Miguel Lillo 251, 4000, S.M. de Tucumán, Argentina
- Claudia A Szumik
- Unidad Ejecutora Lillo, Consejo Nacional de Investigaciones Científicas y Técnicas, Miguel Lillo 251, 4000, S.M. de Tucumán, Argentina
15
Lafond M, Ouangraoua A, El-Mabrouk N. Reconstructing a SuperGeneTree minimizing reconciliation. BMC Bioinformatics 2015; 16 Suppl 14:S4. [PMID: 26451911 PMCID: PMC4602317 DOI: 10.1186/1471-2105-16-s14-s4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/23/2022] Open
Abstract
Combining a set of trees on partial datasets into a single tree is a classical method for inferring large phylogenetic trees. Ideally, the combined tree should display each input partial tree, which is only possible if input trees do not contain contradictory phylogenetic information. The simplest version of the supertree problem is thus to state whether a set of trees is compatible, and if so, construct a tree displaying them all. Classically, supertree methods have been applied to the reconstruction of species trees. Here we rather consider reconstructing a super gene tree in light of a known species tree S. We define the supergenetree problem as finding, among all supertrees displaying a set of input gene trees, one supertree minimizing a reconciliation distance with S. We first show how classical exact methods to the supertree problem can be extended to the supergenetree problem. As all these methods are highly exponential, we also exhibit a natural greedy heuristic for the duplication cost, based on minimizing the set of duplications preceding the first speciation event. We then show that both the supergenetree problem and its restriction to minimizing duplications preceding the first speciation are NP-hard to approximate within an $n^{1-\epsilon}$ factor, for any $0 < \epsilon < 1$. Finally, we show that a restriction of this problem to uniquely labeled speciation gene trees, which is relevant to many biological applications, is also NP-hard. Therefore, we introduce new avenues in the field of supertrees, and set the theoretical basis for the exploration of various algorithmic aspects of the problems.
|
16
|
Akanni WA, Wilkinson M, Creevey CJ, Foster PG, Pisani D. Implementing and testing Bayesian and maximum-likelihood supertree methods in phylogenetics. ROYAL SOCIETY OPEN SCIENCE 2015; 2:140436. [PMID: 26361544 PMCID: PMC4555849 DOI: 10.1098/rsos.140436] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/10/2014] [Accepted: 07/06/2015] [Indexed: 05/14/2023]
Abstract
Since their advent, supertrees have been increasingly used in large-scale evolutionary studies requiring a phylogenetic framework and substantial efforts have been devoted to developing a wide variety of supertree methods (SMs). Recent advances in supertree theory have allowed the implementation of maximum likelihood (ML) and Bayesian SMs, based on using an exponential distribution to model incongruence between input trees and the supertree. Such approaches are expected to have advantages over commonly used non-parametric SMs, e.g. matrix representation with parsimony (MRP). We investigated new implementations of ML and Bayesian SMs and compared these with some currently available alternative approaches. Comparisons include hypothetical examples previously used to investigate biases of SMs with respect to input tree shape and size, and empirical studies based either on trees harvested from the literature or on trees inferred from phylogenomic scale data. Our results provide no evidence of size or shape biases and demonstrate that the Bayesian method is a viable alternative to MRP and other non-parametric methods. Computation of input tree likelihoods allows the adoption of standard tests of tree topologies (e.g. the approximately unbiased test). The Bayesian approach is particularly useful in providing support values for supertree clades in the form of posterior probabilities.
Affiliation(s)
- Wasiu A. Akanni
- Department of Biology, The National University of Ireland, Maynooth, Co. Kildare, Republic of Ireland
- Department of Life Science, The Natural History Museum, London SW7 5BD, UK
| | - Mark Wilkinson
- Department of Life Science, The Natural History Museum, London SW7 5BD, UK
| | - Christopher J. Creevey
- Institute of Biological, Environmental and Rural Sciences (IBERS), Aberystwyth University, Aberystwyth, Ceredigion SY23 3FG, UK
| | - Peter G. Foster
- Department of Life Science, The Natural History Museum, London SW7 5BD, UK
| | - Davide Pisani
- School of Biological Sciences and School of Earth Sciences, University of Bristol, Life Sciences Building, 24 Tyndall Avenue, Bristol BS8 1TG, UK
- Author for correspondence: Davide Pisani
| |
|
17
|
Whidden C, Zeh N, Beiko RG. Supertrees Based on the Subtree Prune-and-Regraft Distance. Syst Biol 2014; 63:566-81. [PMID: 24695589 PMCID: PMC4055872 DOI: 10.1093/sysbio/syu023] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2013] [Accepted: 03/18/2014] [Indexed: 11/14/2022] Open
Abstract
Supertree methods reconcile a set of phylogenetic trees into a single structure that is often interpreted as a branching history of species. A key challenge is combining conflicting evolutionary histories that are due to artifacts of phylogenetic reconstruction and phenomena such as lateral gene transfer (LGT). Many supertree approaches use optimality criteria that do not reflect underlying processes, have known biases, and may be unduly influenced by LGT. We present the first method to construct supertrees by using the subtree prune-and-regraft (SPR) distance as an optimality criterion. Although calculating the rooted SPR distance between a pair of trees is NP-hard, our new maximum agreement forest-based methods can reconcile trees with hundreds of taxa and >50 transfers in fractions of a second, which enables repeated calculations during the course of an iterative search. Our approach can accommodate trees in which uncertain relationships have been collapsed to multifurcating nodes. Using a series of benchmark datasets simulated under plausible rates of LGT, we show that SPR supertrees are more similar to correct species histories than supertrees based on parsimony or Robinson-Foulds distance criteria. We successfully constructed an SPR supertree from a phylogenomic dataset of 40,631 gene trees that covered 244 genomes representing several major bacterial phyla. Our SPR-based approach also allowed direct inference of highways of gene transfer between bacterial classes and genera. A small number of these highways connect genera in different phyla and can highlight specific genes implicated in long-distance LGT. [Lateral gene transfer; matrix representation with parsimony; phylogenomics; prokaryotic phylogeny; Robinson-Foulds; subtree prune-and-regraft; supertrees.]
Affiliation(s)
- Christopher Whidden
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, PO Box 15000, Halifax, Nova Scotia, Canada B3H 4R2
| | - Norbert Zeh
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, PO Box 15000, Halifax, Nova Scotia, Canada B3H 4R2
| | - Robert G Beiko
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, PO Box 15000, Halifax, Nova Scotia, Canada B3H 4R2
| |
|
18
|
Akanni WA, Creevey CJ, Wilkinson M, Pisani D. L.U.St: a tool for approximated maximum likelihood supertree reconstruction. BMC Bioinformatics 2014; 15:183. [PMID: 24925766 PMCID: PMC4073192 DOI: 10.1186/1471-2105-15-183] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2014] [Accepted: 06/02/2014] [Indexed: 12/29/2022] Open
Abstract
Background Supertrees combine disparate, partially overlapping trees to generate a synthesis that provides a high level perspective that cannot be attained from the inspection of individual phylogenies. Supertrees can be seen as meta-analytical tools that can be used to make inferences based on results of previous scientific studies. Their meta-analytical application has increased in popularity since it was realised that the power of statistical tests for the study of evolutionary trends critically depends on the use of taxon-dense phylogenies. Further to that, supertrees have found applications in phylogenomics where they are used to combine gene trees and recover species phylogenies based on genome-scale data sets. Results Here, we present the L.U.St package, a python tool for approximate maximum likelihood supertree inference and illustrate its application using a genomic data set for the placental mammals. L.U.St allows the calculation of the approximate likelihood of a supertree, given a set of input trees, performs heuristic searches to look for the supertree of highest likelihood, and performs statistical tests of two or more supertrees. To this end, L.U.St implements a winning sites test allowing ranking of a collection of a-priori selected hypotheses, given as a collection of input supertree topologies. It also outputs a file of input-tree-wise likelihood scores that can be used as input to CONSEL for calculation of standard tests of two trees (e.g. Kishino-Hasegawa, Shimodaira-Hasegawa and Approximately Unbiased tests). Conclusion This is the first fully parametric implementation of a supertree method, it has clearly understood properties, and provides several advantages over currently available supertree approaches. It is easy to implement and works on any platform that has python installed. Availability: bitBucket page - https://afro-juju@bitbucket.org/afro-juju/l.u.st.git. Contact: Davide.Pisani@bristol.ac.uk.
Affiliation(s)
| | | | | | - Davide Pisani
- Department of Biology, The National University of Ireland, Maynooth, Maynooth, Kildare, Ireland.
| |
|
19
|
Abstract
Animals deploy various molecular sensors to detect pathogen infections. RIG-like receptor (RLR) proteins identify viral RNAs and initiate innate immune responses. The three human RLRs recognize different types of RNA molecules and protect against different viral pathogens. The RLR protein family is widely thought to have originated shortly before the emergence of vertebrates and rapidly diversified through a complex process of domain grafting. Contrary to these findings, here we show that full-length RLRs and their downstream signaling molecules were present in the earliest animals, suggesting that the RLR-based immune system arose with the emergence of multicellularity. Functional differentiation of RLRs occurred early in animal evolution via simple gene duplication followed by modifications of the RNA-binding pocket, many of which may have been adaptively driven. Functional analysis of human and ancestral RLRs revealed that the ancestral RLR displayed RIG-1-like RNA-binding. MDA5-like binding arose through changes in the RNA-binding pocket following the duplication of the ancestral RLR, which may have occurred either early in Bilateria or later, after deuterostomes split from protostomes. The sensitivity and specificity with which RLRs bind different RNA structures has repeatedly adapted throughout mammalian evolution, suggesting a long-term evolutionary arms race with viral RNA or other molecules.
Affiliation(s)
| | - Bryan Korithoski
- Department of Microbiology and Cell Science, University of Florida
| | | |
|
20
|
Chang WC, Górecki P, Eulenstein O. Exact solutions for species tree inference from discordant gene trees. J Bioinform Comput Biol 2013; 11:1342005. [PMID: 24131054 DOI: 10.1142/s0219720013420055] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Phylogenetic analysis has to overcome the grand challenge of inferring accurate species trees from evolutionary histories of gene families (gene trees) that are discordant with the species tree along whose branches they have evolved. Two well studied approaches to cope with this challenge are to solve either biologically informed gene tree parsimony (GTP) problems under gene duplication, gene loss, and deep coalescence, or the classic RF supertree problem that does not rely on any biological model. Despite the potential of these problems to infer credible species trees, they are NP-hard. Therefore, these problems are addressed by heuristics that typically lack any provable accuracy and precision. We describe fast dynamic programming algorithms that solve the GTP problems and the RF supertree problem exactly, and demonstrate that our algorithms can solve instances with data sets consisting of as many as 22 taxa. Extensions of our algorithms can also report the number of all optimal species trees, as well as the trees themselves. To better assess the quality of the resulting species trees that best fit the given gene trees, we also compute the worst case species trees, their numbers, and optimization score for each of the computational problems. Finally, we demonstrate the performance of our exact algorithms using empirical and simulated data sets, and analyze the quality of heuristic solutions for the studied problems by contrasting them with our exact solutions.
|
21
|
Bansal MS, Eulenstein O. Algorithms for genome-scale phylogenetics using gene tree parsimony. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:939-956. [PMID: 24334388 DOI: 10.1109/tcbb.2013.103] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
The use of genomic data sets for phylogenetics is complicated by the fact that evolutionary processes such as gene duplication and loss, or incomplete lineage sorting (deep coalescence) cause incongruence among gene trees. One well-known approach that deals with this complication is gene tree parsimony, which, given a collection of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, a lack of efficient algorithms has limited the use of this approach. Here, we present efficient algorithms for SPR and TBR-based local search heuristics for gene tree parsimony under the 1) duplication, 2) loss, 3) duplication-loss, and 4) deep coalescence reconciliation costs. These novel algorithms improve upon the time complexities of previous algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. Our algorithms provide a substantial improvement in runtime and scalability compared to previous implementations and enable large-scale gene tree parsimony analyses using any of the four reconciliation costs. Our algorithms have been implemented in the software packages DupTree and iGTP, and have already been used to perform several compelling phylogenetic studies.
|
22
|
Beaulieu JM, Ree RH, Cavender-Bares J, Weiblen GD, Donoghue MJ. Synthesizing phylogenetic knowledge for ecological research. Ecology 2012. [DOI: 10.1890/11-0638.1] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|
23
|
Nguyen N, Mirarab S, Warnow T. MRL and SuperFine+MRL: new supertree methods. Algorithms Mol Biol 2012; 7:3. [PMID: 22280525 PMCID: PMC3308190 DOI: 10.1186/1748-7188-7-3] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2011] [Accepted: 01/26/2012] [Indexed: 11/29/2022] Open
Abstract
Background Supertree methods combine trees on subsets of the full taxon set together to produce a tree on the entire set of taxa. Of the many supertree methods, the most popular is MRP (Matrix Representation with Parsimony), a method that operates by first encoding the input set of source trees by a large matrix (the "MRP matrix") over {0,1, ?}, and then running maximum parsimony heuristics on the MRP matrix. Experimental studies evaluating MRP in comparison to other supertree methods have established that for large datasets, MRP generally produces trees of equal or greater accuracy than other methods, and can run on larger datasets. A recent development in supertree methods is SuperFine+MRP, a method that combines MRP with a divide-and-conquer approach, and produces more accurate trees in less time than MRP. In this paper we consider a new approach for supertree estimation, called MRL (Matrix Representation with Likelihood). MRL begins with the same MRP matrix, but then analyzes the MRP matrix using heuristics (such as RAxML) for 2-state Maximum Likelihood. Results We compared MRP and SuperFine+MRP with MRL and SuperFine+MRL on simulated and biological datasets. We examined the MRP and MRL scores of each method on a wide range of datasets, as well as the resulting topological accuracy of the trees. Our experimental results show that MRL, coupled with a very good ML heuristic such as RAxML, produced more accurate trees than MRP, and MRL scores were more strongly correlated with topological accuracy than MRP scores. Conclusions SuperFine+MRP, when based upon a good MP heuristic, such as TNT, produces among the best scores for both MRP and MRL, and is generally faster and more topologically accurate than other supertree methods we tested.
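The matrix encoding described in this abstract can be illustrated with a minimal sketch. It assumes rooted source trees written as nested Python tuples (a hypothetical representation chosen for brevity, not the input format of any tool named above) and produces one column per non-trivial clade: "1" for taxa inside the clade, "0" for taxa in the same source tree but outside it, and "?" for taxa missing from that tree. The all-zero outgroup row that MRP analyses often add for rooting is omitted here.

```python
from typing import Union

Tree = Union[str, tuple]  # a leaf name, or a tuple of subtrees (rooted tree)


def leaves(tree: Tree) -> frozenset:
    """Return the set of leaf labels below this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree: Tree) -> list:
    """Collect the clusters of internal nodes other than the root."""
    found = []

    def visit(node: Tree, is_root: bool) -> None:
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def mrp_matrix(source_trees: list) -> dict:
    """Map each taxon to its matrix row, one column per source-tree cluster."""
    all_taxa = sorted(frozenset().union(*(leaves(t) for t in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in source_trees:
        tree_taxa = leaves(tree)
        for cluster in clusters(tree):
            for taxon in all_taxa:
                if taxon not in tree_taxa:
                    rows[taxon].append("?")  # taxon absent from this source tree
                elif taxon in cluster:
                    rows[taxon].append("1")  # taxon inside the clade
                else:
                    rows[taxon].append("0")  # taxon in the tree, outside the clade
    return {taxon: "".join(cells) for taxon, cells in rows.items()}


# Two overlapping source trees: ((A,B),C) and ((B,C),D)
matrix = mrp_matrix([(("A", "B"), "C"), (("B", "C"), "D")])
for taxon, row in matrix.items():
    print(taxon, row)
```

MRP and MRL would both start from a matrix of this kind; they differ only in whether parsimony or 2-state maximum likelihood heuristics are then run on it, a step this sketch does not attempt.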
|
24
|
Swenson MS, Suri R, Linder CR, Warnow T. SuperFine: Fast and Accurate Supertree Estimation. Syst Biol 2011; 61:214-27. [DOI: 10.1093/sysbio/syr092] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- M. Shel Swenson
- Department of Computer Science, The University of Texas at Austin, Austin, TX, USA
| | - Rahul Suri
- Department of Computer Science, The University of Texas at Austin, Austin, TX, USA
| | - C. Randal Linder
- Section of Integrative Biology, School of Biological Sciences, The University of Texas at Austin, Austin, TX, USA
| | - Tandy Warnow
- Department of Computer Science, The University of Texas at Austin, Austin, TX, USA
| |
|
25
|
Kupczok A. Split-based computation of majority-rule supertrees. BMC Evol Biol 2011; 11:205. [PMID: 21752249 PMCID: PMC3169514 DOI: 10.1186/1471-2148-11-205] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2010] [Accepted: 07/13/2011] [Indexed: 12/02/2022] Open
Abstract
Background Supertree methods combine overlapping input trees into a larger supertree. Here, I consider split-based supertree methods that first extract the split information of the input trees and subsequently combine this split information into a phylogeny. Well known split-based supertree methods are matrix representation with parsimony and matrix representation with compatibility. Combining input trees on the same taxon set, as in the consensus setting, is a well-studied task and it is thus desirable to generalize consensus methods to supertree methods. Results Here, three variants of majority-rule (MR) supertrees that generalize majority-rule consensus trees are investigated. I provide simple formulas for computing the respective score for bifurcating input- and supertrees. These score computations, together with a heuristic tree search minimizing the scores, were implemented in the python program PluMiST (Plus- and Minus SuperTrees) available from http://www.cibiv.at/software/plumist. The different MR methods were tested by simulation and on real data sets. The search heuristic was successful in combining compatible input trees. When combining incompatible input trees, especially one variant, MR(-) supertrees, performed well. Conclusions The presented framework allows for an efficient score computation of three majority-rule supertree variants and input trees. I combined the score computation with a heuristic search over the supertree space. The implementation was tested by simulation and on real data sets and showed promising results. Especially the MR(-) variant seems to be a reasonable score for supertree reconstruction. Generalizing these computations to multifurcating trees is an open problem, which may be tackled using this framework.
Affiliation(s)
- Anne Kupczok
- Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, University of Veterinary Medicine Vienna, Dr. Bohr-Gasse 9, A-1030 Vienna, Austria.
| |
|
26
|
Kupczok A. Consequences of different null models on the tree shape bias of supertree methods. Syst Biol 2011; 60:218-25. [PMID: 21252387 DOI: 10.1093/sysbio/syq086] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- Anne Kupczok
- Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, University of Veterinary Medicine Vienna, Dr. Bohr-Gasse 9, A-1030 Vienna, Austria.
| |
|
27
|
Affiliation(s)
- F R McMorris
- Department of Applied Mathematics, Illinois Institute of Technology, Chicago, IL 60616-3793, USA.
| | | |
|
28
|
Buerki S, Forest F, Salamin N, Alvarez N. Comparative performance of supertree algorithms in large data sets using the soapberry family (Sapindaceae) as a case study. Syst Biol 2010; 60:32-44. [PMID: 21068445 DOI: 10.1093/sysbio/syq057] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
For the last 2 decades, supertree reconstruction has been an active field of research and has seen the development of a large number of major algorithms. Because of the growing popularity of the supertree methods, it has become necessary to evaluate the performance of these algorithms to determine which are the best options (especially with regard to the supermatrix approach that is widely used). In this study, seven of the most commonly used supertree methods are investigated by using a large empirical data set (in terms of number of taxa and molecular markers) from the worldwide flowering plant family Sapindaceae. Supertree methods were evaluated using several criteria: similarity of the supertrees with the input trees, similarity between the supertrees and the total evidence tree, level of resolution of the supertree and computational time required by the algorithm. Additional analyses were also conducted on a reduced data set to test if the performance levels were affected by the heuristic searches rather than the algorithms themselves. Based on our results, two main groups of supertree methods were identified: on one hand, the matrix representation with parsimony (MRP), MinFlip, and MinCut methods performed well according to our criteria, whereas the average consensus, split fit, and most similar supertree methods showed a poorer performance or at least did not behave the same way as the total evidence tree. Results for the super distance matrix, that is, the most recent approach tested here, were promising with at least one derived method performing as well as MRP, MinFlip, and MinCut. The output of each method was only slightly improved when applied to the reduced data set, suggesting a correct behavior of the heuristic searches and a relatively low sensitivity of the algorithms to data set sizes and missing data. 
Results also showed that the MRP analyses could reach a high level of quality even when using a simple heuristic search strategy, with the exception of MRP with Purvis coding scheme and reversible parsimony. The future of supertrees lies in the implementation of a standardized heuristic search for all methods and the increase in computing power to handle large data sets. The latter would prove to be particularly useful for promising approaches such as the maximum quartet fit method that yet requires substantial computing power.
Affiliation(s)
- Sven Buerki
- Real Jardin Botanico, Department of Biodiversity and Conservation, CSIC, Plaza de Murillo 2, 28014 Madrid, Spain.
| | | | | | | |
|
29
|
Liu L, Yu L, Edwards SV. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol 2010; 10:302. [PMID: 20937096 PMCID: PMC2976751 DOI: 10.1186/1471-2148-10-302] [Citation(s) in RCA: 403] [Impact Index Per Article: 28.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2009] [Accepted: 10/11/2010] [Indexed: 12/01/2022] Open
Abstract
Background Several phylogenetic approaches have been developed to estimate species trees from collections of gene trees. However, maximum likelihood approaches for estimating species trees under the coalescent model are limited. Although the likelihood of a species tree under the multispecies coalescent model has already been derived by Rannala and Yang, it can be shown that the maximum likelihood estimate (MLE) of the species tree (topology, branch lengths, and population sizes) from gene trees under this formula does not exist. In this paper, we develop a pseudo-likelihood function of the species tree to obtain maximum pseudo-likelihood estimates (MPE) of species trees, with branch lengths of the species tree in coalescent units. Results We show that the MPE of the species tree is statistically consistent as the number M of genes goes to infinity. In addition, the probability that the MPE of the species tree matches the true species tree converges to 1 at rate O(M^(-1)). The simulation results confirm that the maximum pseudo-likelihood approach is statistically consistent even when the species tree is in the anomaly zone. We applied our method, Maximum Pseudo-likelihood for Estimating Species Trees (MP-EST) to a mammal dataset. The four major clades found in the MP-EST tree are consistent with those in the Bayesian concatenation tree. The bootstrap supports for the species tree estimated by the MP-EST method are more reasonable than the posterior probability supports given by the Bayesian concatenation method in reflecting the level of uncertainty in gene trees and controversies over the relationship of four major groups of placental mammals. Conclusions MP-EST can consistently estimate the topology and branch lengths (in coalescent units) of the species tree. Although the pseudo-likelihood is derived from coalescent theory, and assumes no gene flow or horizontal gene transfer (HGT), the MP-EST method is robust to a small amount of HGT in the dataset.
In addition, increasing the number of genes does not increase the computational time substantially. The MP-EST method is fast for analyzing datasets that involve a large number of genes but a moderate number of species.
Affiliation(s)
- Liang Liu
- Department of Agriculture and Natural Resources, Delaware State University, Dover, DE 19901, USA.
| | | | | |
|
30
|
Dong J, Fernández-Baca D, McMorris FR, Powers RC. Majority-rule (+) consensus trees. Math Biosci 2010; 228:10-5. [PMID: 20708021 DOI: 10.1016/j.mbs.2010.08.002] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2010] [Revised: 08/02/2010] [Accepted: 08/04/2010] [Indexed: 11/30/2022]
Abstract
The construction of a consensus tree to summarize the information of a given set of phylogenetic trees is now routinely a part of many studies in systematic biology. One popular method is the majority-rule consensus tree. In this paper we introduce and characterize a new consensus method that refines the majority-rule tree by adding certain compatible clusters satisfying a simple criterion.
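For orientation, the plain majority-rule step that this method refines can be sketched as follows. The abstract does not state the criterion used to add further compatible clusters, so the sketch stops at the ordinary majority-rule clusters: those found in more than half of the input trees. Trees are assumed to be rooted, defined on the same taxon set, and written as nested tuples; the function names are hypothetical.

```python
from collections import Counter


def cluster_set(tree) -> set:
    """Return the clusters of all internal nodes except the root."""
    found = set()

    def visit(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    root_cluster = visit(tree)
    found.discard(root_cluster)  # the root cluster is shared by every tree
    return found


def majority_rule_clusters(trees: list) -> set:
    """Keep the clusters that occur in more than half of the input trees."""
    counts = Counter(c for tree in trees for c in cluster_set(tree))
    return {c for c, n in counts.items() if n > len(trees) / 2}


trees = [
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]
majority = majority_rule_clusters(trees)
print(sorted(sorted(c) for c in majority))
```

Clusters that each occur in more than half of the trees are always pairwise compatible, which is why they can be assembled into a single consensus tree.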
Affiliation(s)
- Jianrong Dong
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
| | | | | | | |
|
31
|
Abstract
MOTIVATION Phylogenetic tree-building methods use molecular data to represent the evolutionary history of genes and taxa. A recurrent problem is to reconcile the various phylogenies built from different genomic sequences into a single one. This task is generally conducted by a two-step approach whereby a binary representation of the initial trees is first inferred and then a maximum parsimony (MP) analysis is performed on it. This binary representation uses a decomposition of all source trees that is usually based on clades, but that can also be based on triplets or quartets. The relative performances of these representations have been discussed but are difficult to assess since both are limited to relatively small datasets. RESULTS This article focuses on the triplet-based representation of source trees. We first recall how, using this representation, the parsimony analysis is related to the median tree notion. We then introduce SuperTriplets, a new algorithm that is specially designed to optimize this alternative formulation of the MP criterion. The method avoids several practical limitations of the triplet-based binary matrix representation, making it useful to deal with large datasets. When the correct resolution of every triplet appears more often than the incorrect ones in source trees, SuperTriplets warrants to reconstruct the correct phylogeny. Both simulations and a case study on mammalian phylogenomics confirm the advantages of this approach. In both cases, SuperTriplets tends to propose less resolved but more reliable supertrees than those inferred using M(atrix) Representation with Parsimony. AVAILABILITY Online and JAVA standalone versions of SuperTriplets are available at http://www.supertriplets.univ-montp2.fr/.
Affiliation(s)
- Vincent Ranwez
- Université Montpellier 2, CC064, Place Eugène Bataillon, 34 095 Montpellier Cedex 05, France.
| | | | | |
|
32
|
Bansal MS, Burleigh JG, Eulenstein O, Fernández-Baca D. Robinson-Foulds supertrees. Algorithms Mol Biol 2010; 5:18. [PMID: 20181274 PMCID: PMC2846952 DOI: 10.1186/1748-7188-5-18] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2009] [Accepted: 02/24/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees. RESULTS We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Θ(n) and Θ(n^2) respectively (where n is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods. CONCLUSIONS Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.
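The objective stated in this abstract, the sum of RF distances from a candidate supertree to the input trees, can be sketched as below. The sketch makes several assumptions that the abstract does not spell out: trees are rooted and written as nested tuples, the supertree is first restricted to the leaf set of each input tree, and the RF distance is the unnormalized size of the symmetric difference of the two cluster sets. It only scores a given candidate and does not implement the SPR or TBR search heuristics.

```python
def leaves(tree) -> frozenset:
    """Return the set of leaf labels below this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Prune leaves outside `keep`, suppressing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [sub for sub in (restrict(child, keep) for child in tree)
            if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return tuple(kept)


def cluster_set(tree) -> set:
    """Return the clusters of all internal nodes, root included."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= cluster_set(child)
    return found


def rf_distance(tree_a, tree_b) -> int:
    """Unnormalized RF distance: clusters present in exactly one tree."""
    return len(cluster_set(tree_a) ^ cluster_set(tree_b))


def rf_supertree_score(supertree, input_trees) -> int:
    """Sum of RF distances between each input tree and the restricted supertree."""
    return sum(rf_distance(restrict(supertree, leaves(t)), t)
               for t in input_trees)


inputs = [(("A", "B"), "C"), (("B", "C"), "D")]
good = ((("A", "B"), "C"), "D")   # displays both input trees
worse = ((("A", "C"), "B"), "D")  # contradicts the first input tree
print(rf_supertree_score(good, inputs), rf_supertree_score(worse, inputs))
```

A local search of the kind the abstract describes would repeatedly evaluate such a score over neighbouring supertrees; the reported speed-ups concern doing that evaluation far more efficiently than this direct recomputation.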
Affiliation(s)
- Mukul S Bansal
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
| | - J Gordon Burleigh
- Department of Biology, University of Florida, Gainesville, FL 32611, USA
| | - Oliver Eulenstein
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
| | | |
|
33
|
Dong J, Fernández-Baca D, McMorris FR. Constructing majority-rule supertrees. Algorithms Mol Biol 2010; 5:2. [PMID: 20047658 PMCID: PMC2826330 DOI: 10.1186/1748-7188-5-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2009] [Accepted: 01/04/2010] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND Supertree methods combine the phylogenetic information from multiple partially-overlapping trees into a larger phylogenetic tree called a supertree. Several supertree construction methods have been proposed to date, but most of these are not designed with any specific properties in mind. Recently, Cotton and Wilkinson proposed extensions of the majority-rule consensus tree method to the supertree setting that inherit many of the appealing properties of the former. RESULTS We study a variant of one of Cotton and Wilkinson's methods, called majority-rule (+) supertrees. After proving that a key underlying problem for constructing majority-rule (+) supertrees is NP-hard, we develop a polynomial-size exact integer linear programming formulation of the problem. We then present a data reduction heuristic that identifies smaller subproblems that can be solved independently. While this technique is not guaranteed to produce optimal solutions, it can achieve substantial problem-size reduction. Finally, we report on a computational study of our approach on various real data sets, including the 121-taxon, 7-tree Seabirds data set of Kennedy and Page. CONCLUSIONS The results indicate that our exact method is computationally feasible for moderately large inputs. For larger inputs, our data reduction heuristic makes it feasible to tackle problems that are well beyond the range of the basic integer programming approach. Comparisons between the results obtained by our heuristic and exact solutions indicate that the heuristic produces good answers. Our results also suggest that the majority-rule (+) approach, in both its basic form and with data reduction, yields biologically meaningful phylogenies.
Affiliation(s)
- Jianrong Dong
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
- FR McMorris
- Department of Applied Mathematics, Illinois Institute of Technology, Chicago, IL 60616, USA
|
34
|
Werth MT, Halouska S, Shortridge MD, Zhang B, Powers R. Analysis of metabolomic PCA data using tree diagrams. Anal Biochem 2009; 399:58-63. [PMID: 20026297 DOI: 10.1016/j.ab.2009.12.022] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.1] [Received: 09/14/2009] [Revised: 12/14/2009] [Accepted: 12/14/2009] [Indexed: 01/05/2023]
Abstract
Large amounts of data from high-throughput metabolomic experiments are commonly visualized using a principal component analysis (PCA) two-dimensional scores plot. The question of the similarity or difference between multiple metabolic states then becomes a question of the degree of overlap between their respective data point clusters in principal component (PC) scores space. A qualitative visual inspection of the clustering pattern in PCA scores plots is a common protocol. This article describes the application of tree diagrams and bootstrapping techniques for an improved quantitative analysis of metabolic PCA data clustering. Our PCAtoTree program creates a distance matrix with 100 bootstrap steps that describes the separation of all clusters in a metabolic data set. Using accepted phylogenetic software, the distance matrix resulting from the various metabolic states is organized into a phylogenetic-like tree format, where bootstrap values ≥50 indicate a statistically relevant branch separation. PCAtoTree analysis of two previously published data sets demonstrates the improved resolution of metabolic state differences using tree diagrams. In addition, for metabolomic studies of large numbers of different metabolic states, the tree format provides a better description of similarities and differences between each metabolic state. The approach is also tolerant of sample size variations between different metabolic states.
Affiliation(s)
- Mark T Werth
- Department of Chemistry, Nebraska Wesleyan University, Lincoln, NE 68504, USA
|
35
|
Douglas ME, Douglas MR, Schuett GW, Beck DD, Sullivan BK. Conservation phylogenetics of helodermatid lizards using multiple molecular markers and a supertree approach. Mol Phylogenet Evol 2009; 55:153-167. [PMID: 20006722 DOI: 10.1016/j.ympev.2009.12.009] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Received: 06/30/2009] [Revised: 12/06/2009] [Accepted: 12/07/2009] [Indexed: 11/13/2022]
Abstract
We analyzed both mitochondrial (mt-) and nuclear (n) DNAs in a conservation phylogenetic framework to examine deep and shallow histories of the Beaded Lizard (Heloderma horridum) and Gila Monster (H. suspectum) throughout their geographic ranges in North and Central America. Both mtDNA and intron markers clearly partitioned each species. One intron and mtDNA further subdivided H. horridum into its four recognized subspecies (H. h. alvarezi, charlesbogerti, exasperatum, and horridum). However, the two subspecies of H. suspectum (H. s. suspectum and H. s. cinctum) were undefined. A supertree approach sustained these relationships. Overall, the Helodermatidae is reaffirmed as an ancient and conserved group. Its most recent common ancestor (MRCA) was Lower Eocene [35.4 million years ago (mya)], with an approximately 25 my period of stasis before the MRCA of H. horridum diversified in Lower Miocene. Another approximately 5 my passed before H. h. exasperatum and H. h. horridum diverged, followed by approximately 1.5 my before H. h. alvarezi and H. h. charlesbogerti separated. Heloderma suspectum reflects an even longer period of stasis (approximately 30 my) before diversifying from its MRCA. Both H. suspectum (México) and H. h. alvarezi also revealed evidence of historic range expansion following a recent bottleneck. Our conservation phylogenetic approach emphasizes the origin and diversification of this group, yields information on the manner by which past environmental variance may have impacted its populations and, in turn, allows us to disentangle historic from contemporary impacts that might threaten its long-term persistence. The value of helodermatid conservation resides in natural services and medicinal products, particularly venom constituents, and these are only now being realized.
Affiliation(s)
- Michael E Douglas
- Illinois Natural History Survey, Institute for Natural Resource Sustainability, University of Illinois, Champaign, IL 61820, USA
- Marlis R Douglas
- Illinois Natural History Survey, Institute for Natural Resource Sustainability, University of Illinois, Champaign, IL 61820, USA
- Gordon W Schuett
- Department of Biology and Center for Behavioral Neuroscience, Georgia State University, Atlanta, GA 30303-3088, USA
- Daniel D Beck
- Department of Biological Sciences, Central Washington University, Ellensburg, WA 98926, USA
- Brian K Sullivan
- Division of Mathematics & Natural Sciences, Arizona State University, Phoenix, AZ 85069, USA
|
36
|
Gaubert P, Denys G, Oberdorff T. Genus-level supertree of Cyprinidae (Actinopterygii: Cypriniformes), partitioned qualitative clade support and test of macro-evolutionary scenarios. Biol Rev Camb Philos Soc 2009; 84:653-89. [DOI: 10.1111/j.1469-185x.2009.00091.x] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 11/28/2022]
|
37
|
Affiliation(s)
- Jianrong Dong
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
|
38
|
Torices R, Anderberg AA. Phylogenetic analysis of sexual systems in Inuleae (Asteraceae). Am J Bot 2009; 96:1011-1019. [PMID: 21628252 DOI: 10.3732/ajb.0800231] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 05/30/2023]
Abstract
From an ancestor with bisexual flowers, plants with unisexual flowers, or even unisexual individuals, have evolved in different lineages of angiosperms. The Asteraceae tribe Inuleae includes hermaphroditic, monoecious, dioecious, and gynomonoecious species. Gynomonoecy, the sexual system in which female and bisexual flowers occur on the same plant, is prevalent in the Asteraceae. We inferred one large gene phylogeny (ndhF) and two supertrees to investigate whether gynomonoecy was a stage in the evolution from hermaphroditism to monoecy. We identified transitions in sexual system evolution using the stochastic character mapping method. From gynomonoecious ancestors, both hermaphroditic and monoecious descendants have evolved. Gynomonoecy was not restricted to a stage in the evolution toward monoecy because the number of transitions and the rate of change from monoecy to gynomonoecy were much higher than the opposite. We also investigated one hypothesized association between female flowers and the development of a petaloid ray as an explanation of gynomonoecy maintenance in Asteraceae. We found that peripheral female flowers and petaloid rays were phylogenetically correlated. However, empirical evidence shows that a causal relationship between these traits is not clear.
Affiliation(s)
- Rubén Torices
- Área de Biodiversidad y Conservación, Universidad Rey Juan Carlos, E-28933 Móstoles, Spain
|
39
|
Supertrees join the mainstream of phylogenetics. Trends Ecol Evol 2009; 24:1-3. [DOI: 10.1016/j.tree.2008.08.006] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 06/12/2008] [Revised: 08/12/2008] [Accepted: 08/26/2008] [Indexed: 11/20/2022]
|
40
|
Abstract
We analyze a maximum likelihood approach for combining phylogenetic trees into a larger "supertree." This is based on a simple exponential model of phylogenetic error, which ensures that ML supertrees have a simple combinatorial description (as a median tree, minimizing a weighted sum of distances to the input trees). We show that this approach to ML supertree reconstruction is statistically consistent (it converges on the true species supertree as more input trees are combined), in contrast to the widely used MRP method, which we show can be statistically inconsistent under the exponential error model. We also show that this statistical consistency extends to an ML approach for constructing species supertrees from gene trees. In this setting, incomplete lineage sorting (due to coalescence rates of homologous genes being lower than speciation rates) has been shown to lead to gene trees that are frequently different from species trees, and this can confound efforts to reconstruct the species phylogeny correctly.
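The link between an exponential error model and a median tree can be sketched as follows. The notation is introduced here for illustration and is not taken from the abstract: $T_1,\dots,T_k$ are the input trees on taxon sets $X_1,\dots,X_k$, $T|_{X_i}$ is the candidate supertree restricted to $X_i$, $d$ is some tree metric, and $\beta_i > 0$ is a per-tree weight. The sketch also assumes that the input trees are independent and that the normalizing constants $\alpha_i$ do not depend on $T$.

```latex
\begin{align}
\Pr(T_i \mid T) &= \alpha_i \exp\bigl(-\beta_i\, d(T_i, T|_{X_i})\bigr), \\
\hat{T}_{\mathrm{ML}}
  &= \arg\max_{T} \prod_{i=1}^{k} \Pr(T_i \mid T)
   = \arg\min_{T} \sum_{i=1}^{k} \beta_i\, d(T_i, T|_{X_i}).
\end{align}
```

Taking the negative logarithm of the product turns it into the weighted sum on the right, so under these assumptions the maximum likelihood supertree is a tree that minimizes a weighted sum of distances to the input trees.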
Affiliation(s)
- Mike Steel
- Allan Wilson Centre for Molecular Ecology and Evolution, Department of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
|
41
|
Abstract
The parsimony score of a character on a tree equals the number of state changes required to fit that character onto the tree. We show that for unordered, reversible characters this score equals the number of tree rearrangements required to fit the tree onto the character. We discuss implications of this connection for the debate over the use of consensus trees or total evidence and show how it provides a link between incongruence of characters and recombination.
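As a concrete illustration of the parsimony score defined in the first sentence, the sketch below counts the minimum number of state changes for one unordered character using Fitch's small-parsimony algorithm. Fitch's algorithm is a standard choice made here, not something the abstract names. The rooted binary tree encoded as nested tuples and the function names are likewise assumptions for this example.

```python
# Illustrative sketch: Fitch's algorithm on a rooted binary tree.
# Trees are nested 2-tuples, leaves are strings; `states` maps leaf -> state.

def fitch(tree, states):
    """Return (candidate state set at this node, changes counted below it)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state, so no change is needed here.
        return shared, left_changes + right_changes
    # The children disagree: take the union and count one state change.
    return left_set | right_set, left_changes + right_changes + 1

def parsimony_score(tree, states):
    """Minimum number of state changes needed to fit the character on the tree."""
    return fitch(tree, states)[1]

tree = (("a", "b"), ("c", "d"))
print(parsimony_score(tree, {"a": 0, "b": 0, "c": 1, "d": 1}))  # prints 1
print(parsimony_score(tree, {"a": 0, "b": 1, "c": 0, "d": 1}))  # prints 2
```

The first character is compatible with the tree and needs a single change. The second conflicts with both cherries and needs two.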
Affiliation(s)
- Trevor C Bruen
- Department of Mathematics, University of California, Berkeley, USA
|