1
|
Abstract
In the tree reconciliation approach for species tree inference, a tree that has the minimum reconciliation score for given gene trees is taken as an estimate of the species tree. The scoring models used in existing tree reconciliation methods include the duplication, mutation, and deep coalescence costs. Since existing inference methods all are heuristic, their performances are often evaluated by using the Robinson-Foulds (RF) distance between the true species trees and the estimates output on simulated multi-locus datasets. To better understand these methods, we study the relationships between the duplication cost and the RF distance. We prove that the gap between the duplication cost and the RF distance is unbounded, but the symmetric duplication cost is logarithmically equivalent to the RF distance. The relationships between other reconciliation costs and the RF distance are also investigated.
Collapse
Affiliation(s)
- Yu Zheng
- Department of Mathematics, National University of Singapore, Singapore
| | - Louxin Zhang
- Department of Mathematics, National University of Singapore, Singapore
- Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore
| |
Collapse
|
2
|
Wehe A, Burleigh JG, Eulenstein O. Efficient algorithms for knowledge-enhanced supertree and supermatrix phylogenetic problems. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1432-1441. [PMID: 24407302 DOI: 10.1109/tcbb.2012.162] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Phylogenetic inference is a computationally difficult problem, and constructing high-quality phylogenies that can build upon existing phylogenetic knowledge and synthesize insights from new data remains a major challenge. We introduce knowledge-enhanced phylogenetic problems for both supertree and supermatrix phylogenetic analyses. These problems seek an optimal phylogenetic tree that can only be assembled from a user-supplied set of, possibly incompatible, phylogenetic relationships. We describe exact polynomial time algorithms for the knowledge-enhanced versions of the NP-hard Robinson Foulds, gene duplication, duplication and loss, and deep coalescence supertree problems. Further, we demonstrate that our algorithms can rapidly improve upon results of local search heuristics for these problems. Finally, we introduce a knowledge-enhanced search heuristic that can be applied to any discrete character data set using the maximum parsimony (MP) phylogenetic problem. Although this approach is not guaranteed to find exact solutions, we show that it also can improve upon solutions from commonly used MP heuristics.
Collapse
Affiliation(s)
- André Wehe
- University of Florida, Gainesville and Iowa State University, Ames
| | | | | |
Collapse
|
3
|
Chang WC, Górecki P, Eulenstein O. Exact solutions for species tree inference from discordant gene trees. J Bioinform Comput Biol 2013; 11:1342005. [PMID: 24131054 DOI: 10.1142/s0219720013420055] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Phylogenetic analysis has to overcome the grant challenge of inferring accurate species trees from evolutionary histories of gene families (gene trees) that are discordant with the species tree along whose branches they have evolved. Two well studied approaches to cope with this challenge are to solve either biologically informed gene tree parsimony (GTP) problems under gene duplication, gene loss, and deep coalescence, or the classic RF supertree problem that does not rely on any biological model. Despite the potential of these problems to infer credible species trees, they are NP-hard. Therefore, these problems are addressed by heuristics that typically lack any provable accuracy and precision. We describe fast dynamic programming algorithms that solve the GTP problems and the RF supertree problem exactly, and demonstrate that our algorithms can solve instances with data sets consisting of as many as 22 taxa. Extensions of our algorithms can also report the number of all optimal species trees, as well as the trees themselves. To better asses the quality of the resulting species trees that best fit the given gene trees, we also compute the worst case species trees, their numbers, and optimization score for each of the computational problems. Finally, we demonstrate the performance of our exact algorithms using empirical and simulated data sets, and analyze the quality of heuristic solutions for the studied problems by contrasting them with our exact solutions.
Collapse
|
4
|
Bansal MS, Eulenstein O. Algorithms for genome-scale phylogenetics using gene tree parsimony. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:939-956. [PMID: 24334388 DOI: 10.1109/tcbb.2013.103] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
The use of genomic data sets for phylogenetics is complicated by the fact that evolutionary processes such as gene duplication and loss, or incomplete lineage sorting (deep coalescence) cause incongruence among gene trees. One well-known approach that deals with this complication is gene tree parsimony, which, given a collection of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, a lack of efficient algorithms has limited the use of this approach. Here, we present efficient algorithms for SPR and TBR-based local search heuristics for gene tree parsimony under the 1) duplication, 2) loss, 3) duplication-loss, and 4) deep coalescence reconciliation costs. These novel algorithms improve upon the time complexities of previous algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. Our algorithms provide a substantial improvement in runtime and scalability compared to previous implementations and enable large-scale gene tree parsimony analyses using any of the four reconciliation costs. Our algorithms have been implemented in the software packages DupTree and iGTP, and have already been used to perform several compelling phylogenetic studies.
Collapse
|
5
|
Doyon JP, Hamel S, Chauve C. An efficient method for exploring the space of gene tree/species tree reconciliations in a probabilistic framework. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:26-39. [PMID: 21464510 DOI: 10.1109/tcbb.2011.64] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
BACKGROUND Inferring an evolutionary scenario for a gene family is a fundamental problem with applications both in functional and evolutionary genomics. The gene tree/species tree reconciliation approach has been widely used to address this problem, but mostly in a discrete parsimony framework that aims at minimizing the number of gene duplications and/or gene losses. Recently, a probabilistic approach has been developed, based on the classical birth-and-death process, including efficient algorithms for computing posterior probabilities of reconciliations and orthology prediction. RESULTS In previous work, we described an algorithm for exploring the whole space of gene tree/species tree reconciliations, that we adapt here to compute efficiently the posterior probability of such reconciliations. These posterior probabilities can be either computed exactly or approximated, depending on the reconciliation space size. We use this algorithm to analyze the probabilistic landscape of the space of reconciliations for a real data set of fungal gene families and several data sets of synthetic gene trees. CONCLUSION The results of our simulations suggest that, with exact gene trees obtained by a simple birth-and-death process and realistic gene duplication/loss rates, a very small subset of all reconciliations needs to be explored in order to approximate very closely the posterior probability of the most likely reconciliations. For cases where the posterior probability mass is more evenly dispersed, our method allows to explore efficiently the required subspace of reconciliations.
Collapse
|
6
|
Bansal MS, Shamir R. A Note on the Fixed Parameter Tractability of the Gene-Duplication Problem. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:848-850. [PMID: 20733245 DOI: 10.1109/tcbb.2010.74] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
The NP-hard gene-duplication problem takes as input a collection of gene trees and seeks a species tree that requires the fewest number of gene duplications to reconcile the input gene trees. An oft-cited, decade-old result by Stege states that the gene-duplication problem is fixed parameter tractable when parameterized by the number of gene duplications necessary for the reconciliation. Here, we uncover an error in this fixed parameter algorithm and show that this error cannot be corrected without sacrificing the fixed parameter tractability of the algorithm. Furthermore, we show a link between the gene-duplication problem and the minimum rooted triplets inconsistency problem which implies that the gene-duplication problem is 1) W[2]-hard when parameterized by the number of gene duplications necessary for the reconciliation and 2) hard to approximate to better than a logarithmic factor.
Collapse
Affiliation(s)
- Mukul S Bansal
- School of Computer Science, Schreiber Bldg., Tel-Aviv University, Tel Aviv 69978, Israel.
| | | |
Collapse
|
7
|
Branch-and-bound approach for parsimonious inference of a species tree from a set of gene family trees. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2011; 696:287-95. [PMID: 21431569 DOI: 10.1007/978-1-4419-7046-6_29] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/31/2023]
Abstract
We describe a Branch-and-Bound algorithm for computing a parsimonious species tree, given a set of gene family trees. Our algorithm can consider three cost measures: number of gene duplications, number of gene losses, and both combined. Moreover, to cope with intrinsic limitations of Branch-and-Bound algorithms for species trees inference regarding the number of taxa that can be considered, our algorithm can naturally take into account predefined relationships between sets of taxa. We test our algorithm on a dataset of eukaryotic gene families spanning 29 taxa.
Collapse
|
8
|
Bansal MS, Burleigh JG, Eulenstein O, Fernández-Baca D. Robinson-Foulds supertrees. Algorithms Mol Biol 2010; 5:18. [PMID: 20181274 PMCID: PMC2846952 DOI: 10.1186/1748-7188-5-18] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2009] [Accepted: 02/24/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees. RESULTS We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Theta(n) and Theta(n2) respectively (where n is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods. CONCLUSIONS Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.
Collapse
Affiliation(s)
- Mukul S Bansal
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
| | - J Gordon Burleigh
- Department of Biology, University of Florida, Gainesville, FL 32611, USA
| | - Oliver Eulenstein
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
| | | |
Collapse
|
9
|
Bansal MS, Burleigh JG, Eulenstein O. Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models. BMC Bioinformatics 2010; 11 Suppl 1:S42. [PMID: 20122216 PMCID: PMC3009515 DOI: 10.1186/1471-2105-11-s1-s42] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022] Open
Abstract
Background Genomic data provide a wealth of new information for phylogenetic analysis. Yet making use of this data requires phylogenetic methods that can efficiently analyze extremely large data sets and account for processes of gene evolution, such as gene duplication and loss, incomplete lineage sorting (deep coalescence), or horizontal gene transfer, that cause incongruence among gene trees. One such approach is gene tree parsimony, which, given a set of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, the only existing algorithms for gene tree parsimony under the duplication-loss or deep coalescence reconciliation cost are prohibitively slow for large datasets. Results We describe novel algorithms for SPR and TBR based local search heuristics under the duplication-loss cost, and we show how they can be adapted for the deep coalescence cost. These algorithms improve upon the best existing algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. We implemented our new SPR based local search algorithm for the duplication-loss cost and demonstrate the tremendous improvement in runtime and scalability it provides compared to existing implementations. We also evaluate the performance of our algorithm on three large-scale genomic data sets. Conclusion Our new algorithms enable, for the first time, gene tree parsimony analyses of thousands of genes from hundreds of taxa using the duplication-loss and deep coalescence reconciliation costs. Thus, this work expands both the size of data sets and the range of evolutionary models that can be incorporated into genome-scale phylogenetic analyses.
Collapse
Affiliation(s)
- Mukul S Bansal
- School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel.
| | | | | |
Collapse
|
10
|
Abstract
BACKGROUND There is much interest in developing fast and accurate supertree methods to infer the tree of life. Supertree methods combine smaller input trees with overlapping sets of taxa to make a comprehensive phylogenetic tree that contains all of the taxa in the input trees. The intrinsically hard triplet supertree problem takes a collection of input species trees and seeks a species tree (supertree) that maximizes the number of triplet subtrees that it shares with the input trees. However, the utility of this supertree problem has been limited by a lack of efficient and effective heuristics. RESULTS We introduce fast hill-climbing heuristics for the triplet supertree problem that perform a step-wise search of the tree space, where each step is guided by an exact solution to an instance of a local search problem. To realize time efficient heuristics we designed the first nontrivial algorithms for two standard search problems, which greatly improve on the time complexity to the best known (naïve) solutions by a factor of n and n2 (the number of taxa in the supertree). These algorithms enable large-scale supertree analyses based on the triplet supertree problem that were previously not possible. We implemented hill-climbing heuristics that are based on our new algorithms, and in analyses of two published supertree data sets, we demonstrate that our new heuristics outperform other standard supertree methods in maximizing the number of triplets shared with the input trees. CONCLUSION With our new heuristics, the triplet supertree problem is now computationally more tractable for large-scale supertree analyses, and it provides a potentially more accurate alternative to existing supertree methods.
Collapse
Affiliation(s)
- Harris T Lin
- Department of Computer Science, Iowa State University, Ames, IA, USA.
| | | | | |
Collapse
|