1
|
Wu Y, Zhang L. Computing the Bounds of the Number of Reticulations in a Tree-Child Network That Displays a Set of Trees. J Comput Biol 2024; 31:345-359. [PMID: 38285528 DOI: 10.1089/cmb.2023.0309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2024] Open
Abstract
Phylogenetic network is an evolutionary model that uses a rooted directed acyclic graph (instead of a tree) to model an evolutionary history of species in which reticulate events (e.g., hybrid speciation or horizontal gene transfer) occurred. Tree-child network is a kind of phylogenetic network with structural constraints. Existing approaches for tree-child network reconstruction can be slow for large data. In this study, we present several computational approaches for bounding from below the number of reticulations in a tree-child network that displays a given set of rooted binary phylogenetic trees. In addition, we also present some theoretical results on bounding from above the number of reticulations. Through simulation, we demonstrate that the new lower bounds on the reticulation number for tree-child networks can practically be computed for large tree data. The bounds can provide estimates of reticulation for relatively large data.
Collapse
Affiliation(s)
- Yufeng Wu
- Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut, USA
| | - Louxin Zhang
- Department of Mathematics, Center for Data Science and Machine Learning, National University of Singapore, Singapore, Singapore
| |
Collapse
|
2
|
Olver N, Schalekamp F, van der Ster S, Stougie L, van Zuylen A. A duality based 2-approximation algorithm for maximum agreement forest. MATHEMATICAL PROGRAMMING 2022; 198:811-853. [PMID: 36845754 PMCID: PMC9945189 DOI: 10.1007/s10107-022-01790-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/01/2018] [Accepted: 02/14/2022] [Indexed: 06/18/2023]
Abstract
We give a 2-approximation algorithm for the Maximum Agreement Forest problem on two rooted binary trees. This NP-hard problem has been studied extensively in the past two decades, since it can be used to compute the rooted Subtree Prune-and-Regraft (rSPR) distance between two phylogenetic trees. Our algorithm is combinatorial and its running time is quadratic in the input size. To prove the approximation guarantee, we construct a feasible dual solution for a novel exponential-size linear programming formulation. In addition, we show this linear program has a smaller integrality gap than previously known formulations, and we give an equivalent compact formulation, showing that it can be solved in polynomial time.
Collapse
Affiliation(s)
- Neil Olver
- Department of Mathematics, London School of Economics and Political Science, London, United Kingdom
- Centrum Wiskunde & Informatica, Amsterdam, Netherlands
| | - Frans Schalekamp
- School of Operations Research and Information Engineering, Cornell University, Ithaca, USA
| | | | - Leen Stougie
- Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- Department of Operations Analytics, Vrije Universiteit Amsterdam, Amsterdam, Netherlands
- INRIA-Erable, Lyon, France
| | - Anke van Zuylen
- Department of Computer Science, Cornell University, Ithaca, USA
| |
Collapse
|
3
|
Poormohammadi H, Sardari Zarchi M. Netcombin: An algorithm for constructing optimal phylogenetic network from rooted triplets. PLoS One 2020; 15:e0227842. [PMID: 32947609 PMCID: PMC7500971 DOI: 10.1371/journal.pone.0227842] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2019] [Accepted: 09/08/2020] [Indexed: 11/18/2022] Open
Abstract
Phylogenetic networks construction is one the most important challenge in phylogenetics. These networks can present complex non-treelike events such as gene flow, horizontal gene transfers, recombination or hybridizations. Among phylogenetic networks, rooted structures are commonly used to represent the evolutionary history of a species set, explicitly. Triplets are well known input for constructing the rooted networks. Obtaining an optimal rooted network that contains all given triplets is main problem in network construction. The optimality criteria include minimizing the level or the number of reticulation nodes. The complexity of this problem is known to be NP-hard. In this research, a new algorithm called Netcombin is introduced to construct approximately an optimal network which is consistent with input triplets. The innovation of this algorithm is based on binarization and expanding processes. The binarization process innovatively uses a measure to construct a binary rooted tree T consistent with the approximately maximum number of input triplets. Then T is expanded using a heuristic function by adding minimum number of edges to obtain final network with the approximately minimum number of reticulation nodes. In order to evaluate the proposed algorithm, Netcombin is compared with four state of the art algorithms, RPNCH, NCHB, TripNet, and SIMPLISTIC. The experimental results on simulated data obtained from biologically generated sequences data indicate that by considering the trade-off between speed and precision, the Netcombin outperforms the others.
Collapse
Affiliation(s)
- Hadi Poormohammadi
- Department of Computer Engineering, Meybod University, Meybod, Iran
- School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- * E-mail:
| | | |
Collapse
|
4
|
Bansal MS. Linear-time algorithms for phylogenetic tree completion under Robinson-Foulds distance. Algorithms Mol Biol 2020; 15:6. [PMID: 32313549 PMCID: PMC7155338 DOI: 10.1186/s13015-020-00166-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2020] [Accepted: 04/04/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We consider two fundamental computational problems that arise when comparing phylogenetic trees, rooted or unrooted, with non-identical leaf sets. The first problem arises when comparing two trees where the leaf set of one tree is a proper subset of the other. The second problem arises when the two trees to be compared have only partially overlapping leaf sets. The traditional approach to handling these problems is to first restrict the two trees to their common leaf set. An alternative approach that has shown promise is to first complete the trees by adding missing leaves, so that the resulting trees have identical leaf sets. This requires the computation of an optimal completion that minimizes the distance between the two resulting trees over all possible completions. RESULTS We provide optimal linear-time algorithms for both completion problems under the widely-used Robinson-Foulds (RF) distance measure. Our algorithm for the first problem improves the time complexity of the current fastest algorithm from quadratic (in the size of the two trees) to linear. No algorithms have yet been proposed for the more general second problem where both trees have missing leaves. We advance the study of this general problem by proposing a useful restricted version of the general problem and providing optimal linear-time algorithms for the restricted version. Our experimental results on biological data sets suggest that completion-based RF distances can be very different compared to traditional RF distances.
Collapse
|
5
|
Yamada K, Chen ZZ, Wang L. Improved Practical Algorithms for Rooted Subtree Prune and Regraft (rSPR) Distance and Hybridization Number. J Comput Biol 2020; 27:1422-1432. [PMID: 32048865 DOI: 10.1089/cmb.2019.0432] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The problem of computing the rooted subtree prune and regraft (rSPR) distance of two phylogenetic trees is computationally hard and so is the problem of computing the hybridization number of two phylogenetic trees (denoted by Hybridization Number Computation [HNC]). Since they are important problems in phylogenetics, they have been studied extensively in the literature. Indeed, quite a number of exact or approximation algorithms have been designed and implemented for them. In this article, we design and implement several approximation algorithms for them and one exact algorithm for HNC. Our experimental results show that the resulting exact program is much faster (namely, more than 80 times faster for the easiest dataset used in the experiments) than the previous best and its superiority in speed becomes even more significant for more difficult instances. Moreover, the resulting approximation program's output has much better results than the previous bests; indeed, the outputs are always nearly optimal and often optimal. Of particular interest is the usage of the Monte Carlo tree search (MCTS) method in the design of our approximation algorithms. Our experimental results show that with MCTS, we can often solve HNC exactly within short time.
Collapse
Affiliation(s)
- Kohei Yamada
- Division of Information System Design, Tokyo Denki University, Hatoyama, Saitama 350-0394, Japan
| | - Zhi-Zhong Chen
- Division of Information System Design, Tokyo Denki University, Hatoyama, Saitama 350-0394, Japan
| | - Lusheng Wang
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
| |
Collapse
|
6
|
Chen ZZ, Harada Y, Nakamura Y, Wang L. Faster Exact Computation of rSPR Distance via Better Approximation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 17:916-929. [PMID: 30387740 DOI: 10.1109/tcbb.2018.2878731] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Due to hybridization events in evolution, studying two different genes of a set of species may yield two related but different phylogenetic trees for the set of species. In this case, we want to measure the dissimilarity of the two trees. The rooted subtree prune and regraft (rSPR) distance of the two trees has been used for this purpose. The problem of computing the rSPR distance of two given trees has many applications but is NP-hard. Accordingly, a number of programs have been developed for solving the problem either exactly or approximately. In this paper, we develop two new programs one of which solves the problem exactly and outperforms the previous best (namely, Whidden et al.'s rSPR-v1.3.0) significantly, while the other solves the problem approximately and outputs significantly better lower and upper bounds on the rSPR distance of the two given trees than the previous best due to Schalekamp et al. Our programs can be downloaded at http://rnc.r.dendai.ac.jp/rspr.html.
Collapse
|
7
|
Chen ZZ, Feng Q, Shen C, Wang J, Wang L. Algorithms for Pedigree Comparison. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:422-431. [PMID: 27076461 DOI: 10.1109/tcbb.2016.2550434] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Reconstruction of ancestral relationships among genera, species, and populations is a core task in evolutionary biology. At the population level, pedigrees have been commonly used. Reconstruction of pedigree is required in practice due to legal or medical reasons. Pedigrees are very important to geneticists for inferring haplotype segments, recombination, and allele sharing status with which disease loci can be identified. Evaluating reconstruction methods requires comparing the inferred pedigree and the known pedigrees. Moreover, comparison of pedigrees is required in studying relationships among crops such as maize, wheat and barley, etc. In this paper, we discuss three models for comparison of pedigrees, the maximum pedigree isomorphism problem, the maximum paternal-path-preserved mapping problem, and the minimum edge-cutting mapping problem. For the maximum pedigree isomorphism problem, we prove that the problem is NP-hard and give a fixed-parameter algorithm for the problem. For the maximum paternal-path-preserved mapping problem, we give a dynamic-programming algorithm to find the mapping that preserves the maximum number of paternal paths between the two input pedigrees. For the minimum edge-cutting mapping problem, we prove that the problem is NP-hard and give a fixed-parameter algorithm with running time , where is the number of vertices in the two input pedigrees and is the number of edges to be cut. This algorithm is useful in practice when comparing two similar pedigrees.
Collapse
|
8
|
Whidden C, Matsen F. Calculating the Unrooted Subtree Prune-and-Regraft Distance. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:898-911. [PMID: 29994585 DOI: 10.1109/tcbb.2018.2802911] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
The subtree prune-and-regraft (SPR) distance metric is a fundamental way of comparing evolutionary trees. It has wide-ranging applications, such as to study lateral genetic transfer, viral recombination, and Markov chain Monte Carlo phylogenetic inference. Although the rooted version of SPR distance can be computed relatively efficiently between rooted trees using fixed-parameter-tractable maximum agreement forest (MAF) algorithms, no MAF formulation is known for the unrooted case. Correspondingly, previous algorithms are unable to compute unrooted SPR distances larger than 7.
Collapse
|
9
|
Cha IT, Cho ES, Yoo Y, Seok YJ, Park I, Lim HS, Park JM, Roh SW, Nam YD, Choi HJ, Lee YK, Seo MJ. Paenibacillus arcticus sp. nov., isolated from Arctic soil. Int J Syst Evol Microbiol 2017; 67:4385-4389. [DOI: 10.1099/ijsem.0.002299] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Affiliation(s)
- In-Tae Cha
- Division of Bioengineering, Incheon National University, Incheon 22012, Republic of Korea
| | - Eui-Sang Cho
- Department of Bioengineering and Nano-Bioengineering, Graduate School of Incheon National University, Incheon 22012, Republic of Korea
| | - Yesol Yoo
- Department of Life Sciences, Graduate School of Incheon National University, Incheon 22012, Republic of Korea
| | - Yoon Ji Seok
- Department of Life Sciences, Graduate School of Incheon National University, Incheon 22012, Republic of Korea
| | - Inhye Park
- Department of Life Sciences, Graduate School of Incheon National University, Incheon 22012, Republic of Korea
| | - Hee Seon Lim
- Department of Life Sciences, Graduate School of Incheon National University, Incheon 22012, Republic of Korea
| | - Jung-Min Park
- Korean Culture Center of Microorganisms, Seoul 03641, Republic of Korea
| | - Seong Woon Roh
- Microbiology and Functionality Research Group, World Institute of Kimchi, Gwangju 61755, Republic of Korea
| | - Young-Do Nam
- Research Group of Gut Microbiome, Korea Food Research Institute, Seongnam 13539, Republic of Korea
| | - Hak-Jong Choi
- Microbiology and Functionality Research Group, World Institute of Kimchi, Gwangju 61755, Republic of Korea
| | - Yoo Kyung Lee
- Division of Life Sciences, Korea Polar Research Institute, Incheon 21990, Republic of Korea
| | - Myung-Ji Seo
- Division of Bioengineering, Incheon National University, Incheon 22012, Republic of Korea
- Department of Bioengineering and Nano-Bioengineering, Graduate School of Incheon National University, Incheon 22012, Republic of Korea
| |
Collapse
|
10
|
Mai U, Sayyari E, Mirarab S. Minimum variance rooting of phylogenetic trees and implications for species tree reconstruction. PLoS One 2017; 12:e0182238. [PMID: 28800608 PMCID: PMC5553649 DOI: 10.1371/journal.pone.0182238] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2017] [Accepted: 06/25/2017] [Indexed: 12/29/2022] Open
Abstract
Phylogenetic trees inferred using commonly-used models of sequence evolution are unrooted, but the root position matters both for interpretation and downstream applications. This issue has been long recognized; however, whether the potential for discordance between the species tree and gene trees impacts methods of rooting a phylogenetic tree has not been extensively studied. In this paper, we introduce a new method of rooting a tree based on its branch length distribution; our method, which minimizes the variance of root to tip distances, is inspired by the traditional midpoint rerooting and is justified when deviations from the strict molecular clock are random. Like midpoint rerooting, the method can be implemented in a linear time algorithm. In extensive simulations that consider discordance between gene trees and the species tree, we show that the new method is more accurate than midpoint rerooting, but its relative accuracy compared to using outgroups to root gene trees depends on the size of the dataset and levels of deviations from the strict clock. We show high levels of error for all methods of rooting estimated gene trees due to factors that include effects of gene tree discordance, deviations from the clock, and gene tree estimation error. Our simulations, however, did not reveal significant differences between two equivalent methods for species tree estimation that use rooted and unrooted input, namely, STAR and NJst. Nevertheless, our results point to limitations of existing scalable rooting methods.
Collapse
Affiliation(s)
- Uyen Mai
- Dept of Computer Science and Engineering, University of California at San Diego, San Diego, CA, United States of America
| | - Erfan Sayyari
- Dept of Electrical and Computer Engineering, University of California at San Diego, San Diego, CA, United States of America
| | - Siavash Mirarab
- Dept of Electrical and Computer Engineering, University of California at San Diego, San Diego, CA, United States of America
| |
Collapse
|
11
|
Bogdanowicz D, Giaro K. Comparing Phylogenetic Trees by Matching Nodes Using the Transfer Distance Between Partitions. J Comput Biol 2017; 24:422-435. [PMID: 28177699 PMCID: PMC5421509 DOI: 10.1089/cmb.2016.0204] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Ability to quantify dissimilarity of different phylogenetic trees describing the relationship between the same group of taxa is required in various types of phylogenetic studies. For example, such metrics are used to assess the quality of phylogeny construction methods, to define optimization criteria in supertree building algorithms, or to find horizontal gene transfer (HGT) events. Among the set of metrics described so far in the literature, the most commonly used seems to be the Robinson-Foulds distance. In this article, we define a new metric for rooted trees-the Matching Pair (MP) distance. The MP metric uses the concept of the minimum-weight perfect matching in a complete bipartite graph constructed from partitions of all pairs of leaves of the compared phylogenetic trees. We analyze the properties of the MP metric and present computational experiments showing its potential applicability in tasks related to finding the HGT events.
Collapse
Affiliation(s)
- Damian Bogdanowicz
- Department of Algorithms and System Modeling, Gdansk University of Technology , Gdansk, Poland
| | - Krzysztof Giaro
- Department of Algorithms and System Modeling, Gdansk University of Technology , Gdansk, Poland
| |
Collapse
|
12
|
|
13
|
Wu Y. An algorithm for constructing parsimonious hybridization networks with multiple phylogenetic trees. J Comput Biol 2013; 20:792-804. [PMID: 24093230 PMCID: PMC3791036 DOI: 10.1089/cmb.2013.0072] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
A phylogenetic network is a model for reticulate evolution. A hybridization network is one type of phylogenetic network for a set of discordant gene trees and "displays" each gene tree. A central computational problem on hybridization networks is: given a set of gene trees, reconstruct the minimum (i.e., most parsimonious) hybridization network that displays each given gene tree. This problem is known to be NP-hard, and existing approaches for this problem are either heuristics or making simplifying assumptions (e.g., work with only two input trees or assume some topological properties). In this article, we develop an exact algorithm (called PIRNC) for inferring the minimum hybridization networks from multiple gene trees. The PIRNC algorithm does not rely on structural assumptions (e.g., the so-called galled networks). To the best of our knowledge, PIRNC is the first exact algorithm implemented for this formulation. When the number of reticulation events is relatively small (say, four or fewer), PIRNC runs reasonably efficient even for moderately large datasets. For building more complex networks, we also develop a heuristic version of PIRNC called PIRNCH. Simulation shows that PIRNCH usually produces networks with fewer reticulation events than those by an existing method. PIRNC and PIRNCH have been implemented as part of the software package called PIRN and is available online.
Collapse
Affiliation(s)
- Yufeng Wu
- Computer Science and Engineering Department, 371 Fairfield Road, Unit 2155, University of Connecticut Storrs, CT 06269
| |
Collapse
|
14
|
Affiliation(s)
- Zhi-Zhong Chen
- Division of Information System Design, Tokyo Denki University, Ishizaka, Hatoyama, Hiki, Saitama, Japan
| | - Lusheng Wang
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR, People's Republic of China
| |
Collapse
|
15
|
Kirkpatrick B, Reshef Y, Finucane H, Jiang H, Zhu B, Karp RM. Comparing pedigree graphs. J Comput Biol 2012; 19:998-1014. [PMID: 22897201 DOI: 10.1089/cmb.2011.0254] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
Pedigree graphs, or family trees, are typically constructed by an expensive process of examining genealogical records to determine which pairs of individuals are parent and child. New methods to automate this process take as input genetic data from a set of extant individuals and reconstruct ancestral individuals. There is a great need to evaluate the quality of these methods by comparing the estimated pedigree to the true pedigree. In this article, we consider two main pedigree comparison problems. The first is the pedigree isomorphism problem, for which we present a linear-time algorithm for leaf-labeled pedigrees. The second is the pedigree edit distance problem, for which we present (1) several algorithms that are fast and exact in various special cases, and (2) a general, randomized heuristic algorithm. In the negative direction, we first prove that the pedigree isomorphism problem is as hard as the general graph isomorphism problem, and that the sub-pedigree isomorphism problem is NP-hard. We then show that the pedigree edit distance problem is APX-hard in general and NP-hard on leaf-labeled pedigrees. We use simulated pedigrees to compare our edit-distance algorithms to each other as well as to a branch-and-bound algorithm that always finds an optimal solution.
Collapse
Affiliation(s)
- Bonnie Kirkpatrick
- Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA.
| | | | | | | | | | | |
Collapse
|
16
|
Chen ZZ, Wang L, Yamanaka S. A fast tool for minimum hybridization networks. BMC Bioinformatics 2012; 13:155. [PMID: 22748099 PMCID: PMC3434172 DOI: 10.1186/1471-2105-13-155] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2012] [Accepted: 06/30/2012] [Indexed: 11/02/2022] Open
Abstract
BACKGROUND Due to hybridization events in evolution, studying two different genes of a set of species may yield two related but different phylogenetic trees for the set of species. In this case, we want to combine the two phylogenetic trees into a hybridization network with the fewest hybridization events. This leads to three computational problems, namely, the problem of computing the minimum size of a hybridization network, the problem of constructing one minimum hybridization network, and the problem of enumerating a representative set of minimum hybridization networks. The previously best software tools for these problems (namely, Chen and Wang's HybridNet and Albrecht et al.'s Dendroscope 3) run very slowly for large instances that cannot be reduced to relatively small instances. Indeed, when the minimum size of a hybridization network of two given trees is larger than 23 and the problem for the trees cannot be reduced to relatively smaller independent subproblems, then HybridNet almost always takes longer than 1 day and Dendroscope 3 often fails to complete. Thus, a faster software tool for the problems is in need. RESULTS We develop a software tool in ANSI C, named FastHN, for the following problems: Computing the minimum size of a hybridization network, constructing one minimum hybridization network, and enumerating a representative set of minimum hybridization networks. We obtain FastHN by refining HybridNet with three ideas. The first idea is to preprocess the input trees so that the trees become smaller or the problem becomes to solve two or more relatively smaller independent subproblems. The second idea is to use a fast algorithm for computing the rSPR distance of two given phylognetic trees to cut more branches of the search tree in the exhaustive-search stage of the algorithm. The third idea is that during the exhaustive-search stage of the algorithm, we find two sibling leaves in one of the two forests (obtained from the given trees by cutting some edges) such that they are as far as possible in the other forest. As the result, FastHN always runs much faster than HybridNet. Unlike Dendroscope 3, FastHN is a single-threaded program. Despite this disadvantage, our experimental data shows that FastHN runs substantially faster than the multi-threaded Dendroscope 3 on a PC with multiple cores. Indeed, FastHN can finish within 16 minutes (on average on a Windows-7 (x64) desktop PC with i7-2600 CPU) even if the minimum size of a hybridization network of two given trees is about 25, the trees each have 100 leaves, and the problem for the input trees cannot be reduced to two or more independent subproblems via cluster reductions. It is also worth mentioning that like HybridNet, FastHN does not use much memory (indeed, the amount of memory is at most quadratic in the input size). In contrast, Dendroscope 3 uses a huge amount of memory. Executables of FastHN for Windows XP (x86), Windows 7 (x64), Linux, and Mac OS are available (see the Results and discussion section for details). CONCLUSIONS For both biological datasets and simulated datasets, our experimental results show that FastHN runs substantially faster than HybridNet and Dendroscope 3. The superiority of FastHN in speed over the previous tools becomes more significant as the hybridization number becomes larger. In addition, FastHN uses much less memory than Dendroscope 3 and uses the same amount of memory as HybridNet.
Collapse
Affiliation(s)
- Zhi-Zhong Chen
- Division of Information System Design, Tokyo Denki University, Hatoyama, Hiki, Saitama, Japan.
| | | | | |
Collapse
|
17
|
Chen ZZ, Wang L. Algorithms for reticulate networks of multiple phylogenetic trees. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 9:372-384. [PMID: 22025766 DOI: 10.1109/tcbb.2011.137] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
A reticulate network N of multiple phylogenetic trees may have vertices with two or more parents (called reticulation vertices). There are two ways to define the reticulation number of N. One is to define it as the number of reticulation vertices in N; in this case, a reticulate network with the smallest reticulation number is called an optimal type-I reticulate network of the trees. The other is to define it as the total number of parents of reticulation vertices in N minus the number of reticulation vertices in N; in this case, a reticulate network with the smallest reticulation number is called an optimal type-II reticulate network of the trees. In this paper, we present a fast algorithm for constructing one or all optimal type-I reticulate networks of multiple phylogenetic trees. We then use the algorithm together with other ideas to obtain an algorithm for estimating a lower bound on the reticulation number of an optimal type-II reticulate network of the input trees. To our knowledge, these are the first fast algorithms for the problems. Our experimental data shows that our algorithms can construct optimal type-I reticulate networks rapidly and can compute better lower bounds for optimal type-II reticulate networks within much shorter time than the previously best program.
Collapse
|
18
|
Leigh JW, Lapointe FJ, Lopez P, Bapteste E. Evaluating phylogenetic congruence in the post-genomic era. Genome Biol Evol 2011; 3:571-87. [PMID: 21712432 PMCID: PMC3156567 DOI: 10.1093/gbe/evr050] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/27/2011] [Indexed: 12/04/2022] Open
Abstract
Congruence is a broadly applied notion in evolutionary biology used to justify multigene phylogeny or phylogenomics, as well as in studies of coevolution, lateral gene transfer, and as evidence for common descent. Existing methods for identifying incongruence or heterogeneity using character data were designed for data sets that are both small and expected to be rarely incongruent. At the same time, methods that assess incongruence using comparison of trees test a null hypothesis of uncorrelated tree structures, which may be inappropriate for phylogenomic studies. As such, they are ill-suited for the growing number of available genome sequences, most of which are from prokaryotes and viruses, either for phylogenomic analysis or for studies of the evolutionary forces and events that have shaped these genomes. Specifically, many existing methods scale poorly with large numbers of genes, cannot accommodate high levels of incongruence, and do not adequately model patterns of missing taxa for different markers. We propose the development of novel incongruence assessment methods suitable for the analysis of the molecular evolution of the vast majority of life and support the investigation of homogeneity of evolutionary process in cases where markers do not share identical tree structures.
Collapse
Affiliation(s)
- Jessica W Leigh
- Department of Mathematics and Statistics, University of Otago, Dunedin, New Zealand.
| | | | | | | |
Collapse
|
19
|
Leigh JW, Schliep K, Lopez P, Bapteste E. Let Them Fall Where They May: Congruence Analysis in Massive Phylogenetically Messy Data Sets. Mol Biol Evol 2011; 28:2773-85. [DOI: 10.1093/molbev/msr110] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023] Open
|
20
|
Marcet-Houben M, Gabaldón T. TreeKO: a duplication-aware algorithm for the comparison of phylogenetic trees. Nucleic Acids Res 2011; 39:e66. [PMID: 21335609 PMCID: PMC3105381 DOI: 10.1093/nar/gkr087] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open
Abstract
Comparisons of tree topologies provide relevant information in evolutionary studies. Most existing methods share the drawback of requiring a complete and exact mapping of terminal nodes between the compared trees. This severely limits the scope of genome-wide analyses, since trees containing duplications are pruned arbitrarily or discarded. To overcome this, we have developed treeKO, an algorithm that enables the comparison of tree topologies, even in the presence of duplication and loss events. To do so treeKO recursively splits gene trees into pruned trees containing only orthologs to subsequently compute a distance based on the combined analyses of all pruned tree comparisons. In addition treeKO, implements the possibility of computing phylome support values, and reconciliation-based measures such as the number of inferred duplication and loss events.
Collapse
|
21
|
Kannan L, Li H, Mushegian A. A polynomial-time algorithm computing lower and upper bounds of the rooted subtree prune and regraft distance. J Comput Biol 2010; 18:743-57. [PMID: 21166560 DOI: 10.1089/cmb.2010.0045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Rooted, leaf-labeled trees are used in biology to represent hierarchical relationships of various entities, most notably the evolutionary history of molecules and organisms. Rooted Subtree Prune and Regraft (rSPR) operation is a tree rearrangement operation that is used to transform a tree into another tree that has the same set of leaf labels. The minimum number of rSPR operations that transform one tree into another is denoted by d(rSPR) and gives a measure of dissimilarity between the trees, which can be used to compare trees obtained by different approaches, or, in the context of phylogenetic analysis, to detect horizontal gene transfer events by finding incongruences between trees of different evolving characters. The problem of computing the exact d(rSPR) measure is NP-hard, and most algorithms resort to finding sequences of rSPR operations that are sufficient for transforming one tree into another, thereby giving upper bound heuristics for the distance. In this article, we present an O(n⁴) recursive algorithm D-Clust that gives both lower bound and upper bound heuristics for the distance between trees with n shared leaves and also gives a sequence of operations that transforms one tree into another. Our experiments on simulated pairs of trees containing up to 100 leaves showed that the two bounds are almost equal for small distances, thereby giving the nearly-precise actual value, and that the upper bound tends to be close to the upper bounds given by other approaches for all pairs of trees.
Collapse
Affiliation(s)
- Lavanya Kannan
- Bioinformatics Center, Stowers Institute for Medical Research, Kansas City, Missouri, USA.
| | | | | |
Collapse
|
22
|
Wu Y. Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees. ACTA ACUST UNITED AC 2010; 26:i140-8. [PMID: 20529899 PMCID: PMC2881383 DOI: 10.1093/bioinformatics/btq198] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Reticulate network is a model for displaying and quantifying the effects of complex reticulate processes on the evolutionary history of species undergoing reticulate evolution. A central computational problem on reticulate networks is: given a set of phylogenetic trees (each for some region of the genomes), reconstruct the most parsimonious reticulate network (called the minimum reticulate network) that combines the topological information contained in the given trees. This problem is well-known to be NP-hard. Thus, existing approaches for this problem either work with only two input trees or make simplifying topological assumptions. RESULTS We present novel results on the minimum reticulate network problem. Unlike existing approaches, we address the fully general problem: there is no restriction on the number of trees that are input, and there is no restriction on the form of the allowed reticulate network. We present lower and upper bounds on the minimum number of reticulation events in the minimum reticulate network (and infer an approximately parsimonious reticulate network). A program called PIRN implements these methods, which also outputs a graphical representation of the inferred network. Empirical results on simulated and biological data show that our methods are practical for a wide range of data. More importantly, the lower and upper bounds match for many datasets (especially when the number of trees is small or reticulation level is low), and this allows us to solve the minimum reticulate network problem exactly for these datasets. AVAILABILITY A software tool, PIRN, is available for download from the web page: http://www.engr.uconn.edu/~ywu. SUPPLEMENTARY INFORMATION Supplementary data is available at Bioinformatics online.
Collapse
Affiliation(s)
- Yufeng Wu
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA.
| |
Collapse
|
23
|
|
24
|
Matsen FA. constNJ: An Algorithm to Reconstruct Sets of Phylogenetic Trees Satisfying Pairwise Topological Constraints. J Comput Biol 2010; 17:799-818. [DOI: 10.1089/cmb.2009.0201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Affiliation(s)
- Frederick A. Matsen
- Program in Computational Biology, Fred Hutchinson Cancer Research Center 1100, Seattle, Washington, USA
| |
Collapse
|
25
|
Hill T, Nordström KJV, Thollesson M, Säfström TM, Vernersson AKE, Fredriksson R, Schiöth HB. SPRIT: Identifying horizontal gene transfer in rooted phylogenetic trees. BMC Evol Biol 2010; 10:42. [PMID: 20152048 PMCID: PMC2829038 DOI: 10.1186/1471-2148-10-42] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2009] [Accepted: 02/13/2010] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Phylogenetic trees based on sequences from a set of taxa can be incongruent due to horizontal gene transfer (HGT). By identifying the HGT events, we can reconcile the gene trees and derive a taxon tree that adequately represents the species' evolutionary history. One HGT can be represented by a rooted Subtree Prune and Regraft (RSPR) operation and the number of RSPRs separating two trees corresponds to the minimum number of HGT events. Identifying the minimum number of RSPRs separating two trees is NP-hard, but the problem can be reduced to fixed parameter tractable. A number of heuristic and two exact approaches to identifying the minimum number of RSPRs have been proposed. This is the first implementation delivering an exact solution as well as the intermediate trees connecting the input trees. RESULTS We present the SPR Identification Tool (SPRIT), a novel algorithm that solves the fixed parameter tractable minimum RSPR problem and its GPL licensed Java implementation. The algorithm can be used in two ways, exhaustive search that guarantees the minimum RSPR distance and a heuristic approach that guarantees finding a solution, but not necessarily the minimum one. We benchmarked SPRIT against other software in two different settings, small to medium sized trees i.e. five to one hundred taxa and large trees i.e. thousands of taxa. In the small to medium tree size setting with random artificial incongruence, SPRIT's heuristic mode outperforms the other software by always delivering a solution with a low overestimation of the RSPR distance. In the large tree setting SPRIT compares well to the alternatives when benchmarked on finding a minimum solution within a reasonable time. SPRIT presents both the minimum RSPR distance and the intermediate trees. CONCLUSIONS When used in exhaustive search mode, SPRIT identifies the minimum number of RSPRs needed to reconcile two incongruent rooted trees. SPRIT also performs quick approximations of the minimum RSPR distance, which are comparable to, and often better than, purely heuristic solutions. Put together, SPRIT is an excellent tool for identification of HGT events and pinpointing which taxa have been involved in HGT.
Collapse
Affiliation(s)
- Tobias Hill
- Department of Neuroscience, Biomedical Centre, Uppsala University, Box 593, SE-751 24 Uppsala, Sweden
| | - Karl JV Nordström
- Department of Neuroscience, Biomedical Centre, Uppsala University, Box 593, SE-751 24 Uppsala, Sweden
| | - Mikael Thollesson
- Department of Evolution, Genomics and Systematics, Uppsala University, Norbyvägen 18C, SE-752 36 Uppsala, Sweden
| | - Tommy M Säfström
- Department of Neuroscience, Biomedical Centre, Uppsala University, Box 593, SE-751 24 Uppsala, Sweden
| | - Andreas KE Vernersson
- Department of Neuroscience, Biomedical Centre, Uppsala University, Box 593, SE-751 24 Uppsala, Sweden
| | - Robert Fredriksson
- Department of Neuroscience, Biomedical Centre, Uppsala University, Box 593, SE-751 24 Uppsala, Sweden
| | - Helgi B Schiöth
- Department of Neuroscience, Biomedical Centre, Uppsala University, Box 593, SE-751 24 Uppsala, Sweden
| |
Collapse
|
26
|
Fast Computation of the Exact Hybridization Number of Two Phylogenetic Trees. BIOINFORMATICS RESEARCH AND APPLICATIONS 2010. [DOI: 10.1007/978-3-642-13078-6_23] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
|
27
|
Whidden C, Beiko RG, Zeh N. Fast FPT Algorithms for Computing Rooted Agreement Forests: Theory and Experiments. EXPERIMENTAL ALGORITHMS 2010. [DOI: 10.1007/978-3-642-13193-6_13] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
|