1
|
Wu Y, Zhang L. Computing the Bounds of the Number of Reticulations in a Tree-Child Network That Displays a Set of Trees. J Comput Biol 2024; 31:345-359. [PMID: 38285528 DOI: 10.1089/cmb.2023.0309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2024] Open
Abstract
Phylogenetic network is an evolutionary model that uses a rooted directed acyclic graph (instead of a tree) to model an evolutionary history of species in which reticulate events (e.g., hybrid speciation or horizontal gene transfer) occurred. Tree-child network is a kind of phylogenetic network with structural constraints. Existing approaches for tree-child network reconstruction can be slow for large data. In this study, we present several computational approaches for bounding from below the number of reticulations in a tree-child network that displays a given set of rooted binary phylogenetic trees. In addition, we also present some theoretical results on bounding from above the number of reticulations. Through simulation, we demonstrate that the new lower bounds on the reticulation number for tree-child networks can practically be computed for large tree data. The bounds can provide estimates of reticulation for relatively large data.
Collapse
Affiliation(s)
- Yufeng Wu
- Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut, USA
| | - Louxin Zhang
- Department of Mathematics, Center for Data Science and Machine Learning, National University of Singapore, Singapore, Singapore
| |
Collapse
|
2
|
Shi F, Li H, Rong G, Zhang Z, Wang J. Improved Fixed-Parameter Algorithm for the Tree Containment Problem on Unrooted Phylogenetic Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3539-3552. [PMID: 34506290 DOI: 10.1109/tcbb.2021.3111660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Phylogenetic trees are unable to represent the evolutionary process for a collection of species if reticulation events happened, and a generalized model named phylogenetic network was introduced consequently. However, the representation of the evolutionary process for one gene is actually a phylogenetic tree that is "contained" in the phylogenetic network for the considered species containing the gene. Thus a fundamental computational problem named Tree Containment problem arises, which asks whether a phylogenetic tree is contained in a phylogenetic network. The previous research on the problem mainly focused on its rooted version of which the considered tree and network are rooted, and several algorithms were proposed when the considered network is binary or structure-restricted. There is almost no algorithm for its unrooted version except the recent fixed-parameter algorithm with runtime O(4kn2), where k and n are the reticulation number and size of the considered unrooted binary phylogenetic network N, respectively. As the runtime is a little expensive when considering big values of k, we aim to improve it and successfully propose a fixed-parameter algorithm with runtime O(2.594kn2) in the paper. Additionally, we experimentally show its effectiveness on biological data and simulated data.
Collapse
|
3
|
Pons M, Batle J. Combinatorial characterization of a certain class of words and a conjectured connection with general subclasses of phylogenetic tree-child networks. Sci Rep 2021; 11:21875. [PMID: 34750409 PMCID: PMC8575882 DOI: 10.1038/s41598-021-01166-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2021] [Accepted: 10/22/2021] [Indexed: 11/09/2022] Open
Abstract
The combinatorial study of phylogenetic networks has attracted much attention in recent times. In particular, one class of them, the so-called tree-child networks, are becoming the most prominent ones. However, their combinatorial properties are largely unknown. In this paper we address the problem of exactly counting them. We conjecture a relationship with the cardinality of a certain class of words. By solving the counting problem for the words, and on the basis of the conjecture, several simple recurrence formulas for general cases arise. Moreover, a precise asymptotic analysis is provided. Our results coincide with all current formulas in the literature for particular subclasses of tree-child networks, as well as with numerical results obtained for small networks. We expect that the study of the relationship between the newly defined words and the networks will lead to further combinatoric characterizations of this class of phylogenetic networks.
Collapse
Affiliation(s)
- Miquel Pons
- Departament de Física, Universitat de les Illes Balears, 07122, Palma de Mallorca, Balearic Islands, Spain.
| | - Josep Batle
- Departament de Física, Universitat de les Illes Balears, 07122, Palma de Mallorca, Balearic Islands, Spain
| |
Collapse
|
4
|
Yamada K, Chen ZZ, Wang L. Improved Practical Algorithms for Rooted Subtree Prune and Regraft (rSPR) Distance and Hybridization Number. J Comput Biol 2020; 27:1422-1432. [PMID: 32048865 DOI: 10.1089/cmb.2019.0432] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The problem of computing the rooted subtree prune and regraft (rSPR) distance of two phylogenetic trees is computationally hard and so is the problem of computing the hybridization number of two phylogenetic trees (denoted by Hybridization Number Computation [HNC]). Since they are important problems in phylogenetics, they have been studied extensively in the literature. Indeed, quite a number of exact or approximation algorithms have been designed and implemented for them. In this article, we design and implement several approximation algorithms for them and one exact algorithm for HNC. Our experimental results show that the resulting exact program is much faster (namely, more than 80 times faster for the easiest dataset used in the experiments) than the previous best and its superiority in speed becomes even more significant for more difficult instances. Moreover, the resulting approximation program's output has much better results than the previous bests; indeed, the outputs are always nearly optimal and often optimal. Of particular interest is the usage of the Monte Carlo tree search (MCTS) method in the design of our approximation algorithms. Our experimental results show that with MCTS, we can often solve HNC exactly within short time.
Collapse
Affiliation(s)
- Kohei Yamada
- Division of Information System Design, Tokyo Denki University, Hatoyama, Saitama 350-0394, Japan
| | - Zhi-Zhong Chen
- Division of Information System Design, Tokyo Denki University, Hatoyama, Saitama 350-0394, Japan
| | - Lusheng Wang
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
| |
Collapse
|
5
|
Abstract
Bioinformatics regularly poses new challenges to algorithm engineers and theoretical computer scientists. This work surveys recent developments of parameterized algorithms and complexity for important NP-hard problems in bioinformatics. We cover sequence assembly and analysis, genome comparison and completion, and haplotyping and phylogenetics. Aside from reporting the state of the art, we give challenges and open problems for each topic.
Collapse
|
6
|
Oxelman B, Brysting AK, Jones GR, Marcussen T, Oberprieler C, Pfeil BE. Phylogenetics of Allopolyploids. ANNUAL REVIEW OF ECOLOGY EVOLUTION AND SYSTEMATICS 2017. [DOI: 10.1146/annurev-ecolsys-110316-022729] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Affiliation(s)
- Bengt Oxelman
- Gothenburg Global Biodiversity Centre, Department of Biology and Environmental Sciences, University of Gothenburg, SE405 30 Göteborg, Sweden
| | - Anne Krag Brysting
- Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway
| | - Graham R. Jones
- Gothenburg Global Biodiversity Centre, Department of Biology and Environmental Sciences, University of Gothenburg, SE405 30 Göteborg, Sweden
| | - Thomas Marcussen
- Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway
| | - Christoph Oberprieler
- Evolutionary and Systematic Botany Group, Institute of Plant Sciences, University of Regensburg, D-93053 Regensburg, Germany
| | - Bernard E. Pfeil
- Gothenburg Global Biodiversity Centre, Department of Biology and Environmental Sciences, University of Gothenburg, SE405 30 Göteborg, Sweden
| |
Collapse
|
7
|
Mirzaei S, Wu Y. Fast Construction of Near Parsimonious Hybridization Networks for Multiple Phylogenetic Trees. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:565-570. [PMID: 27295640 DOI: 10.1109/tcbb.2015.2462336] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Hybridization networks represent plausible evolutionary histories of species that are affected by reticulate evolutionary processes. An established computational problem on hybridization networks is constructing the most parsimonious hybridization network such that each of the given phylogenetic trees (called gene trees) is "displayed" in the network. There have been several previous approaches, including an exact method and several heuristics, for this NP-hard problem. However, the exact method is only applicable to a limited range of data, and heuristic methods can be less accurate and also slow sometimes. In this paper, we develop a new algorithm for constructing near parsimonious networks for multiple binary gene trees. This method is more efficient for large numbers of gene trees than previous heuristics. This new method also produces more parsimonious results on many simulated datasets as well as a real biological dataset than a previous method. We also show that our method produces topologically more accurate networks for many datasets.
Collapse
|
8
|
Marcussen T, Heier L, Brysting AK, Oxelman B, Jakobsen KS. From gene trees to a dated allopolyploid network: insights from the angiosperm genus Viola (Violaceae). Syst Biol 2014; 64:84-101. [PMID: 25281848 PMCID: PMC4265142 DOI: 10.1093/sysbio/syu071] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022] Open
Abstract
Allopolyploidization accounts for a significant fraction of speciation events in many eukaryotic lineages. However, existing phylogenetic and dating methods require tree-like topologies and are unable to handle the network-like phylogenetic relationships of lineages containing allopolyploids. No explicit framework has so far been established for evaluating competing network topologies, and few attempts have been made to date phylogenetic networks. We used a four-step approach to generate a dated polyploid species network for the cosmopolitan angiosperm genus Viola L. (Violaceae Batch.). The genus contains ca 600 species and both recent (neo-) and more ancient (meso-) polyploid lineages distributed over 16 sections. First, we obtained DNA sequences of three low-copy nuclear genes and one chloroplast region, from 42 species representing all 16 sections. Second, we obtained fossil-calibrated chronograms for each nuclear gene marker. Third, we determined the most parsimonious multilabeled genome tree and its corresponding network, resolved at the section (not the species) level. Reconstructing the "correct" network for a set of polyploids depends on recovering all homoeologs, i.e., all subgenomes, in these polyploids. Assuming the presence of Viola subgenome lineages that were not detected by the nuclear gene phylogenies ("ghost subgenome lineages") significantly reduced the number of inferred polyploidization events. We identified the most parsimonious network topology from a set of five competing scenarios differing in the interpretation of homoeolog extinctions and lineage sorting, based on (i) fewest possible ghost subgenome lineages, (ii) fewest possible polyploidization events, and (iii) least possible deviation from expected ploidy as inferred from available chromosome counts of the involved polyploid taxa. Finally, we estimated the homoploid and polyploid speciation times of the most parsimonious network. Homoploid speciation times were estimated by coalescent analysis of gene tree node ages. Polyploid speciation times were estimated by comparing branch lengths and speciation rates of lineages with and without ploidy shifts. Our analyses recognize Viola as an old genus (crown age 31 Ma) whose evolutionary history has been profoundly affected by allopolyploidy. Between 16 and 21 allopolyploidizations are necessary to explain the diversification of the 16 major lineages (sections) of Viola, suggesting that allopolyploidy has accounted for a high percentage-between 67% and 88%-of the speciation events at this level. The theoretical and methodological approaches presented here for (i) constructing networks and (ii) dating speciation events within a network, have general applicability for phylogenetic studies of groups where allopolyploidization has occurred. They make explicit use of a hitherto underexplored source of ploidy information from chromosome counts to help resolve phylogenetic cases where incomplete sequence data hampers network inference. Importantly, the coalescent-based method used herein circumvents the assumption of tree-like evolution required by most techniques for dating speciation events.
Collapse
Affiliation(s)
- Thomas Marcussen
- Department of Biosciences, Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo, PO Box 1066 Blindern, NO-0316 Oslo, Norway and Department of Biological and Environmental Sciences, University of Gothenburg, PO Box 461, 405 30 Gothenburg, Sweden Department of Biosciences, Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo, PO Box 1066 Blindern, NO-0316 Oslo, Norway and Department of Biological and Environmental Sciences, University of Gothenburg, PO Box 461, 405 30 Gothenburg, Sweden
| | - Lise Heier
- Department of Biosciences, Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo, PO Box 1066 Blindern, NO-0316 Oslo, Norway and Department of Biological and Environmental Sciences, University of Gothenburg, PO Box 461, 405 30 Gothenburg, Sweden
| | - Anne K Brysting
- Department of Biosciences, Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo, PO Box 1066 Blindern, NO-0316 Oslo, Norway and Department of Biological and Environmental Sciences, University of Gothenburg, PO Box 461, 405 30 Gothenburg, Sweden
| | - Bengt Oxelman
- Department of Biosciences, Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo, PO Box 1066 Blindern, NO-0316 Oslo, Norway and Department of Biological and Environmental Sciences, University of Gothenburg, PO Box 461, 405 30 Gothenburg, Sweden
| | - Kjetill S Jakobsen
- Department of Biosciences, Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo, PO Box 1066 Blindern, NO-0316 Oslo, Norway and Department of Biological and Environmental Sciences, University of Gothenburg, PO Box 461, 405 30 Gothenburg, Sweden
| |
Collapse
|
9
|
Iersel LV, Kelk S, Lekić N, Scornavacca C. A practical approximation algorithm for solving massive instances of hybridization number for binary and nonbinary trees. BMC Bioinformatics 2014; 15:127. [PMID: 24884964 PMCID: PMC4023542 DOI: 10.1186/1471-2105-15-127] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2013] [Accepted: 04/24/2014] [Indexed: 12/01/2022] Open
Abstract
BACKGROUND Reticulate events play an important role in determining evolutionary relationships. The problem of computing the minimum number of such events to explain discordance between two phylogenetic trees is a hard computational problem. Even for binary trees, exact solvers struggle to solve instances with reticulation number larger than 40-50. RESULTS Here we present CycleKiller and NonbinaryCycleKiller, the first methods to produce solutions verifiably close to optimality for instances with hundreds or even thousands of reticulations. CONCLUSIONS Using simulations, we demonstrate that these algorithms run quickly for large and difficult instances, producing solutions that are very close to optimality. As a spin-off from our simulations we also present TerminusEst, which is the fastest exact method currently available that can handle nonbinary trees: this is used to measure the accuracy of the NonbinaryCycleKiller algorithm. All three methods are based on extensions of previous theoretical work (SIDMA 26(4):1635-1656, TCBB 10(1):18-25, SIDMA 28(1):49-66) and are publicly available. We also apply our methods to real data.
Collapse
Affiliation(s)
- Leo van Iersel
- Centrum Wiskunde & Informatica (CWI), P.O. Box 94079, 1090 GB, Amsterdam, The Netherlands
| | - Steven Kelk
- Department of Knowledge Engineering (DKE), Maastricht University, P.O. Box 616, 6200 MD, Maastricht, The Netherlands
| | - Nela Lekić
- Department of Knowledge Engineering (DKE), Maastricht University, P.O. Box 616, 6200 MD, Maastricht, The Netherlands
| | - Celine Scornavacca
- ISEM, CNRS – Université Montpellier II, Place Eugène Bataillon, 34095 Montpellier, France
| |
Collapse
|
10
|
Wu Y. An algorithm for constructing parsimonious hybridization networks with multiple phylogenetic trees. J Comput Biol 2013; 20:792-804. [PMID: 24093230 PMCID: PMC3791036 DOI: 10.1089/cmb.2013.0072] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
A phylogenetic network is a model for reticulate evolution. A hybridization network is one type of phylogenetic network for a set of discordant gene trees and "displays" each gene tree. A central computational problem on hybridization networks is: given a set of gene trees, reconstruct the minimum (i.e., most parsimonious) hybridization network that displays each given gene tree. This problem is known to be NP-hard, and existing approaches for this problem are either heuristics or making simplifying assumptions (e.g., work with only two input trees or assume some topological properties). In this article, we develop an exact algorithm (called PIRNC) for inferring the minimum hybridization networks from multiple gene trees. The PIRNC algorithm does not rely on structural assumptions (e.g., the so-called galled networks). To the best of our knowledge, PIRNC is the first exact algorithm implemented for this formulation. When the number of reticulation events is relatively small (say, four or fewer), PIRNC runs reasonably efficient even for moderately large datasets. For building more complex networks, we also develop a heuristic version of PIRNC called PIRNCH. Simulation shows that PIRNCH usually produces networks with fewer reticulation events than those by an existing method. PIRNC and PIRNCH have been implemented as part of the software package called PIRN and is available online.
Collapse
Affiliation(s)
- Yufeng Wu
- Computer Science and Engineering Department, 371 Fairfield Road, Unit 2155, University of Connecticut Storrs, CT 06269
| |
Collapse
|
11
|
Cherry picking: a characterization of the temporal hybridization number for a set of phylogenies. Bull Math Biol 2013; 75:1879-90. [PMID: 23925727 DOI: 10.1007/s11538-013-9874-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2013] [Accepted: 07/04/2013] [Indexed: 10/26/2022]
Abstract
Recently, we have shown that calculating the minimum-temporal-hybridization number for a set [Formula: see text] of rooted binary phylogenetic trees is NP-hard and have characterized this minimum number when [Formula: see text] consists of exactly two trees. In this paper, we give the first characterization of the problem for [Formula: see text] being arbitrarily large. The characterization is in terms of cherries and the existence of a particular type of sequence. Furthermore, in an online appendix to the paper, we show that this new characterization can be used to show that computing the minimum-temporal hybridization number for two trees is fixed-parameter tractable.
Collapse
|
12
|
van Iersel L, Linz S. A quadratic kernel for computing the hybridization number of multiple trees. INFORM PROCESS LETT 2013. [DOI: 10.1016/j.ipl.2013.02.010] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
13
|
Jones G, Sagitov S, Oxelman B. Statistical inference of allopolyploid species networks in the presence of incomplete lineage sorting. Syst Biol 2013; 62:467-78. [PMID: 23427289 DOI: 10.1093/sysbio/syt012] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open
Abstract
Polyploidy is an important speciation mechanism, particularly in land plants. Allopolyploid species are formed after hybridization between otherwise intersterile parental species. Recent theoretical progress has led to successful implementation of species tree models that take population genetic parameters into account. However, these models have not included allopolyploid hybridization and the special problems imposed when species trees of allopolyploids are inferred. Here, 2 new models for the statistical inference of the evolutionary history of allopolyploids are evaluated using simulations and demonstrated on 2 empirical data sets. It is assumed that there has been a single hybridization event between 2 diploid species resulting in a genomic allotetraploid. The evolutionary history can be represented as a species network or as a multilabeled species tree, in which some pairs of tips are labeled with the same species. In one of the models (AlloppMUL), the multilabeled species tree is inferred directly. This is the simplest model and the most widely applicable, since fewer assumptions are made. The second model (AlloppNET) incorporates the hybridization event explicitly which means that fewer parameters need to be estimated. Both models are implemented in the BEAST framework. Simulations show that both models are useful and that AlloppNET is more accurate if the assumptions it is based on are valid. The models are demonstrated on previously analyzed data from the genera Pachycladon (Brassicaceae) and Silene (Caryophyllaceae).
Collapse
Affiliation(s)
- Graham Jones
- Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden.
| | | | | |
Collapse
|
14
|
Affiliation(s)
- Zhi-Zhong Chen
- Division of Information System Design, Tokyo Denki University, Ishizaka, Hatoyama, Hiki, Saitama, Japan
| | - Lusheng Wang
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR, People's Republic of China
| |
Collapse
|
15
|
Cai Z, Eulenstein O, Janies D, Schwartz D. Reconstructing k-Reticulated Phylogenetic Network from a Set of Gene Trees. BIOINFORMATICS RESEARCH AND APPLICATIONS 2013. [PMCID: PMC7122431 DOI: 10.1007/978-3-642-38036-5_14] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
The time complexity of existing algorithms for reconstructing a level-x phylogenetic network increases exponentially in x. In this paper, we propose a new classification of phylogenetic networks called k-reticulated network. A k-reticulated network can model all level-k networks and some level-x networks with x > k. We design algorithms for reconstructing k-reticulated network (k = 1 or 2) with minimum number of hybrid nodes from a set of m binary trees, each with n leaves in O(mn2) time. The implication is that some level-x networks with x > k can now be reconstructed in a faster way. We implemented our algorithm (ARTNET) and compared it with CMPT. We show that ARTNET outperforms CMPT in terms of running time and accuracy. We also consider the case when there does not exist a 2-reticulated network for the input trees. We present an algorithm computing a maximum subset of the species set so that a new set of subtrees can be combined into a 2-reticulated network.
Collapse
Affiliation(s)
- Zhipeng Cai
- Computer Science, Georgia State University, 34 Peachtree Street, Suite 1410, 30303 Atlanta, GA USA
| | | | - Daniel Janies
- Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, Suite 315, 28223 Charlotte, NC USA
| | - Daniel Schwartz
- Physiology and Neurobiology, University of Connecticut, 75 North Eagleville Road, Unit 3156, 06269 Storrs, CT USA
| |
Collapse
|
16
|
Chen ZZ, Wang L, Yamanaka S. A fast tool for minimum hybridization networks. BMC Bioinformatics 2012; 13:155. [PMID: 22748099 PMCID: PMC3434172 DOI: 10.1186/1471-2105-13-155] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2012] [Accepted: 06/30/2012] [Indexed: 11/02/2022] Open
Abstract
BACKGROUND Due to hybridization events in evolution, studying two different genes of a set of species may yield two related but different phylogenetic trees for the set of species. In this case, we want to combine the two phylogenetic trees into a hybridization network with the fewest hybridization events. This leads to three computational problems, namely, the problem of computing the minimum size of a hybridization network, the problem of constructing one minimum hybridization network, and the problem of enumerating a representative set of minimum hybridization networks. The previously best software tools for these problems (namely, Chen and Wang's HybridNet and Albrecht et al.'s Dendroscope 3) run very slowly for large instances that cannot be reduced to relatively small instances. Indeed, when the minimum size of a hybridization network of two given trees is larger than 23 and the problem for the trees cannot be reduced to relatively smaller independent subproblems, then HybridNet almost always takes longer than 1 day and Dendroscope 3 often fails to complete. Thus, a faster software tool for the problems is in need. RESULTS We develop a software tool in ANSI C, named FastHN, for the following problems: Computing the minimum size of a hybridization network, constructing one minimum hybridization network, and enumerating a representative set of minimum hybridization networks. We obtain FastHN by refining HybridNet with three ideas. The first idea is to preprocess the input trees so that the trees become smaller or the problem becomes to solve two or more relatively smaller independent subproblems. The second idea is to use a fast algorithm for computing the rSPR distance of two given phylognetic trees to cut more branches of the search tree in the exhaustive-search stage of the algorithm. The third idea is that during the exhaustive-search stage of the algorithm, we find two sibling leaves in one of the two forests (obtained from the given trees by cutting some edges) such that they are as far as possible in the other forest. As the result, FastHN always runs much faster than HybridNet. Unlike Dendroscope 3, FastHN is a single-threaded program. Despite this disadvantage, our experimental data shows that FastHN runs substantially faster than the multi-threaded Dendroscope 3 on a PC with multiple cores. Indeed, FastHN can finish within 16 minutes (on average on a Windows-7 (x64) desktop PC with i7-2600 CPU) even if the minimum size of a hybridization network of two given trees is about 25, the trees each have 100 leaves, and the problem for the input trees cannot be reduced to two or more independent subproblems via cluster reductions. It is also worth mentioning that like HybridNet, FastHN does not use much memory (indeed, the amount of memory is at most quadratic in the input size). In contrast, Dendroscope 3 uses a huge amount of memory. Executables of FastHN for Windows XP (x86), Windows 7 (x64), Linux, and Mac OS are available (see the Results and discussion section for details). CONCLUSIONS For both biological datasets and simulated datasets, our experimental results show that FastHN runs substantially faster than HybridNet and Dendroscope 3. The superiority of FastHN in speed over the previous tools becomes more significant as the hybridization number becomes larger. In addition, FastHN uses much less memory than Dendroscope 3 and uses the same amount of memory as HybridNet.
Collapse
Affiliation(s)
- Zhi-Zhong Chen
- Division of Information System Design, Tokyo Denki University, Hatoyama, Hiki, Saitama, Japan.
| | | | | |
Collapse
|