1
|
Roddur MS, Snir S, El-Kebir M. Enforcing Temporal Consistency in Migration History Inference. J Comput Biol 2024; 31:396-415. [PMID: 38754138 DOI: 10.1089/cmb.2023.0352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/18/2024] Open
Abstract
In addition to undergoing evolution, members of biological populations may also migrate between locations. Examples include the spread of tumor cells from the primary tumor to distant metastases or the spread of pathogens from one host to another. One may represent migration histories by assigning a location label to each vertex of a given phylogenetic tree such that an edge connecting vertices with distinct locations represents a migration. Some biological populations undergo comigration, a phenomenon where multiple taxa from distinct lineages simultaneously comigrate from one location to another. In this work, we show that a previous problem statement for inferring migration histories that are parsimonious in terms of migrations and comigrations may lead to temporally inconsistent solutions. To remedy this deficiency, we introduce precise definitions of temporal consistency of comigrations in a phylogenetic tree, leading to three successive problems. First, we formulate the temporally consistent comigration problem to check if a set of comigrations is temporally consistent and provide a linear time algorithm for solving this problem. Second, we formulate the parsimonious consistent comigrations (PCC) problem, which aims to find comigrations given a location labeling of a phylogenetic tree. We show that PCC is NP-hard. Third, we formulate the parsimonious consistent comigration history (PCCH) problem, which infers the migration history given a phylogenetic tree and locations of its extant vertices only. We show that PCCH is NP-hard as well. On the positive side, we propose integer linear programming models to solve the PCC and PCCH problems. We demonstrate our algorithms on simulated and real data.
Collapse
Affiliation(s)
- Mrinmoy Saha Roddur
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
| | - Sagi Snir
- Department of Evolutionary Biology, University of Haifa, Haifa, Israel
| | - Mohammed El-Kebir
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Cancer Center at Illinois, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
| |
Collapse
|
2
|
Schaller D, Hartmann T, Lafond M, Stadler PF, Wieseke N, Hellmuth M. Relative timing information and orthology in evolutionary scenarios. Algorithms Mol Biol 2023; 18:16. [PMID: 37940998 PMCID: PMC10634191 DOI: 10.1186/s13015-023-00240-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2023] [Accepted: 08/23/2023] [Indexed: 11/10/2023] Open
Abstract
BACKGROUND Evolutionary scenarios describing the evolution of a family of genes within a collection of species comprise the mapping of the vertices of a gene tree T to vertices and edges of a species tree S. The relative timing of the last common ancestors of two extant genes (leaves of T) and the last common ancestors of the two species (leaves of S) in which they reside is indicative of horizontal gene transfers (HGT) and ancient duplications. Orthologous gene pairs, on the other hand, require that their last common ancestors coincides with a corresponding speciation event. The relative timing information of gene and species divergences is captured by three colored graphs that have the extant genes as vertices and the species in which the genes are found as vertex colors: the equal-divergence-time (EDT) graph, the later-divergence-time (LDT) graph and the prior-divergence-time (PDT) graph, which together form an edge partition of the complete graph. RESULTS Here we give a complete characterization in terms of informative and forbidden triples that can be read off the three graphs and provide a polynomial time algorithm for constructing an evolutionary scenario that explains the graphs, provided such a scenario exists. While both LDT and PDT graphs are cographs, this is not true for the EDT graph in general. We show that every EDT graph is perfect. While the information about LDT and PDT graphs is necessary to recognize EDT graphs in polynomial-time for general scenarios, this extra information can be dropped in the HGT-free case. However, recognition of EDT graphs without knowledge of putative LDT and PDT graphs is NP-complete for general scenarios. In contrast, PDT graphs can be recognized in polynomial-time. We finally connect the EDT graph to the alternative definitions of orthology that have been proposed for scenarios with horizontal gene transfer. With one exception, the corresponding graphs are shown to be colored cographs.
Collapse
Affiliation(s)
- David Schaller
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, Leipzig, 04107, Germany
| | - Tom Hartmann
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, Leipzig, 04107, Germany
| | - Manuel Lafond
- Department of Computer Science, Université de Sherbrooke, 2500 boul. de l'Université, Sherbrooke, J1K 2R1, Canada
| | - Peter F Stadler
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, Leipzig, 04107, Germany
- Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Interdisciplinary Center for Bioinformatics, German Centre for Integrative Biodiversity Research (iDiv), and Leipzig Research Center for Civilization Diseases, Universität Leipzig, Augustusplatz 12, Leipzig, 04107, Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, 04109, Germany
- Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, Vienna, 1090, Austria
- Facultad de Ciencias, Universidad National de Colombia, Sede Bogotá, Ciudad Universitaria, Bogotá, 111321, DC, Colombia
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM87501, USA
| | - Nicolas Wieseke
- Swarm Intelligence and Complex Systems Group, Faculty of Mathematics and Computer Science, Leipzig University, Augustusplatz 10, Leipzig, 04109, Germany
| | - Marc Hellmuth
- Department of Mathematics, Faculty of Science, Stockholm University, Stockholm, 10691, Sweden.
| |
Collapse
|
3
|
Indirect identification of horizontal gene transfer. J Math Biol 2021; 83:10. [PMID: 34218334 PMCID: PMC8254804 DOI: 10.1007/s00285-021-01631-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2020] [Revised: 04/06/2021] [Accepted: 06/13/2021] [Indexed: 12/04/2022]
Abstract
Several implicit methods to infer horizontal gene transfer (HGT) focus on pairs of genes that have diverged only after the divergence of the two species in which the genes reside. This situation defines the edge set of a graph, the later-divergence-time (LDT) graph, whose vertices correspond to genes colored by their species. We investigate these graphs in the setting of relaxed scenarios, i.e., evolutionary scenarios that encompass all commonly used variants of duplication-transfer-loss scenarios in the literature. We characterize LDT graphs as a subclass of properly vertex-colored cographs, and provide a polynomial-time recognition algorithm as well as an algorithm to construct a relaxed scenario that explains a given LDT. An edge in an LDT graph implies that the two corresponding genes are separated by at least one HGT event. The converse is not true, however. We show that the complete xenology relation is described by an rs-Fitch graph, i.e., a complete multipartite graph satisfying constraints on the vertex coloring. This class of vertex-colored graphs is also recognizable in polynomial time. We finally address the question “how much information about all HGT events is contained in LDT graphs” with the help of simulations of evolutionary scenarios with a wide range of duplication, loss, and HGT events. In particular, we show that a simple greedy graph editing scheme can be used to efficiently detect HGT events that are implicitly contained in LDT graphs.
Collapse
|
4
|
Schaller D, Geiß M, Stadler PF, Hellmuth M. Complete Characterization of Incorrect Orthology Assignments in Best Match Graphs. J Math Biol 2021; 82:20. [PMID: 33606106 PMCID: PMC7894253 DOI: 10.1007/s00285-021-01564-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2020] [Revised: 09/23/2020] [Accepted: 12/21/2020] [Indexed: 02/06/2023]
Abstract
Genome-scale orthology assignments are usually based on reciprocal best matches. In the absence of horizontal gene transfer (HGT), every pair of orthologs forms a reciprocal best match. Incorrect orthology assignments therefore are always false positives in the reciprocal best match graph. We consider duplication/loss scenarios and characterize unambiguous false-positive (u-fp) orthology assignments, that is, edges in the best match graphs (BMGs) that cannot correspond to orthologs for any gene tree that explains the BMG. Moreover, we provide a polynomial-time algorithm to identify all u-fp orthology assignments in a BMG. Simulations show that at least [Formula: see text] of all incorrect orthology assignments can be detected in this manner. All results rely only on the structure of the BMGs and not on any a priori knowledge about underlying gene or species trees.
Collapse
Affiliation(s)
- David Schaller
- Max-Planck-Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany
| | - Manuela Geiß
- Software Competence Center Hagenberg GmbH, Softwarepark 21, A-4232 Hagenberg, Austria
| | - Peter F. Stadler
- Max-Planck-Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany
- Bioinformatics Group, Department of Computer Science, Interdisciplinary Center of Bioinformatics, German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Competence Center for Scalable Data Services and Solutions, and Leipzig Research Center for Civilization Diseases, Leipzig University, Härtelstraße 16-18, D-04107 Leipzig, Germany
- Inst. f. Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria
- Facultad de Ciencias, Universidad National de Colombia, Bogotá, Colombia
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501 USA
| | - Marc Hellmuth
- Department of Mathematics, Faculty of Science, Stockholm University, SE 106 91 Stockholm, Sweden
| |
Collapse
|
5
|
Lafond M, Hellmuth M. Reconstruction of time-consistent species trees. Algorithms Mol Biol 2020; 15:16. [PMID: 32843891 PMCID: PMC7439642 DOI: 10.1186/s13015-020-00175-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2020] [Accepted: 07/25/2020] [Indexed: 02/04/2023] Open
Abstract
BACKGROUND The history of gene families-which are equivalent to event-labeled gene trees-can to some extent be reconstructed from empirically estimated evolutionary event-relations containing pairs of orthologous, paralogous or xenologous genes. The question then arises as whether inferred event-labeled gene trees are "biologically feasible" which is the case if one can find a species tree with which the gene tree can be reconciled in a time-consistent way. RESULTS In this contribution, we consider event-labeled gene trees that contain speciations, duplications as well as horizontal gene transfer (HGT) and we assume that the species tree is unknown. Although many problems become NP-hard as soon as HGT and time-consistency are involved, we show, in contrast, that the problem of finding a time-consistent species tree for a given event-labeled gene can be solved in polynomial-time. We provide a cubic-time algorithm to decide whether a "time-consistent" species tree for a given event-labeled gene tree exists and, in the affirmative case, to construct the species tree within the same time-complexity.
Collapse
Affiliation(s)
- Manuel Lafond
- Department of Computer Science, Université de Sherbrooke, 2500 Boul. de l’Université, Sherbrooke, J1K 2R1 Canada
| | - Marc Hellmuth
- School of Computing, University of Leeds, E C Stoner Building, Leeds, LS2 9JT UK
| |
Collapse
|
6
|
Geiß M, Laffitte MEG, Sánchez AL, Valdivia DI, Hellmuth M, Rosales MH, Stadler PF. Best match graphs and reconciliation of gene trees with species trees. J Math Biol 2020; 80:1459-1495. [PMID: 32002659 PMCID: PMC7052050 DOI: 10.1007/s00285-020-01469-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2019] [Revised: 01/08/2020] [Indexed: 11/19/2022]
Abstract
A wide variety of problems in computational biology, most notably the assessment of orthology, are solved with the help of reciprocal best matches. Using an evolutionary definition of best matches that captures the intuition behind the concept we clarify rigorously the relationships between reciprocal best matches, orthology, and evolutionary events under the assumption of duplication/loss scenarios. We show that the orthology graph is a subgraph of the reciprocal best match graph (RBMG). We furthermore give conditions under which an RBMG that is a cograph identifies the correct orthlogy relation. Using computer simulations we find that most false positive orthology assignments can be identified as so-called good quartets-and thus corrected-in the absence of horizontal transfer. Horizontal transfer, however, may introduce also false-negative orthology assignments.
Collapse
Affiliation(s)
- Manuela Geiß
- Bioinformatics Group, Department of Computer Science, Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany
| | - Marcos E. González Laffitte
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
| | - Alitzel López Sánchez
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
| | - Dulce I. Valdivia
- Centro de Ciencias Básicas, Universidad Autónoma de Aguascalientes, Av. Universidad 940, 20131 Aguascalientes, AGS México
- Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
| | - Marc Hellmuth
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany
- Center for Bioinformatics, Saarland University, Building E 2.1, P.O. Box 151150, 66041 Saarbrücken, Germany
| | - Maribel Hernández Rosales
- CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO Mexico
| | - Peter F. Stadler
- Bioinformatics Group, Department of Computer Science, Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
- Competence Center for Scalable Data Services and Solutions, Leipzig Research Center for Civilization Diseases, Leipzig University, Härtelstraße 16-18, 04107 Leipzig, Germany
- Max-Planck-Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
- Inst. f. Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090 Wien, Austria
- Facultad de Ciencias, Universidad National de Colombia, Bogotá, Colombia
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501 USA
| |
Collapse
|
7
|
Hellmuth M, Huber KT, Moulton V. Reconciling event-labeled gene trees with MUL-trees and species networks. J Math Biol 2019; 79:1885-1925. [PMID: 31410552 DOI: 10.1007/s00285-019-01414-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2018] [Revised: 05/08/2019] [Indexed: 11/30/2022]
Abstract
Phylogenomics commonly aims to construct evolutionary trees from genomic sequence information. One way to approach this problem is to first estimate event-labeled gene trees (i.e., rooted trees whose non-leaf vertices are labeled by speciation or gene duplication events), and to then look for a species tree which can be reconciled with this tree through a reconciliation map between the trees. In practice, however, it can happen that there is no such map from a given event-labeled tree to any species tree. An important situation where this might arise is where the species evolution is better represented by a network instead of a tree. In this paper, we therefore consider the problem of reconciling event-labeled trees with species networks. In particular, we prove that any event-labeled gene tree can be reconciled with some network and that, under certain mild assumptions on the gene tree, the network can even be assumed to be multi-arc free. To prove this result, we show that we can always reconcile the gene tree with some multi-labeled (MUL-)tree, which can then be "folded up" to produce the desired reconciliation and network. In addition, we study the interplay between reconciliation maps from event-labeled gene trees to MUL-trees and networks. Our results could be useful for understanding how genomes have evolved after undergoing complex evolutionary events such as polyploidy.
Collapse
Affiliation(s)
- Marc Hellmuth
- Institute of Mathematics and Computer Science, University of Greifswald, Greifswald, Germany. .,Center for Bioinformatics, Saarland University, Saarbrücken, Germany.
| | - Katharina T Huber
- School of Computing Sciences, University of East Anglia, Norwich, UK
| | - Vincent Moulton
- School of Computing Sciences, University of East Anglia, Norwich, UK
| |
Collapse
|