1
|
Disanto F, Fuchs M, Paningbatan AR, Rosenberg NA. The distributions under two species-tree models of the number of root ancestral configurations for matching gene trees and species trees. ANN APPL PROBAB 2022. [DOI: 10.1214/22-aap1791] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Affiliation(s)
| | - Michael Fuchs
- Department of Mathematical Sciences, National Chengchi University
| | | | | |
Collapse
|
2
|
Palacios JA, Bhaskar A, Disanto F, Rosenberg NA. Enumeration of binary trees compatible with a perfect phylogeny. J Math Biol 2022; 84:54. [PMID: 35552538 PMCID: PMC9098623 DOI: 10.1007/s00285-022-01748-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2021] [Revised: 03/07/2022] [Accepted: 03/31/2022] [Indexed: 11/25/2022]
Abstract
Evolutionary models used for describing molecular sequence variation suppose that at a non-recombining genomic segment, sequences share ancestry that can be represented as a genealogy-a rooted, binary, timed tree, with tips corresponding to individual sequences. Under the infinitely-many-sites mutation model, mutations are randomly superimposed along the branches of the genealogy, so that every mutation occurs at a chromosomal site that has not previously mutated; if a mutation occurs at an interior branch, then all individuals descending from that branch carry the mutation. The implication is that observed patterns of molecular variation from this model impose combinatorial constraints on the hidden state space of genealogies. In particular, observed molecular variation can be represented in the form of a perfect phylogeny, a tree structure that fully encodes the mutational differences among sequences. For a sample of n sequences, a perfect phylogeny might not possess n distinct leaves, and hence might be compatible with many possible binary tree structures that could describe the evolutionary relationships among the n sequences. Here, we investigate enumerative properties of the set of binary ranked and unranked tree shapes that are compatible with a perfect phylogeny, and hence, the binary ranked and unranked tree shapes conditioned on an observed pattern of mutations under the infinitely-many-sites mutation model. We provide a recursive enumeration of these shapes. We consider both perfect phylogenies that can be represented as binary and those that are multifurcating. The results have implications for computational aspects of the statistical inference of evolutionary parameters that underlie sets of molecular sequences.
Collapse
Affiliation(s)
- Julia A Palacios
- Department of Statistics, Stanford University, Stanford, CA, USA.
- Department of Biomedical Data Science, Stanford Medicine, Stanford, CA, USA.
| | - Anand Bhaskar
- Department of Genetics, Stanford University, Stanford, CA, USA
| | | | | |
Collapse
|
3
|
Rosenberg NA. On the Colijn-Plazzotta numbering scheme for unlabeled binary rooted trees. DISCRETE APPLIED MATHEMATICS (AMSTERDAM, NETHERLANDS : 1988) 2021; 291:88-98. [PMID: 33364668 PMCID: PMC7751944 DOI: 10.1016/j.dam.2020.11.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Colijn & Plazzotta (Syst. Biol. 67:113-126, 2018) introduced a scheme for bijectively associating the unlabeled binary rooted trees with the positive integers. First, the rank 1 is associated with the 1-leaf tree. Proceeding recursively, ordered pair (k 1, k 2), k 1 ⩾ k 2 ⩾ 1, is then associated with the tree whose left subtree has rank k 1 and whose right subtree has rank k 2. Following dictionary order on ordered pairs, the tree whose left and right subtrees have the ordered pair of ranks (k 1, k 2) is assigned rank k 1(k 1 - 1)/2 + 1 + k 2. With this ranking, given a number of leaves n, we determine recursions for a n , the smallest rank assigned to some tree with n leaves, and b n , the largest rank assigned to some tree with n leaves. The smallest rank a n is assigned to the maximally balanced tree, and the largest rank b n is assigned to the caterpillar. For n equal to a power of 2, the value of a n is seen to increase exponentially with 2α n for a constant α ≈ 1.24602; more generally, we show it is bounded a n < 1.5 n . The value of b n is seen to increase with 2 β ( 2 n ) for a constant β ≈ 1.05653. The great difference in the rates of increase for a n and b n indicates that as the index v is incremented, the number of leaves for the tree associated with rank v quickly traverses a wide range of values. We interpret the results in relation to applications in evolutionary biology.
Collapse
Affiliation(s)
- Noah A Rosenberg
- Department of Biology, Stanford University, Stanford, CA 94305 USA
| |
Collapse
|
4
|
Truszkowski J, Scornavacca C, Pardi F. Computing the probability of gene trees concordant with the species tree in the multispecies coalescent. Theor Popul Biol 2020; 137:22-31. [PMID: 33333117 DOI: 10.1016/j.tpb.2020.12.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2020] [Revised: 12/04/2020] [Accepted: 12/08/2020] [Indexed: 10/22/2022]
Abstract
The multispecies coalescent process models the genealogical relationships of genes sampled from several species, enabling useful predictions about phenomena such as the discordance between a gene tree and the species phylogeny due to incomplete lineage sorting. Conversely, knowledge of large collections of gene trees can inform us about several aspects of the species phylogeny, such as its topology and ancestral population sizes. A fundamental open problem in this context is how to efficiently compute the probability of a gene tree topology, given the species phylogeny. Although a number of algorithms for this task have been proposed, they either produce approximate results, or, when they are exact, they do not scale to large data sets. In this paper, we present some progress towards exact and efficient computation of the probability of a gene tree topology. We provide a new algorithm that, given a species tree and the number of genes sampled for each species, calculates the probability that the gene tree topology will be concordant with the species tree. Moreover, we provide an algorithm that computes the probability of any specific gene tree topology concordant with the species tree. Both algorithms run in polynomial time and have been implemented in Python. Experiments show that they are able to analyze data sets where thousands of genes are sampled in a matter of minutes to hours.
Collapse
Affiliation(s)
| | - Celine Scornavacca
- ISEM, CNRS, Université Montpellier, Montpellier, France; Institut de Biologie Computationnelle, Montpellier, France
| | - Fabio Pardi
- LIRMM, CNRS, Université Montpellier, Montpellier, France; Institut de Biologie Computationnelle, Montpellier, France.
| |
Collapse
|
5
|
Counting and sampling gene family evolutionary histories in the duplication-loss and duplication-loss-transfer models. J Math Biol 2020; 80:1353-1388. [PMID: 32060618 PMCID: PMC7052048 DOI: 10.1007/s00285-019-01465-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2019] [Revised: 11/18/2019] [Indexed: 10/28/2022]
Abstract
Given a set of species whose evolution is represented by a species tree, a gene family is a group of genes having evolved from a single ancestral gene. A gene family evolves along the branches of a species tree through various mechanisms, including-but not limited to-speciation ([Formula: see text]), gene duplication ([Formula: see text]), gene loss ([Formula: see text]), and horizontal gene transfer ([Formula: see text]). The reconstruction of a gene tree representing the evolution of a gene family constrained by a species tree is an important problem in phylogenomics. However, unlike in the multispecies coalescent evolutionary model that considers only speciation and incomplete lineage sorting events, very little is known about the search space for gene family histories accounting for gene duplication, gene loss and horizontal gene transfer (the [Formula: see text]-model). In this work, we introduce the notion of evolutionary histories defined as a binary ordered rooted tree describing the evolution of a gene family, constrained by a species tree in the [Formula: see text]-model. We provide formal grammars describing the set of all evolutionary histories that are compatible with a given species tree, whether it is ranked or unranked. These grammars allow us, using either analytic combinatorics or dynamic programming, to efficiently compute the number of histories of a given size, and also to generate random histories of a given size under the uniform distribution. We apply these tools to obtain exact asymptotics for the number of gene family histories for two species trees, the rooted caterpillar and complete binary tree, as well as estimates of the range of the exponential growth factor of the number of histories for random species trees of size up to 25. Our results show that including horizontal gene transfers induce a dramatic increase of the number of evolutionary histories. We also show that, within ranked species trees, the number of evolutionary histories in the [Formula: see text]-model is almost independent of the species tree topology. These results establish firm foundations for the development of ensemble methods for the prediction of reconciliations.
Collapse
|
6
|
Enumeration of compact coalescent histories for matching gene trees and species trees. J Math Biol 2018; 78:155-188. [PMID: 30116881 DOI: 10.1007/s00285-018-1271-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2018] [Revised: 07/12/2018] [Indexed: 10/28/2022]
Abstract
Compact coalescent histories are combinatorial structures that describe for a given gene tree G and species tree S possibilities for the numbers of coalescences of G that take place on the various branches of S. They have been introduced as a data structure for evaluating probabilities of gene tree topologies conditioning on species trees, reducing computation time compared to standard coalescent histories. When gene trees and species trees have a matching labeled topology [Formula: see text], the compact coalescent histories of t are encoded by particular integer labelings of the branches of t, each integer specifying the number of coalescent events of G present in a branch of S. For matching gene trees and species trees, we investigate enumerative properties of compact coalescent histories. We report a recursion for the number of compact coalescent histories for matching gene trees and species trees, using it to study the numbers of compact coalescent histories for small trees. We show that the number of compact coalescent histories equals the number of coalescent histories if and only if the labeled topology is a caterpillar or a bicaterpillar. The number of compact coalescent histories is seen to increase with tree imbalance: we prove that as the number of taxa n increases, the exponential growth of the number of compact coalescent histories follows [Formula: see text] in the case of caterpillar or bicaterpillar labeled topologies and approximately [Formula: see text] and [Formula: see text] for lodgepole and balanced topologies, respectively. We prove that the mean number of compact coalescent histories of a labeled topology of size n selected uniformly at random grows with [Formula: see text]. Our results contribute to the analysis of the computational complexity of algorithms for computing gene tree probabilities, and to the combinatorial study of gene trees and species trees more generally.
Collapse
|
7
|
Disanto F, Rosenberg NA. On the Number of Non-equivalent Ancestral Configurations for Matching Gene Trees and Species Trees. Bull Math Biol 2017; 81:384-407. [PMID: 28913585 DOI: 10.1007/s11538-017-0342-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2017] [Accepted: 08/31/2017] [Indexed: 11/29/2022]
Abstract
An ancestral configuration is one of the combinatorially distinct sets of gene lineages that, for a given gene tree, can reach a given node of a specified species tree. Ancestral configurations have appeared in recursive algebraic computations of the conditional probability that a gene tree topology is produced under the multispecies coalescent model for a given species tree. For matching gene trees and species trees, we study the number of ancestral configurations, considered up to an equivalence relation introduced by Wu (Evolution 66:763-775, 2012) to reduce the complexity of the recursive probability computation. We examine the largest number of non-equivalent ancestral configurations possible for a given tree size n. Whereas the smallest number of non-equivalent ancestral configurations increases polynomially with n, we show that the largest number increases with [Formula: see text], where k is a constant that satisfies [Formula: see text]. Under a uniform distribution on the set of binary labeled trees with a given size n, the mean number of non-equivalent ancestral configurations grows exponentially with n. The results refine an earlier analysis of the number of ancestral configurations considered without applying the equivalence relation, showing that use of the equivalence relation does not alter the exponential nature of the increase with tree size.
Collapse
Affiliation(s)
- Filippo Disanto
- Department of Biology, Stanford University, Stanford, CA, USA. .,Department of Mathematics, University of Pisa, Pisa, Italy.
| | | |
Collapse
|
8
|
Inferring rooted species trees from unrooted gene trees using approximate Bayesian computation. Mol Phylogenet Evol 2017; 116:13-24. [PMID: 28780022 DOI: 10.1016/j.ympev.2017.07.017] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2016] [Revised: 03/26/2017] [Accepted: 07/22/2017] [Indexed: 02/01/2023]
Abstract
Methods for inferring species trees from gene trees motivated by incomplete lineage sorting typically use either rooted gene trees to infer a rooted species tree, or use unrooted gene trees to infer an unrooted species tree, which is then typically rooted using one or more outgroups. Theoretically, however, it has been known since 2011 that it is possible to consistently infer the root of the species tree directly from unrooted gene trees without assuming an outgroup. Here, we use approximate Bayesian computation to infer the root of the species tree from unrooted gene trees assuming the multispecies coalescent model. It is hoped that this approach will be useful in cases where an appropriate outgroup is difficult to find and gene trees do not follow a molecular clock. We use approximate Bayesian computation to infer the root of the species tree from unrooted gene trees. This approach could also be useful when there is prior information that makes a small number of root locations plausible in an unrooted species tree.
Collapse
|