1
Janzen T, Etienne RS. Phylogenetic tree statistics: A systematic overview using the new R package 'treestats'. Mol Phylogenet Evol 2024:108168. [PMID: 39117295 DOI: 10.1016/j.ympev.2024.108168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 07/19/2024] [Accepted: 08/04/2024] [Indexed: 08/10/2024]
Abstract
Phylogenetic trees are believed to contain a wealth of information on diversification processes. However, comparing phylogenetic trees is not straightforward due to their high dimensionality. Researchers have therefore defined a wide range of low-dimensional summary statistics. Currently, it remains unexplored to what extent these summary statistics cover the same underlying information and which summary statistics best explain observed variation across phylogenies. Furthermore, a large subset of available summary statistics focuses on measuring the topological features of a phylogenetic tree, but these statistics are often explored only at the extreme edge cases of the fully balanced or imbalanced tree and not for trees of intermediate balance. Here, we introduce a new R package called 'treestats' that provides speed-optimized code to compute 70 summary statistics. We study correlations between summary statistics on empirical trees and on trees simulated using several diversification models. Furthermore, we introduce an algorithm to create intermediately balanced trees in a well-defined manner, in order to explore variation in summary statistics across a balance gradient. We find that almost all summary statistics are correlated with tree size, and find that it is difficult, if not impossible, to correct for tree size, unless the tree-generating model is known. Furthermore, we find that across empirical and simulated trees, at least three large clusters of correlated summary statistics can be found, where statistics group together based on information used (topology or branching times). However, the finer-grained correlation structure appears to depend strongly on either the taxonomic group studied (in empirical studies) or the tree-generating model (in simulation studies). Amongst statistics describing the (im)balance of a tree, we find that almost all statistics vary non-linearly, and sometimes even non-monotonically, with our generated balance gradient.
This indicates that balance is perhaps a more complex property of a tree than previously thought. Furthermore, using our new imbalancing algorithm, we devise a numerical test to identify balance statistics, and identify several statistics as balance statistics that were not previously considered as such. Lastly, our results lead to several recommendations on which statistics to select when analyzing and comparing phylogenetic trees.
Affiliation(s)
- Thijs Janzen
  - Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, the Netherlands
- Rampal S Etienne
  - Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, the Netherlands
2
Khurana MP, Scheidwasser-Clow N, Penn MJ, Bhatt S, Duchêne DA. The Limits of the Constant-rate Birth-Death Prior for Phylogenetic Tree Topology Inference. Syst Biol 2024; 73:235-246. [PMID: 38153910 PMCID: PMC11129600 DOI: 10.1093/sysbio/syad075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Revised: 12/20/2023] [Accepted: 12/27/2023] [Indexed: 12/30/2023] Open
Abstract
Birth-death models are stochastic processes describing speciation and extinction through time and across taxa and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under the constant-rate birth-death (crBD) model tend to differ from empirical trees, for example, with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between the crBD model and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which the crBD model differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD model trees frequently differ topologically compared with empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used Bayesian methods and crBD priors with those that used other non-crBD priors and non-Bayesian methods (i.e., maximum likelihood methods), we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using the crBD model in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. 
Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios but also highlight that empirically observed phylogenetic imbalance is highly improbable under the crBD model, leading to systematic bias in data sets with limited information content.
Affiliation(s)
- Mark P Khurana
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
- Neil Scheidwasser-Clow
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
- Matthew J Penn
  - Department of Statistics, University of Oxford, OX1 3LB, Oxford, UK
- Samir Bhatt
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
  - MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, SW7 2AZ, London, UK
- David A Duchêne
  - Centre for Evolutionary Hologenomics, University of Copenhagen, 1352 Copenhagen, Denmark
3
Vukičević D, Matijević D. The Connection of the Generalized Robinson-Foulds Metric with Partial Wiener Indices. Acta Biotheor 2023; 71:5. [PMID: 36695929 DOI: 10.1007/s10441-023-09457-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2021] [Accepted: 01/11/2023] [Indexed: 01/26/2023]
Abstract
In this work we propose the partial Wiener index as one possible measure of branching in phylogenetic evolutionary trees. We establish the connection between the generalized Robinson-Foulds (RF) metric for measuring the similarity of phylogenetic trees and partial Wiener indices by expressing the number of conflicting pairs of edges in the generalized RF metric in terms of partial Wiener indices. To do so we compute the minimum and maximum value of the partial Wiener index [Formula: see text], where [Formula: see text] is a binary rooted tree with root [Formula: see text] and [Formula: see text] leaves. Moreover, under the Yule probabilistic model, we show how to compute the expected value of [Formula: see text]. As a direct consequence, we give exact formulas for the upper bound and the expected number of conflicting pairs. By doing so we provide a better theoretical understanding of the computational complexity of the generalized RF metric.
Affiliation(s)
- Damir Vukičević
  - Department of Mathematics, Faculty of Science, University of Split, Ruđera Boškovića 33, 21000, Split, Croatia
- Domagoj Matijević
  - Department of Mathematics, University of Osijek, Trg Lj. Gaja 6, 31000, Osijek, Croatia
4
Modrak V, Soltysova Z. Exploration of the optimal modularity in assembly line design. Sci Rep 2022; 12:20414. [PMID: 36437404 PMCID: PMC9701789 DOI: 10.1038/s41598-022-24972-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2022] [Accepted: 11/22/2022] [Indexed: 11/29/2022] Open
Abstract
It is widely accepted that a proper structural modularity degree of assembly processes in terms of mass customization has a positive effect on their efficiency because it, among other things, increases manufacturing flexibility and productivity. On the other hand, most practical approaches to identifying such a degree are based on intuition or analytical reasoning rather than on scientific foundations. The first way can be used for simple assembly tasks, but in more complex assembly processes it lags behind the second. The purpose was to create a methodology for selecting the optimal modular assembly model from among a predefined set of alternatives. The methodology is based on exploration of the relations between modularity measures and complexity issues as well as the relationship between structural modularity and symmetry. In particular, the linkage between modularity and complexity properties has been explored in order to show how modularization can affect the distribution of the total structural complexity across the entire assembly line. To solve this selection problem, three different methods are preliminarily suggested and compared via a series of numerical tests. Two of them are novel contributions of this work, while the third method, developed earlier for the purpose of finding and evaluating community structure in networks, was adapted to the given application domain. Based on the obtained results, one of these methods is prioritized over another, since it offers more promising and more precise results.
Affiliation(s)
- Vladimir Modrak
  - Faculty of Manufacturing Technologies, Technical University of Kosice, 080 01 Prešov, Slovakia
- Zuzana Soltysova
  - Faculty of Manufacturing Technologies, Technical University of Kosice, 080 01 Prešov, Slovakia
5
Two results about the Sackin and Colless indices for phylogenetic trees and their shapes. J Math Biol 2022; 85:69. [PMID: 36418585 DOI: 10.1007/s00285-022-01831-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2022] [Revised: 08/27/2022] [Accepted: 10/23/2022] [Indexed: 11/25/2022]
Abstract
The Sackin and Colless indices are two widely-used metrics for measuring the balance of trees and for testing evolutionary models in phylogenetics. This short paper contributes two results about the Sackin and Colless indices of trees. One result is the asymptotic analysis of the expected Sackin and Colless indices of tree shapes (which are full binary rooted unlabelled trees) under the uniform model where tree shapes are sampled with equal probability. Another is a short direct proof of the closed formula for the expected Sackin index of phylogenetic trees (which are full binary rooted trees with leaves being labelled with taxa) under the uniform model.
6
Bienvenu F, Cardona G, Scornavacca C. Revisiting Shao and Sokal's [Formula: see text] index of phylogenetic balance. J Math Biol 2021; 83:52. [PMID: 34676444 DOI: 10.1007/s00285-021-01662-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/16/2020] [Revised: 09/08/2021] [Accepted: 09/08/2021] [Indexed: 11/24/2022]
Abstract
Measures of phylogenetic balance, such as the Colless and Sackin indices, play an important role in phylogenetics. Unfortunately, these indices are specifically designed for phylogenetic trees, and do not extend naturally to phylogenetic networks (which are increasingly used to describe reticulate evolution). This led us to consider a lesser-known balance index, whose definition is based on a probabilistic interpretation that is equally applicable to trees and to networks. This index, known as the [Formula: see text] index, was first proposed by Shao and Sokal (Syst Zool 39(3): 266-276, 1990). Surprisingly, it does not seem to have been studied mathematically since. Likewise, it is used only sporadically in the biological literature, where it tends to be viewed as arcane. In this paper, we study mathematical properties of [Formula: see text] such as its expectation and variance under the most common models of random trees and its extremal values over various classes of phylogenetic networks. We also assess its relevance in biological applications, and find it to be comparable to that of the Colless and Sackin indices. Altogether, our results call for a reevaluation of the status of this somewhat forgotten measure of phylogenetic balance.
Affiliation(s)
- François Bienvenu
  - Institut des Sciences de l'Evolution de Montpellier, Université de Montpellier, CNRS, IRD, EPHE, F-34095, Montpellier, France
  - UMR AGAP, Université de Montpellier, CIRAD, INRAE, L'institut Agro, F-34398, Montpellier, France
- Gabriel Cardona
  - Institut des Sciences de l'Evolution de Montpellier, Université de Montpellier, CNRS, IRD, EPHE, F-34095, Montpellier, France
  - Department of Mathematics and Computer Science, University of the Balearic Islands, Ctra. Valldemossa km 7.5, E-07120, Palma, Spain
- Celine Scornavacca
  - Institut des Sciences de l'Evolution de Montpellier, Université de Montpellier, CNRS, IRD, EPHE, F-34095, Montpellier, France
7
King MC, Rosenberg NA. A simple derivation of the mean of the Sackin index of tree balance under the uniform model on rooted binary labeled trees. Math Biosci 2021; 342:108688. [PMID: 34537229 DOI: 10.1016/j.mbs.2021.108688] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/03/2021] [Revised: 08/11/2021] [Accepted: 08/12/2021] [Indexed: 11/29/2022]
Abstract
In mathematical phylogenetics, the Sackin index, measuring the sum of path lengths between leaves and the root, is one of the most frequently used measures of balance for phylogenetic trees. The uniform model, in which all rooted binary labeled trees for a given set of leaf labels are assumed to be equiprobable, is one of the most frequently used models for describing a probability distribution on the set of rooted binary labeled trees. This note provides a simple new derivation of the mean value of the Sackin index of tree balance under the uniform model on rooted binary labeled trees. The new derivation suggests a simple form of the mean Sackin index in terms of the Catalan numbers, quickly enabling a verification of the asymptotic value for the mean.
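As a minimal illustrative sketch (not code from the cited paper), the Sackin index described in this abstract can be computed as the sum, over all leaves, of the path length from the leaf to the root. The nested-tuple tree encoding and the function name `sackin_index` are assumptions made only for this example.

```python
def sackin_index(tree, depth=0):
    """Sum of leaf depths of a rooted tree.

    Assumed encoding (hypothetical, for illustration): an internal node is a
    tuple of its children; anything that is not a tuple is a leaf label.
    """
    if not isinstance(tree, tuple):
        # A leaf contributes its own depth (number of edges to the root).
        return depth
    # An internal node contributes the depths of all leaves below it.
    return sum(sackin_index(child, depth + 1) for child in tree)


# Two rooted binary trees on four leaves.
caterpillar = ((("a", "b"), "c"), "d")   # fully imbalanced
balanced = (("a", "b"), ("c", "d"))      # fully balanced

print(sackin_index(caterpillar))  # leaf depths 3 + 3 + 2 + 1
print(sackin_index(balanced))     # leaf depths 2 + 2 + 2 + 2
```

On these two four-leaf trees the sketch gives 9 for the caterpillar and 8 for the balanced tree, so a larger value corresponds to a less balanced shape.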
Affiliation(s)
- Matthew C King
  - Department of Biology, Stanford University, Stanford, CA 94305, United States of America
- Noah A Rosenberg
  - Department of Biology, Stanford University, Stanford, CA 94305, United States of America
8
Measuring tree balance using symmetry nodes - A new balance index and its extremal properties. Math Biosci 2021; 341:108690. [PMID: 34433072 DOI: 10.1016/j.mbs.2021.108690] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/04/2021] [Revised: 08/04/2021] [Accepted: 08/04/2021] [Indexed: 11/22/2022]
Abstract
Effects like selection in evolution as well as fertility inheritance in the development of populations can lead to a higher degree of asymmetry in evolutionary trees than expected under a null hypothesis. To identify and quantify such influences, various balance indices were proposed in the phylogenetic literature and have been in use for decades. However, so far no balance index was based on the number of symmetry nodes, even though symmetry nodes play an important role in other areas of mathematical phylogenetics and despite the fact that symmetry nodes are a quite natural way to measure balance or symmetry of a given tree. The aim of this manuscript is thus twofold: First, we will introduce the symmetry nodes index as an index for measuring balance of phylogenetic trees and analyze its extremal properties. We also show that this index can be calculated in linear time. This new index turns out to be a generalization of a simple and well-known balance index, namely the cherry index, as well as a specialization of another, less established, balance index, namely Rogers' J index. Thus, it is the second objective of the present manuscript to compare the new symmetry nodes index to these two indices and to underline its advantages. In order to do so, we will derive some extremal properties of the cherry index and Rogers' J index along the way and thus complement existing studies on these indices. Moreover, we used the programming language R to implement all three indices in the software package symmeTree, which has been made publicly available.
9
Squaring within the Colless index yields a better balance index. Math Biosci 2020; 331:108503. [PMID: 33253745 DOI: 10.1016/j.mbs.2020.108503] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/29/2020] [Revised: 10/28/2020] [Accepted: 10/28/2020] [Indexed: 11/24/2022]
Abstract
The Colless index for bifurcating phylogenetic trees, introduced by Colless (1982), is defined as the sum, over all internal nodes v of the tree, of the absolute value of the difference of the sizes of the clades defined by the children of v. It is one of the most popular phylogenetic balance indices, because, in addition to measuring the balance of a tree in a very simple and intuitive way, it turns out to be one of the most powerful and discriminating phylogenetic shape indices. But it has some drawbacks. On the one hand, although its minimum value is reached at the so-called maximally balanced trees, it is almost always reached also at trees that are not maximally balanced. On the other hand, its definition as a sum of absolute values of differences makes it difficult to study its distribution analytically under probabilistic models of bifurcating phylogenetic trees. In this paper we show that if we replace in its definition the absolute values of the differences of clade sizes by the squares of these differences, all these drawbacks are overcome and the resulting index is still more powerful and discriminating than the original Colless index.
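The definition in this abstract can be illustrated with a short sketch, which is not the authors' implementation. It assumes a bifurcating tree encoded as nested 2-tuples with non-tuple leaf labels; the helper name `_leaves_and_index` and the `power` parameter are hypothetical names introduced here. With `power=1` each internal node adds the absolute difference of its two clade sizes (Colless); with `power=2` it adds the squared difference (the variant the abstract proposes).

```python
def _leaves_and_index(tree, power):
    """Return (number of leaves, summed |n_left - n_right| ** power).

    Assumed encoding (hypothetical): an internal node is a 2-tuple
    (left, right); anything that is not a tuple is a leaf label.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, idx_left = _leaves_and_index(left, power)
    n_right, idx_right = _leaves_and_index(right, power)
    # Contribution of this internal node: clade-size difference of its children.
    here = abs(n_left - n_right) ** power
    return n_left + n_right, idx_left + idx_right + here


def colless_index(tree):
    """Sum of absolute clade-size differences over all internal nodes."""
    return _leaves_and_index(tree, 1)[1]


def quadratic_colless_index(tree):
    """Same sum, with each difference squared instead of taken in absolute value."""
    return _leaves_and_index(tree, 2)[1]


# Two six-leaf trees: the first splits every clade as evenly as possible,
# the second splits the root 4 vs 2.
even_splits = ((("a", "b"), "c"), (("d", "e"), "f"))
root_4_vs_2 = ((("a", "b"), ("c", "d")), ("e", "f"))

for t in (even_splits, root_4_vs_2):
    print(colless_index(t), quadratic_colless_index(t))
```

In this small example both trees have Colless index 2, while the squared variant gives 2 for the evenly split tree and 4 for the other, which is one concrete instance of the tie described in the abstract.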