1
Berling L, Collienne L, Gavryushkin A. Estimating the mean in the space of ranked phylogenetic trees. Bioinformatics 2024; 40:btae514. [PMID: 39177090 PMCID: PMC11364146 DOI: 10.1093/bioinformatics/btae514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Revised: 05/16/2024] [Accepted: 08/21/2024] [Indexed: 08/24/2024] Open
Abstract
MOTIVATION: Reconstructing evolutionary histories of biological entities, such as genes, cells, organisms, populations, and species, from phenotypic and molecular sequencing data is central to many biological, palaeontological, and biomedical disciplines. Typically, due to uncertainties and incompleteness in data, the true evolutionary history (phylogeny) is challenging to estimate. Statistical modelling approaches address this problem by introducing and studying probability distributions over all possible evolutionary histories, but can also introduce uncertainties due to misspecification. In practice, computational methods are deployed to learn those distributions, typically by sampling them. This approach, however, is fundamentally challenging as it requires designing and implementing various statistical methods over a space of phylogenetic trees (or treespace). Although the problem of developing statistics over a treespace has received substantial attention in the literature and numerous breakthroughs have been made, it remains largely unsolved. The challenge of solving this problem is 2-fold: a treespace has nontrivial, often counter-intuitive geometry, implying that much of classical Euclidean statistics does not immediately apply; and many parametrizations of treespace with promising statistical properties are computationally hard, so they cannot be used in data analyses. As a result, there is no single conventional method for estimating even the most fundamental statistics over any treespace, such as mean and variance, and various heuristics are used in practice. Despite the existence of numerous tree summary methods to approximate means of probability distributions over a treespace based on its geometry, and the theoretical promise of this idea, none of the attempts resulted in a practical method for summarizing tree samples.
RESULTS: In this paper, we present a tree summary method along with useful properties of our chosen treespace while focusing on its impact on phylogenetic analyses of real datasets. We perform an extensive benchmark study and demonstrate that our method outperforms the currently most popular methods with respect to a number of important 'quality' statistics. Further, we apply our method to three empirical datasets ranging from cancer evolution to linguistics and find novel insights into corresponding evolutionary problems in all of them. We hence conclude that this treespace is a promising candidate to serve as a foundation for developing statistics over phylogenetic trees analytically, as well as new computational tools for evolutionary data analyses.
AVAILABILITY AND IMPLEMENTATION: An implementation is available at https://github.com/bioDS/Centroid-Code.
Affiliation(s)
- Lars Berling
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
- Lena Collienne
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
- Alex Gavryushkin
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
2
Samyak R, Palacios JA. Statistical summaries of unlabelled evolutionary trees. Biometrika 2024; 111:171-193. [PMID: 38352626 PMCID: PMC10861027 DOI: 10.1093/biomet/asad025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2021] [Indexed: 02/16/2024] Open
Abstract
Rooted and ranked phylogenetic trees are mathematical objects that are useful in modelling hierarchical data and evolutionary relationships with applications to many fields such as evolutionary biology and genetic epidemiology. Bayesian phylogenetic inference usually explores the posterior distribution of trees via Markov chain Monte Carlo methods. However, assessing uncertainty and summarizing distributions remains challenging for these types of structures. While labelled phylogenetic trees have been extensively studied, relatively less literature exists for unlabelled trees that are increasingly useful, for example when one seeks to summarize samples of trees obtained with different methods, or from different samples and environments, and wishes to assess the stability and generalizability of these summaries. In our paper, we exploit recently proposed distance metrics of unlabelled ranked binary trees and unlabelled ranked genealogies, or trees equipped with branch lengths, to define the Fréchet mean, variance and interquartile sets as summaries of these tree distributions. We provide an efficient combinatorial optimization algorithm for computing the Fréchet mean of a sample or of distributions on unlabelled ranked tree shapes and unlabelled ranked genealogies. We show the applicability of our summary statistics for studying popular tree distributions and for comparing the SARS-CoV-2 evolutionary trees across different locations during the COVID-19 epidemic in 2020. Our current implementations are publicly available at https://github.com/RSamyak/fmatrix.
Affiliation(s)
- Rajanala Samyak
  - Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, California 94305, U.S.A
- Julia A Palacios
  - Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, California 94305, U.S.A
3
Pezo V, Jaziri F, Bourguignon PY, Louis D, Jacobs-Sera D, Rozenski J, Pochet S, Herdewijn P, Hatfull GF, Kaminski PA, Marliere P. Noncanonical DNA polymerization by aminoadenine-based siphoviruses. Science 2021; 372:520-524. [PMID: 33926956 DOI: 10.1126/science.abe6542] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 09/04/2020] [Accepted: 03/25/2021] [Indexed: 01/05/2023]
Abstract
Bacteriophage genomes harbor the broadest chemical diversity of nucleobases across all life forms. Certain DNA viruses that infect hosts as diverse as cyanobacteria, proteobacteria, and actinobacteria exhibit wholesale substitution of aminoadenine for adenine, thereby forming three hydrogen bonds with thymine and violating Watson-Crick pairing rules. Aminoadenine-encoded DNA polymerases, homologous to the Klenow fragment of bacterial DNA polymerase I that includes 3'-exonuclease but lacks 5'-exonuclease, were found to preferentially select for aminoadenine instead of adenine in deoxynucleoside triphosphate incorporation templated by thymine. Polymerase genes occur in synteny with genes for a biosynthesis enzyme that produces aminoadenine deoxynucleotides in a wide array of Siphoviridae bacteriophages. Congruent phylogenetic clustering of the polymerases and biosynthesis enzymes suggests that aminoadenine has propagated in DNA alongside adenine since archaic stages of evolution.
Affiliation(s)
- Valerie Pezo
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 2 Rue Gaston Crémieux, 91057 Evry, France
- Faten Jaziri
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 2 Rue Gaston Crémieux, 91057 Evry, France
- Pierre-Yves Bourguignon
  - Werkstatt fuer Potenzielle Genetik, Naunynstrasse 30, 10997 Berlin, Germany; TESSSI, 81 Rue Réaumur, 75002 Paris, France
- Deborah Jacobs-Sera
  - Department of Biological Sciences, University of Pittsburgh, 4249 Fifth Avenue, Pittsburgh, PA 15260 USA
- Jef Rozenski
  - Laboratory of Medicinal Chemistry, Rega Institute for Biomedical Research, KU Leuven, Herestraat 49, Box 1041, 3000 Leuven, Belgium
- Sylvie Pochet
  - Organic Chemistry, CNRS UMR3523, Department of Chemistry and Biocatalysis, Institut Pasteur, 25-28 Rue du Docteur Roux, 75015 Paris, France
- Piet Herdewijn
  - Laboratory of Medicinal Chemistry, Rega Institute for Biomedical Research, KU Leuven, Herestraat 49, Box 1041, 3000 Leuven, Belgium
- Graham F Hatfull
  - Department of Biological Sciences, University of Pittsburgh, 4249 Fifth Avenue, Pittsburgh, PA 15260 USA
- Pierre-Alexandre Kaminski
  - Biology of Gram-Positive Pathogens, CNRS URL3526, Department of Microbiology, Institut Pasteur, 25-28 Rue du Docteur Roux, 75015 Paris, France
- Philippe Marliere
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, 2 Rue Gaston Crémieux, 91057 Evry, France; TESSSI, 81 Rue Réaumur, 75002 Paris, France
4
Brown DG, Owen M. Mean and Variance of Phylogenetic Trees. Syst Biol 2019; 69:139-154. [DOI: 10.1093/sysbio/syz041] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/02/2017] [Revised: 05/13/2019] [Accepted: 05/24/2019] [Indexed: 11/13/2022] Open
Abstract
We describe the use of the Fréchet mean and variance in the Billera–Holmes–Vogtmann (BHV) treespace to summarize and explore the diversity of a set of phylogenetic trees. We show that the Fréchet mean is comparable to other summary methods, and, despite its stickiness property, is more likely to be binary than the majority-rule consensus tree. We show that the Fréchet variance is faster and more precise than commonly used variance measures. The Fréchet mean and variance are more theoretically justified, and more robust, than previous estimates of this type and can be estimated reasonably efficiently, providing a foundation for building more advanced statistical methods and leading to applications such as mean hypothesis testing and outlier detection.
Affiliation(s)
- Daniel G Brown
  - David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave. W, Waterloo ON N2L 3G1, Canada
- Megan Owen
  - Department of Mathematics, Lehman College, City University of New York, 250 Bedford Park Blvd West, Bronx, New York, NY 10468, USA
6
Affiliation(s)
- Amy Willis
  - Department of Biostatistics, University of Washington, Seattle, WA
- Rayna Bell
  - Smithsonian Institution, National Museum of Natural History, Washington, DC
8
St. John K. Review Paper: The Shape of Phylogenetic Treespace. Syst Biol 2017; 66:e83-e94. [PMID: 28173538 PMCID: PMC5837343 DOI: 10.1093/sysbio/syw025] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 11/20/2015] [Revised: 12/16/2015] [Accepted: 03/22/2016] [Indexed: 11/23/2022] Open
Abstract
Trees are a canonical structure for representing evolutionary histories. Many popular criteria used to infer optimal trees are computationally hard, and the number of possible tree shapes grows super-exponentially in the number of taxa. The underlying structure of the spaces of trees yields rich insights that can improve the search for optimal trees, both in accuracy and in running time, and the analysis and visualization of results. We review the past work on analyzing and comparing trees by their shape as well as recent work that incorporates trees with weighted branch lengths.
Affiliation(s)
- Katherine St. John
  - Department of Mathematics and Computer Science, Lehman College, NY 10034, USA
10
Gavryushkin A, Drummond AJ. The space of ultrametric phylogenetic trees. J Theor Biol 2016; 403:197-208. [PMID: 27188249 DOI: 10.1016/j.jtbi.2016.05.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 09/28/2015] [Revised: 03/17/2016] [Accepted: 05/01/2016] [Indexed: 10/21/2022]
Abstract
The reliability of a phylogenetic inference method from genomic sequence data is ensured by its statistical consistency. Bayesian inference methods produce a sample of phylogenetic trees from the posterior distribution given sequence data. Hence the question of statistical consistency of such methods is equivalent to the consistency of the summary of the sample. More generally, statistical consistency is ensured by the tree space used to analyse the sample. In this paper, we consider two standard parameterisations of phylogenetic time-trees used in evolutionary models: inter-coalescent interval lengths and absolute times of divergence events. For each of these parameterisations we introduce a natural metric space on ultrametric phylogenetic trees. We compare the introduced spaces with existing models of tree space and formulate several formal requirements that a metric space on phylogenetic trees must possess in order to be a satisfactory space for statistical analysis, and justify them. We show that only a few known constructions of the space of phylogenetic trees satisfy these requirements. However, our results suggest that these basic requirements are not enough to distinguish between the two metric spaces we introduce and that the choice between metric spaces requires additional properties to be considered. In particular, the summary tree minimising the squared distance to the trees from the sample might be different for different parameterisations. This suggests that further fundamental insight is needed into the problem of statistical consistency of phylogenetic inference methods.
Affiliation(s)
- Alex Gavryushkin
  - Centre for Computational Evolution, The University of Auckland, New Zealand
- Alexei J Drummond
  - Centre for Computational Evolution, The University of Auckland, New Zealand