1. Berling L, Collienne L, Gavryushkin A. Estimating the mean in the space of ranked phylogenetic trees. Bioinformatics 2024; 40:btae514. [PMID: 39177090 PMCID: PMC11364146 DOI: 10.1093/bioinformatics/btae514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Revised: 05/16/2024] [Accepted: 08/21/2024] [Indexed: 08/24/2024] Open
Abstract
MOTIVATION: Reconstructing evolutionary histories of biological entities, such as genes, cells, organisms, populations, and species, from phenotypic and molecular sequencing data is central to many biological, palaeontological, and biomedical disciplines. Typically, due to uncertainties and incompleteness in data, the true evolutionary history (phylogeny) is challenging to estimate. Statistical modelling approaches address this problem by introducing and studying probability distributions over all possible evolutionary histories, but can also introduce uncertainties due to misspecification. In practice, computational methods are deployed to learn those distributions, typically by sampling them. This approach, however, is fundamentally challenging as it requires designing and implementing various statistical methods over a space of phylogenetic trees (or treespace). Although the problem of developing statistics over a treespace has received substantial attention in the literature and numerous breakthroughs have been made, it remains largely unsolved. The challenge of solving this problem is 2-fold: a treespace has nontrivial, often counter-intuitive geometry, implying that much of classical Euclidean statistics does not immediately apply; and many parametrizations of treespace with promising statistical properties are computationally hard, so they cannot be used in data analyses. As a result, there is no single conventional method for estimating even the most fundamental statistics over any treespace, such as mean and variance, and various heuristics are used in practice. Despite the existence of numerous tree summary methods to approximate means of probability distributions over a treespace based on its geometry, and the theoretical promise of this idea, none of the attempts resulted in a practical method for summarizing tree samples.
RESULTS: In this paper, we present a tree summary method along with useful properties of our chosen treespace while focusing on its impact on phylogenetic analyses of real datasets. We perform an extensive benchmark study and demonstrate that our method outperforms the currently most popular methods with respect to a number of important 'quality' statistics. Further, we apply our method to three empirical datasets ranging from cancer evolution to linguistics and find novel insights into corresponding evolutionary problems in all of them. We hence conclude that this treespace is a promising candidate to serve as a foundation for developing statistics over phylogenetic trees analytically, as well as new computational tools for evolutionary data analyses.
AVAILABILITY AND IMPLEMENTATION: An implementation is available at https://github.com/bioDS/Centroid-Code.
Affiliation(s)
- Lars Berling
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
- Lena Collienne
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
- Alex Gavryushkin
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
2. Teichman S, Lee MD, Willis AD. Analyzing microbial evolution through gene and genome phylogenies. Biostatistics 2024; 25:786-800. [PMID: 37897441 PMCID: PMC11247178 DOI: 10.1093/biostatistics/kxad025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 08/15/2023] [Accepted: 08/27/2023] [Indexed: 10/30/2023] Open
Abstract
Microbiome scientists critically need modern tools to explore and analyze microbial evolution. Often this involves studying the evolution of microbial genomes as a whole. However, different genes in a single genome can be subject to different evolutionary pressures, which can result in distinct gene-level evolutionary histories. To address this challenge, we propose to treat estimated gene-level phylogenies as data objects, and present an interactive method for the analysis of a collection of gene phylogenies. We use a local linear approximation of phylogenetic tree space to visualize estimated gene trees as points in low-dimensional Euclidean space, and address important practical limitations of existing related approaches, allowing an intuitive visualization of complex data objects. We demonstrate the utility of our proposed approach through microbial data analyses, including by identifying outlying gene histories in strains of Prevotella, and by contrasting Streptococcus phylogenies estimated using different gene sets. Our method is available as an open-source R package, and assists with estimating, visualizing, and interacting with a collection of bacterial gene phylogenies.
Affiliation(s)
- Sarah Teichman
  - University of Washington Department of Statistics, Box 354322, Seattle, WA 98195-4322, USA
- Michael D Lee
  - KBR NASA Ames Research Center, PO Box 1, Moffett Field, CA 94035-1000
  - Blue Marble Space Institute of Science, 600 1st Avenue, 1st Floor, Seattle, WA 98104, USA
- Amy D Willis
  - University of Washington Department of Biostatistics, Hans Rosling Center for Population Health, Box 351617, Seattle, WA 98195-1617, USA
3. Samyak R, Palacios JA. Statistical summaries of unlabelled evolutionary trees. Biometrika 2024; 111:171-193. [PMID: 38352626 PMCID: PMC10861027 DOI: 10.1093/biomet/asad025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2021] [Indexed: 02/16/2024] Open
Abstract
Rooted and ranked phylogenetic trees are mathematical objects that are useful in modelling hierarchical data and evolutionary relationships with applications to many fields such as evolutionary biology and genetic epidemiology. Bayesian phylogenetic inference usually explores the posterior distribution of trees via Markov chain Monte Carlo methods. However, assessing uncertainty and summarizing distributions remains challenging for these types of structures. While labelled phylogenetic trees have been extensively studied, relatively less literature exists for unlabelled trees that are increasingly useful, for example when one seeks to summarize samples of trees obtained with different methods, or from different samples and environments, and wishes to assess the stability and generalizability of these summaries. In our paper, we exploit recently proposed distance metrics of unlabelled ranked binary trees and unlabelled ranked genealogies, or trees equipped with branch lengths, to define the Fréchet mean, variance and interquartile sets as summaries of these tree distributions. We provide an efficient combinatorial optimization algorithm for computing the Fréchet mean of a sample or of distributions on unlabelled ranked tree shapes and unlabelled ranked genealogies. We show the applicability of our summary statistics for studying popular tree distributions and for comparing the SARS-CoV-2 evolutionary trees across different locations during the COVID-19 epidemic in 2020. Our current implementations are publicly available at https://github.com/RSamyak/fmatrix.
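The sample Fréchet mean named in this abstract is the point that minimizes the average squared distance to the sampled items, and the minimum value is the Fréchet variance. The following is a minimal sketch of that definition only, not the paper's combinatorial optimization algorithm. It assumes a precomputed matrix of pairwise distances and restricts the search to the sampled items themselves. The function name `restricted_frechet_mean` and the toy distances are hypothetical.

```python
from typing import Sequence, Tuple


def restricted_frechet_mean(dist: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, variance) of the sample item minimizing the mean squared distance.

    `dist` is an assumed precomputed symmetric matrix of pairwise distances
    between sampled trees; any tree metric could supply it.
    """
    n = len(dist)
    if n == 0:
        raise ValueError("need at least one sampled item")
    best_index, best_value = 0, float("inf")
    for i in range(n):
        # Frechet functional at candidate i: average squared distance to the sample.
        value = sum(dist[i][j] ** 2 for j in range(n)) / n
        if value < best_value:  # strict comparison keeps the first minimizer on ties
            best_index, best_value = i, value
    return best_index, best_value


# Toy distances between four items (assumed values, not data from the paper).
toy = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
]
mean_index, frechet_variance = restricted_frechet_mean(toy)
```

Restricting candidates to the sample keeps the sketch independent of any particular tree representation. A minimizer over the whole tree space can have a smaller objective value than the one found here.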
Affiliation(s)
- Rajanala Samyak
  - Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, California 94305, U.S.A.
- Julia A Palacios
  - Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, California 94305, U.S.A.
4. Teichman S, Lee MD, Willis AD. Analyzing microbial evolution through gene and genome phylogenies. bioRxiv 2023:2023.08.15.553440. [PMID: 37645842 PMCID: PMC10462103 DOI: 10.1101/2023.08.15.553440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
Microbiome scientists critically need modern tools to explore and analyze microbial evolution. Often this involves studying the evolution of microbial genomes as a whole. However, different genes in a single genome can be subject to different evolutionary pressures, which can result in distinct gene-level evolutionary histories. To address this challenge, we propose to treat estimated gene-level phylogenies as data objects, and present an interactive method for the analysis of a collection of gene phylogenies. We use a local linear approximation of phylogenetic tree space to visualize estimated gene trees as points in low-dimensional Euclidean space, and address important practical limitations of existing related approaches, allowing an intuitive visualization of complex data objects. We demonstrate the utility of our proposed approach through microbial data analyses, including by identifying outlying gene histories in strains of Prevotella, and by contrasting Streptococcus phylogenies estimated using different gene sets. Our method is available as an open-source R package, and assists with estimating, visualizing and interacting with a collection of bacterial gene phylogenies.
Keywords: dimension reduction, microbiome, non-Euclidean, statistical genetics, visualization.
Affiliation(s)
- Michael D Lee
  - NASA Ames Research Center and Blue Marble Space Institute of Science
- Amy D Willis
  - Department of Biostatistics, University of Washington
5. Li M, Park DE, Aziz M, Liu CM, Price LB, Wu Z. Integrating sample similarities into latent class analysis: a tree-structured shrinkage approach. Biometrics 2023; 79:264-279. [PMID: 34658017 PMCID: PMC10642217 DOI: 10.1111/biom.13580] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/25/2021] [Revised: 07/23/2021] [Accepted: 10/05/2021] [Indexed: 11/27/2022]
Abstract
This paper is concerned with using multivariate binary observations to estimate the probabilities of unobserved classes with scientific meanings. We focus on the setting where additional information about sample similarities is available and represented by a rooted weighted tree. Every leaf in the given tree contains multiple samples. Shorter distances over the tree between the leaves indicate a priori higher similarity in class probability vectors. We propose a novel data integrative extension to classical latent class models with tree-structured shrinkage. The proposed approach enables (1) borrowing of information across leaves, (2) estimating data-driven leaf groups with distinct vectors of class probabilities, and (3) individual-level probabilistic class assignment given the observed multivariate binary measurements. We derive and implement a scalable posterior inference algorithm in a variational Bayes framework. Extensive simulations show more accurate estimation of class probabilities than alternatives that suboptimally use the additional sample similarity information. A zoonotic infectious disease application is used to illustrate the proposed approach. The paper concludes with a brief discussion of model limitations and extensions.
Affiliation(s)
- Mengbing Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Daniel E. Park
  - Environmental and Occupational Health, Milken Institute School of Public Health, The George Washington University, Washington, District of Columbia, USA
- Maliha Aziz
  - Environmental and Occupational Health, Milken Institute School of Public Health, The George Washington University, Washington, District of Columbia, USA
- Cindy M. Liu
  - Environmental and Occupational Health, Milken Institute School of Public Health, The George Washington University, Washington, District of Columbia, USA
- Lance B. Price
  - Environmental and Occupational Health, Milken Institute School of Public Health, The George Washington University, Washington, District of Columbia, USA
- Zhenke Wu
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
  - Michigan Institute for Data Science (MIDAS), University of Michigan, Ann Arbor, Michigan, USA
6. Cholaquidis A, Fraiman R, Gamboa F, Moreno L. Weighted lens depth: Some applications to supervised classification. Can J Stat 2022. [DOI: 10.1002/cjs.11724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Affiliation(s)
- Ricardo Fraiman
  - Facultad de Ciencias, Universidad de la República, Montevideo 11400, Uruguay
- Fabrice Gamboa
  - Institut de Mathématiques de Toulouse, Toulouse 31400, France
- Leonardo Moreno
  - Facultad de Ciencias Económicas, Universidad de la República, Montevideo 112002, Uruguay
7. Smith MR. Robust Analysis of Phylogenetic Tree Space. Syst Biol 2022; 71:1255-1270. [PMID: 34963003 PMCID: PMC9366458 DOI: 10.1093/sysbio/syab100] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/02/2020] [Revised: 12/03/2021] [Accepted: 12/23/2021] [Indexed: 11/13/2022] Open
Abstract
Phylogenetic analyses often produce large numbers of trees. Mapping trees' distribution in "tree space" can illuminate the behavior and performance of search strategies, reveal distinct clusters of optimal trees, and expose differences between different data sources or phylogenetic methods, but the high-dimensional spaces defined by metric distances are necessarily distorted when represented in fewer dimensions. Here, I explore the consequences of this transformation in phylogenetic search results from 128 morphological data sets, using stratigraphic congruence (a complementary aspect of tree similarity) to evaluate the utility of low-dimensional mappings. I find that phylogenetic similarities between cladograms are most accurately depicted in tree spaces derived from information-theoretic tree distances or the quartet distance. Robinson-Foulds tree spaces exhibit prominent distortions and often fail to group trees according to phylogenetic similarity, whereas the strong influence of tree shape on the Kendall-Colijn distance makes its tree space unsuitable for many purposes. Distances mapped into two or even three dimensions often display little correspondence with true distances, which can lead to profound misrepresentation of clustering structure. Without explicit testing, one cannot be confident that a tree space mapping faithfully represents the true distribution of trees, nor that visually evident structure is valid. My recommendations for tree space validation and visualization are implemented in a new graphical user interface in the "TreeDist" R package. [Multidimensional scaling; phylogenetic software; tree distance metrics; treespace projections.]
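One simple way to test how well a low-dimensional mapping preserves distances is to correlate the original pairwise distances with the Euclidean distances between the mapped points. The sketch below is an illustrative stand-in, not necessarily one of the validation measures the paper recommends, and it does not use the TreeDist package. The function name `mapping_correlation` and all toy values are assumptions.

```python
import math
from itertools import combinations
from typing import Sequence


def mapping_correlation(
    true_dist: Sequence[Sequence[float]],
    coords: Sequence[Sequence[float]],
) -> float:
    """Pearson correlation between original and mapped pairwise distances."""
    pairs = list(combinations(range(len(coords)), 2))
    original = [true_dist[i][j] for i, j in pairs]
    # Euclidean distance between the mapped points of each pair.
    mapped = [math.dist(coords[i], coords[j]) for i, j in pairs]
    mean_o = sum(original) / len(original)
    mean_m = sum(mapped) / len(mapped)
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(original, mapped))
    var_o = sum((o - mean_o) ** 2 for o in original)
    var_m = sum((m - mean_m) ** 2 for m in mapped)
    if var_o == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(var_o * var_m)


# Assumed toy distances between four trees (not data from the paper).
true_dist = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
]
faithful = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
distorted = [(0.0, 0.0), (3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

r_faithful = mapping_correlation(true_dist, faithful)
r_distorted = mapping_correlation(true_dist, distorted)
```

A correlation near 1 suggests the mapped distances track the original ones. A low or negative value signals that visible structure in the plot may not reflect the underlying distances.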
Affiliation(s)
- Martin R Smith
  - Department of Earth Sciences, Durham University, Durham, UK
8. Weiskopf D. Uncertainty Visualization: Concepts, Methods, and Applications in Biological Data Visualization. Frontiers in Bioinformatics 2022; 2:793819. [PMID: 36304261 PMCID: PMC9580861 DOI: 10.3389/fbinf.2022.793819] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/12/2021] [Accepted: 01/14/2022] [Indexed: 11/23/2022] Open
Abstract
This paper provides an overview of uncertainty visualization in general, along with specific examples of applications in bioinformatics. Starting from a processing and interaction pipeline of visualization, components are discussed that are relevant for handling and visualizing uncertainty introduced with the original data and at later stages in the pipeline, which shows the importance of making the stages of the pipeline aware of uncertainty and allowing them to propagate uncertainty. We detail concepts and methods for visual mappings of uncertainty, distinguishing between explicit and implicit representations of distributions, different ways to show summary statistics, and combined or hybrid visualizations. The basic concepts are illustrated for several examples of graph visualization under uncertainty. Finally, this review paper discusses implications for the visualization of biological data and future research directions.
9
Abstract
Genealogical tree modeling is essential for estimating evolutionary parameters in population genetics and phylogenetics. Recent mathematical results concerning ranked genealogies without leaf labels unlock opportunities in the analysis of evolutionary trees. In particular, comparisons between ranked genealogies facilitate the study of evolutionary processes of different organisms sampled at multiple time periods. We propose metrics on ranked tree shapes and ranked genealogies for lineages isochronously and heterochronously sampled. Our proposed tree metrics make it possible to conduct statistical analyses of ranked tree shapes and timed ranked tree shapes or ranked genealogies. Such analyses allow us to assess differences in tree distributions, quantify estimation uncertainty, and summarize tree distributions. We show the utility of our metrics via simulations and an application in infectious diseases.
Affiliation(s)
- Jaehee Kim
  - Department of Biology, Stanford University, Stanford, CA 94305
- Julia A Palacios
  - Department of Statistics, Stanford University, Stanford, CA 94305
  - Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA 94305