1
Berling L, Collienne L, Gavryushkin A. Estimating the mean in the space of ranked phylogenetic trees. Bioinformatics 2024; 40:btae514. [PMID: 39177090 PMCID: PMC11364146 DOI: 10.1093/bioinformatics/btae514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Revised: 05/16/2024] [Accepted: 08/21/2024] [Indexed: 08/24/2024]
Abstract
MOTIVATION Reconstructing evolutionary histories of biological entities, such as genes, cells, organisms, populations, and species, from phenotypic and molecular sequencing data is central to many biological, palaeontological, and biomedical disciplines. Typically, due to uncertainties and incompleteness in data, the true evolutionary history (phylogeny) is challenging to estimate. Statistical modelling approaches address this problem by introducing and studying probability distributions over all possible evolutionary histories, but can also introduce uncertainties due to misspecification. In practice, computational methods are deployed to learn those distributions, typically by sampling them. This approach, however, is fundamentally challenging as it requires designing and implementing various statistical methods over a space of phylogenetic trees (or treespace). Although the problem of developing statistics over a treespace has received substantial attention in the literature and numerous breakthroughs have been made, it remains largely unsolved. The challenge of solving this problem is 2-fold: a treespace has nontrivial, often counter-intuitive geometry, implying that much of classical Euclidean statistics does not immediately apply; and many parametrizations of treespace with promising statistical properties are computationally hard, so they cannot be used in data analyses. As a result, there is no single conventional method for estimating even the most fundamental statistics over any treespace, such as mean and variance, and various heuristics are used in practice. Despite the existence of numerous tree summary methods to approximate means of probability distributions over a treespace based on its geometry, and the theoretical promise of this idea, none of the attempts resulted in a practical method for summarizing tree samples.
RESULTS In this paper, we present a tree summary method along with useful properties of our chosen treespace while focusing on its impact on phylogenetic analyses of real datasets. We perform an extensive benchmark study and demonstrate that our method outperforms the currently most popular methods with respect to a number of important 'quality' statistics. Further, we apply our method to three empirical datasets ranging from cancer evolution to linguistics and find novel insights into the corresponding evolutionary problems in all of them. We hence conclude that this treespace is a promising candidate to serve as a foundation for developing statistics over phylogenetic trees analytically, as well as new computational tools for evolutionary data analyses. AVAILABILITY AND IMPLEMENTATION An implementation is available at https://github.com/bioDS/Centroid-Code.
Affiliation(s)
- Lars Berling
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
- Lena Collienne
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
- Alex Gavryushkin
  - Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
2
Teichman S, Lee MD, Willis AD. Analyzing microbial evolution through gene and genome phylogenies. Biostatistics 2024; 25:786-800. [PMID: 37897441 PMCID: PMC11247178 DOI: 10.1093/biostatistics/kxad025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 08/15/2023] [Accepted: 08/27/2023] [Indexed: 10/30/2023]
Abstract
Microbiome scientists critically need modern tools to explore and analyze microbial evolution. Often this involves studying the evolution of microbial genomes as a whole. However, different genes in a single genome can be subject to different evolutionary pressures, which can result in distinct gene-level evolutionary histories. To address this challenge, we propose to treat estimated gene-level phylogenies as data objects, and present an interactive method for the analysis of a collection of gene phylogenies. We use a local linear approximation of phylogenetic tree space to visualize estimated gene trees as points in low-dimensional Euclidean space, and address important practical limitations of existing related approaches, allowing an intuitive visualization of complex data objects. We demonstrate the utility of our proposed approach through microbial data analyses, including by identifying outlying gene histories in strains of Prevotella, and by contrasting Streptococcus phylogenies estimated using different gene sets. Our method is available as an open-source R package, and assists with estimating, visualizing, and interacting with a collection of bacterial gene phylogenies.
Affiliation(s)
- Sarah Teichman
  - University of Washington Department of Statistics, Box 354322, Seattle, WA 98195-4322, USA
- Michael D Lee
  - KBR NASA Ames Research Center, PO Box 1, Moffett Field, CA 94035-1000
  - Blue Marble Space Institute of Science, 600 1st Avenue, 1st Floor, Seattle, WA 98104, USA
- Amy D Willis
  - University of Washington Department of Biostatistics, Hans Rosling Center for Population Health, Box 351617, Seattle, WA 98195-1617, USA
3
Smith MR. Robust Analysis of Phylogenetic Tree Space. Syst Biol 2022; 71:1255-1270. [PMID: 34963003 PMCID: PMC9366458 DOI: 10.1093/sysbio/syab100] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/02/2020] [Revised: 12/03/2021] [Accepted: 12/23/2021] [Indexed: 11/13/2022]
Abstract
Phylogenetic analyses often produce large numbers of trees. Mapping trees' distribution in "tree space" can illuminate the behavior and performance of search strategies, reveal distinct clusters of optimal trees, and expose differences between different data sources or phylogenetic methods, but the high-dimensional spaces defined by metric distances are necessarily distorted when represented in fewer dimensions. Here, I explore the consequences of this transformation in phylogenetic search results from 128 morphological data sets, using stratigraphic congruence (a complementary aspect of tree similarity) to evaluate the utility of low-dimensional mappings. I find that phylogenetic similarities between cladograms are most accurately depicted in tree spaces derived from information-theoretic tree distances or the quartet distance. Robinson-Foulds tree spaces exhibit prominent distortions and often fail to group trees according to phylogenetic similarity, whereas the strong influence of tree shape on the Kendall-Colijn distance makes its tree space unsuitable for many purposes. Distances mapped into two or even three dimensions often display little correspondence with true distances, which can lead to profound misrepresentation of clustering structure. Without explicit testing, one cannot be confident that a tree space mapping faithfully represents the true distribution of trees, nor that visually evident structure is valid. My recommendations for tree space validation and visualization are implemented in a new graphical user interface in the "TreeDist" R package. [Multidimensional scaling; phylogenetic software; tree distance metrics; treespace projections.]
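The kind of check this abstract argues for can be sketched in a few lines. The example below is a minimal illustration, not the TreeDist implementation: it maps a matrix of pairwise distances into `k` dimensions with classical multidimensional scaling and then compares mapped distances against the original ones. The function names are hypothetical, and the Pearson correlation used here is an assumed stand-in for whatever mapping-quality measures the paper actually recommends.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Embed n objects in k dimensions from an n x n symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

def pairwise_euclidean(points):
    """Matrix of Euclidean distances between the rows of `points`."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mapping_correlation(dist, coords):
    """Pearson correlation between original and mapped pairwise distances."""
    upper = np.triu_indices(len(coords), k=1)
    original = np.asarray(dist, dtype=float)[upper]
    mapped = pairwise_euclidean(coords)[upper]
    return float(np.corrcoef(original, mapped)[0, 1])
```

With a tree-distance matrix in place of `dist`, a correlation well below 1 would signal that the low-dimensional picture distorts the original distances and should not be read at face value.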
Affiliation(s)
- Martin R Smith
  - Department of Earth Sciences, Durham University, Durham, UK
4
Wu X, Zhu H. Association testing for binary trees: A Markov branching process approach. Stat Med 2022; 41:2557-2573. [PMID: 35262202 PMCID: PMC9311163 DOI: 10.1002/sim.9370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2021] [Revised: 01/28/2022] [Accepted: 02/22/2022] [Indexed: 11/29/2022]
Abstract
We propose a new approach to test associations between binary trees and covariates. In this approach, binary-tree structured data are treated as sample paths of binary fission Markov branching processes (bMBP). We propose a generalized linear regression model and develop inference procedures for association testing, including variable selection and estimation of covariate effects. Simulation studies show that these procedures are able to accurately identify covariates that are associated with the binary tree structure by impacting the rate parameter of the bMBP. The problem of association testing on binary trees is motivated by modeling hierarchical clustering dendrograms of pixel intensities in biomedical images. By using semi-synthetic data generated from a real brain-tumor image, our simulation studies show that the bMBP model is able to capture the characteristics of dendrogram trees in brain-tumor images. Our final analysis of the glioblastoma multiforme brain-tumor data from The Cancer Imaging Archive identified multiple clinical and genetic variables that are potentially associated with brain-tumor heterogeneity.
Affiliation(s)
- Xiaowei Wu
  - Department of Statistics, Virginia Tech, Blacksburg, Virginia, USA
- Hongxiao Zhu
  - Department of Statistics, Virginia Tech, Blacksburg, Virginia, USA
5
Baczyński J, Sauquet H, Spalik K. Exceptional evolutionary lability of flower-like inflorescences (pseudanthia) in Apiaceae subfamily Apioideae. Am J Bot 2022; 109:437-455. [PMID: 35112711 PMCID: PMC9310750 DOI: 10.1002/ajb2.1819] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/27/2021] [Revised: 12/19/2021] [Accepted: 12/22/2021] [Indexed: 06/14/2023]
Abstract
PREMISE Pseudanthia are widespread and have long been postulated to be a key innovation responsible for some of the angiosperm radiations. The aim of our study was to analyze macroevolutionary patterns of these flower-like inflorescences and their potential correlation with diversification rates in Apiaceae subfamily Apioideae. In particular, we were interested to investigate evolvability of pseudanthia and evaluate their potential association with changes in the size of floral display. METHODS The framework for our analyses consisted of a time-calibrated phylogeny of 1734 representatives of Apioideae and a morphological matrix of inflorescence traits encoded for 847 species. Macroevolutionary patterns in pseudanthia were inferred using Markov models of discrete character evolution and stochastic character mapping, and a principal component analysis was used to visualize correlations in inflorescence architecture. The interdependence between net diversification rates and the occurrence of pseudocorollas was analyzed with trait-independent and trait-dependent approaches. RESULTS Pseudanthia evolved in 10 major clades of Apioideae with at least 36 independent origins and 46 reversals. The morphospace analysis recovered differences in color and compactness between floral and hyperfloral pseudanthia. A correlation between pseudocorollas and size of inflorescence was also strongly supported. Contrary to our predictions, pseudanthia are not responsible for variation in diversification rates identified in this subfamily. CONCLUSIONS Our results suggest that pseudocorollas evolve as an answer to the trade-off between enlargement of floral display and costs associated with production of additional flowers. The high evolvability and architectural differences in apioid pseudanthia may be explained on the basis of adaptive wandering and evolutionary developmental biology.
Affiliation(s)
- Jakub Baczyński
  - Institute of Evolutionary Biology, Faculty of Biology, University of Warsaw Biological and Chemical Research Centre, Warsaw, Poland
- Hervé Sauquet
  - National Herbarium of New South Wales (NSW), Royal Botanic Gardens and Domain Trust, Sydney, NSW 2000, Australia
  - Evolution and Ecology Research Centre, School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, Australia
- Krzysztof Spalik
  - Institute of Evolutionary Biology, Faculty of Biology, University of Warsaw Biological and Chemical Research Centre, Warsaw, Poland
6
Abstract
We propose a new space of phylogenetic trees which we call wald space. The motivation is to develop a space suitable for statistical analysis of phylogenies, but with a geometry based on more biologically principled assumptions than existing spaces: in wald space, trees are close if they induce similar distributions on genetic sequence data. As a point set, wald space contains the previously developed Billera–Holmes–Vogtmann (BHV) tree space; it also contains disconnected forests, like the edge-product (EP) space but without certain singularities of the EP space. We investigate two related geometries on wald space. The first is the geometry of the Fisher information metric of character distributions induced by the two-state symmetric Markov substitution process on each tree. Infinitesimally, the metric is proportional to the Kullback–Leibler divergence, or equivalently, as we show, to any f-divergence. The second geometry is obtained analogously but using a related continuous-valued Gaussian process on each tree, and it can be viewed as the trace metric of the affine-invariant metric for covariance matrices. We derive a gradient descent algorithm to project from the ambient space of covariance matrices to wald space. For both geometries we derive computational methods to compute geodesics in polynomial time and show numerically that the two information geometries (discrete and continuous) are very similar. In particular, geodesics are approximated extrinsically. Comparison with the BHV geometry shows that our canonical and biologically motivated space is substantially different.
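For readers unfamiliar with the affine-invariant metric mentioned in this abstract, the following is a minimal sketch of the standard affine-invariant distance between two symmetric positive definite (covariance) matrices, d(A, B) = ||log(A^{-1/2} B A^{-1/2})||_F. It illustrates only that ambient-space distance; it is not the wald space geodesic computation, and the function name is hypothetical.

```python
import numpy as np

def affine_invariant_distance(a, b):
    """Affine-invariant distance between symmetric positive definite matrices."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    chol = np.linalg.cholesky(a)              # a = chol @ chol.T
    half = np.linalg.solve(chol, b)           # chol^{-1} b
    whitened = np.linalg.solve(chol, half.T)  # chol^{-1} b chol^{-T}
    whitened = 0.5 * (whitened + whitened.T)  # remove rounding asymmetry
    # The eigenvalues of the whitened matrix equal those of a^{-1} b.
    eigvals = np.linalg.eigvalsh(whitened)
    return float(np.sqrt(np.sum(np.log(eigvals) ** 2)))
```

The name reflects the invariance d(G A Gᵀ, G B Gᵀ) = d(A, B) for any invertible matrix G, which can be checked numerically on small examples.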
7
Page R, Yoshida R, Zhang L. Tropical principal component analysis on the space of phylogenetic trees. Bioinformatics 2020; 36:4590-4598. [PMID: 32516398 DOI: 10.1093/bioinformatics/btaa564] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/27/2019] [Revised: 05/29/2020] [Accepted: 06/03/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Due to new technology for efficiently generating genome data, machine learning methods are urgently needed to analyze large sets of gene trees over the space of phylogenetic trees. However, the space of phylogenetic trees is not Euclidean, so ordinary machine learning methods cannot be directly applied. In 2019, Yoshida et al. introduced the notion of tropical principal component analysis (PCA), a statistical method for visualization and dimensionality reduction using a tropical polytope with a fixed number of vertices that minimizes the sum of tropical distances between each data point and its tropical projection. However, their work focused on the tropical projective space rather than the space of phylogenetic trees. We focus here on tropical PCA for dimension reduction and visualization over the space of phylogenetic trees. RESULTS Our main results are 2-fold: (i) theoretical interpretations of the tropical principal components over the space of phylogenetic trees, namely, the existence of a tropical cell decomposition into regions of fixed tree topology; and (ii) the development of a stochastic optimization method to estimate tropical PCs over the space of phylogenetic trees using a Markov Chain Monte Carlo approach. This method performs well with simulation studies, and it is applied to three empirical datasets: Apicomplexa and African coelacanth genomes as well as sequences of hemagglutinin for influenza from New York. AVAILABILITY AND IMPLEMENTATION Dataset: http://polytopes.net/Data.tar.gz. Code: http://polytopes.net/tropica_MCMC_codes.tar.gz. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
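The tropical distance that tropical PCA sums over can be illustrated briefly. The sketch below assumes the usual tropical metric on the tropical projective space, d_tr(u, v) = max_i(u_i − v_i) − min_i(u_i − v_i), and assumes each tree is encoded as a vector of pairwise leaf-to-leaf distances; the function name and the example vectors are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def tropical_distance(u, v):
    """Tropical metric: max_i(u_i - v_i) - min_i(u_i - v_i)."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(diff.max() - diff.min())

# Two hypothetical trees on three leaves, encoded as the pairwise
# leaf distances (d_12, d_13, d_23).
tree_a = [2.0, 4.0, 4.0]
tree_b = [4.0, 4.0, 2.0]
gap = tropical_distance(tree_a, tree_b)
```

Adding the same constant to every coordinate of one vector leaves the distance unchanged, which is why the metric lives on the projective space rather than on ordinary Euclidean space.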
Affiliation(s)
- Robert Page
  - Department of Operations Research, Naval Postgraduate School, Monterey, CA 93943, USA
- Ruriko Yoshida
  - Department of Operations Research, Naval Postgraduate School, Monterey, CA 93943, USA
- Leon Zhang
  - Department of Mathematics, University of California, Berkeley, Berkeley, CA 94720, USA
8
9
Brown DG, Owen M. Mean and Variance of Phylogenetic Trees. Syst Biol 2019; 69:139-154. [DOI: 10.1093/sysbio/syz041] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/02/2017] [Revised: 05/13/2019] [Accepted: 05/24/2019] [Indexed: 11/13/2022]
Abstract
We describe the use of the Fréchet mean and variance in the Billera–Holmes–Vogtmann (BHV) treespace to summarize and explore the diversity of a set of phylogenetic trees. We show that the Fréchet mean is comparable to other summary methods, and, despite its stickiness property, is more likely to be binary than the majority-rule consensus tree. We show that the Fréchet variance is faster and more precise than commonly used variance measures. The Fréchet mean and variance are more theoretically justified, and more robust, than previous estimates of this type and can be estimated reasonably efficiently, providing a foundation for building more advanced statistical methods and leading to applications such as mean hypothesis testing and outlier detection.
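For reference, the two quantities named in this abstract have a standard general definition in any metric space. The notation below is assumed rather than taken from the paper: $T_1, \dots, T_n$ are the sampled trees, $d$ is the BHV geodesic distance, and the $1/n$ normalisation of the variance is one common convention (others divide by $n-1$ or omit the factor).

$$
\bar{T} = \operatorname*{arg\,min}_{T} \sum_{i=1}^{n} d(T, T_i)^2,
\qquad
\widehat{\operatorname{Var}} = \frac{1}{n} \sum_{i=1}^{n} d(\bar{T}, T_i)^2 .
$$

In Euclidean space these reduce to the arithmetic mean and the total sample variance; in BHV treespace the minimiser is unique because the space is non-positively curved.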
Affiliation(s)
- Daniel G Brown
  - David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave. W, Waterloo ON N2L 3G1, Canada
- Megan Owen
  - Department of Mathematics, Lehman College, City University of New York, 250 Bedford Park Blvd West, Bronx, New York, NY 10468, USA
10
Schötz C. Convergence rates for the generalized Fréchet mean via the quadruple inequality. Electron J Stat 2019. [DOI: 10.1214/19-ejs1618] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/19/2022]
11
Affiliation(s)
- Amy Willis
  - Department of Biostatistics, University of Washington, Seattle, WA
12
Dinh V, Tung Ho LS, Suchard MA, Matsen FA. Consistency and convergence rate of phylogenetic inference via regularization. Ann Stat 2018; 46:1481-1512. [PMID: 30344357 DOI: 10.1214/17-aos1592] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
Abstract
It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct "gene tree." Although the gene tree may deviate from the "species tree" due to a variety of genetic processes, in the absence of evidence to the contrary it is parsimonious to assume that they agree. A common statistical approach in these situations is to develop a likelihood penalty to incorporate such additional information. Recent studies using simulation and empirical data suggest that a likelihood penalty quantifying concordance with a species tree can significantly improve the accuracy of gene tree reconstruction compared to using sequence data alone. However, the consistency of such an approach has not yet been established, nor have convergence rates been bounded. Because phylogenetics is a non-standard inference problem, the standard theory does not apply. In this paper, we propose a penalized maximum likelihood estimator for gene tree reconstruction, where the penalty is the square of the Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species tree. We prove that this method is consistent, and derive its convergence rate for estimating the discrete gene tree structure and continuous edge lengths (representing the amount of evolution that has occurred on that branch) simultaneously. We find that the regularized estimator is "adaptive fast converging," meaning that it can reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length. Our method does not require the species tree to be known exactly; in fact, our asymptotic theory holds for any such guide tree.
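The estimator described in this abstract can be written schematically as follows. This is a sketch under assumed notation, not the paper's exact formulation: $\ell(T)$ denotes the log-likelihood of the gene sequences given a candidate gene tree $T$, $T_0$ is the guide (species) tree, $d_{\mathrm{BHV}}$ is the Billera-Holmes-Vogtmann geodesic distance, and $\lambda > 0$ is a hypothetical regularisation weight whose scaling with sequence length is not stated in the abstract.

$$
\hat{T} = \operatorname*{arg\,max}_{T} \left\{ \ell(T) - \lambda \, d_{\mathrm{BHV}}(T, T_0)^2 \right\}
$$

The squared-distance penalty pulls the estimate towards the guide tree when the sequence data are uninformative, while a strong likelihood signal can still override it.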
Affiliation(s)
- Vu Dinh
  - Program in Computational Biology, Fred Hutchinson Cancer Research Center
- Lam Si Tung Ho
  - Department of Biostatistics, University of California, Los Angeles
- Marc A Suchard
  - Departments of Biomathematics, Biostatistics and Human Genetics, University of California, Los Angeles
13
Affiliation(s)
- Amy Willis
  - Department of Biostatistics, University of Washington, Seattle, WA
- Rayna Bell
  - Smithsonian Institution, National Museum of Natural History, Washington, DC
14
Liebscher V. New Gromov-Inspired Metrics on Phylogenetic Tree Space. Bull Math Biol 2018; 80:493-518. [DOI: 10.1007/s11538-017-0385-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2017] [Accepted: 12/19/2017] [Indexed: 11/29/2022]
15
Jombart T, Kendall M, Almagro-Garcia J, Colijn C. treespace: Statistical exploration of landscapes of phylogenetic trees. Mol Ecol Resour 2017; 17:1385-1392. [PMID: 28374552 PMCID: PMC5724650 DOI: 10.1111/1755-0998.12676] [Citation(s) in RCA: 92] [Impact Index Per Article: 13.1] [Received: 11/22/2016] [Revised: 03/17/2017] [Accepted: 03/21/2017] [Indexed: 01/01/2023]
Abstract
The increasing availability of large genomic data sets as well as the advent of Bayesian phylogenetics facilitates the investigation of phylogenetic incongruence, which can result in the impossibility of representing phylogenetic relationships using a single tree. While sometimes considered as a nuisance, phylogenetic incongruence can also reflect meaningful biological processes as well as relevant statistical uncertainty, both of which can yield valuable insights in evolutionary studies. We introduce a new tool for investigating phylogenetic incongruence through the exploration of phylogenetic tree landscapes. Our approach, implemented in the R package treespace, combines tree metrics and multivariate analysis to provide low-dimensional representations of the topological variability in a set of trees, which can be used for identifying clusters of similar trees and group-specific consensus phylogenies. treespace also provides a user-friendly web interface for interactive data analysis and is integrated alongside existing standards for phylogenetics. It fills a gap in the current phylogenetics toolbox in R and will facilitate the investigation of phylogenetic results.
Affiliation(s)
- Thibaut Jombart
  - Department of Infectious Disease Epidemiology, MRC Centre for Outbreak Analysis and Modelling, School of Public Health, Imperial College London, London, UK
16
Nye TMW, Tang X, Weyenberg G, Yoshida R. Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees. Biometrika 2017; 104:901-922. [PMID: 29422694 PMCID: PMC5793493 DOI: 10.1093/biomet/asx047] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 09/10/2016] [Indexed: 11/13/2022]
Abstract
Evolutionary relationships are represented by phylogenetic trees, and a phylogenetic analysis of gene sequences typically produces a collection of these trees, one for each gene in the analysis. Analysis of samples of trees is difficult due to the multi-dimensionality of the space of possible trees. In Euclidean spaces, principal component analysis is a popular method of reducing high-dimensional data to a low-dimensional representation that preserves much of the sample’s structure. However, the space of all phylogenetic trees on a fixed set of species does not form a Euclidean vector space, and methods adapted to tree space are needed. Previous work introduced the notion of a principal geodesic in this space, analogous to the first principal component. Here we propose a geometric object for tree space similar to the $k$th principal component in Euclidean space: the locus of the weighted Fréchet mean of $k+1$ vertex trees when the weights vary over the $k$-simplex. We establish some basic properties of these objects, in particular showing that they have dimension $k$, and propose algorithms for projection onto these surfaces and for finding the principal locus associated with a sample of trees. Simulation studies demonstrate that these algorithms perform well, and analyses of two datasets, containing Apicomplexa and African coelacanth genomes respectively, reveal important structure from the second principal components.
Affiliation(s)
- Tom M W Nye
  - School of Mathematics and Statistics, Newcastle University, Newcastle upon Tyne NE1 7RU
- Xiaoxian Tang
  - Department of Mathematics, Texas A&M University, College Station, Texas 77843
- Grady Weyenberg
  - Department of Mathematics, University of Hawaii at Hilo, Hilo, Hawaii 96720
- Ruriko Yoshida
  - Department of Operations Research, Naval Postgraduate School, Monterey, California 93943
17
Groisser D, Jung S, Schwartzman A. Geometric foundations for scaling-rotation statistics on symmetric positive definite matrices: Minimal smooth scaling-rotation curves in low dimensions. Electron J Stat 2017. [DOI: 10.1214/17-ejs1250] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
18
St. John K. Review Paper: The Shape of Phylogenetic Treespace. Syst Biol 2017; 66:e83-e94. [PMID: 28173538 PMCID: PMC5837343 DOI: 10.1093/sysbio/syw025] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 11/20/2015] [Revised: 12/16/2015] [Accepted: 03/22/2016] [Indexed: 11/23/2022]
Abstract
Trees are a canonical structure for representing evolutionary histories. Many popular criteria used to infer optimal trees are computationally hard, and the number of possible tree shapes grows super-exponentially in the number of taxa. The underlying structure of the spaces of trees yields rich insights that can improve the search for optimal trees, both in accuracy and in running time, and the analysis and visualization of results. We review the past work on analyzing and comparing trees by their shape as well as recent work that incorporates trees with weighted branch lengths.
Affiliation(s)
- Katherine St. John
  - Department of Mathematics and Computer Science, Lehman College, NY 10034, USA
19
Barden D, Le H, Owen M. Limiting behaviour of Fréchet means in the space of phylogenetic trees. Ann Inst Stat Math 2016. [DOI: 10.1007/s10463-016-0582-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022]
20
Lu N, Miao H. Clustering Tree-Structured Data on Manifold. IEEE Trans Pattern Anal Mach Intell 2016; 38:1956-1968. [PMID: 26660696 PMCID: PMC5027669 DOI: 10.1109/tpami.2015.2505282] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
Tree-structured data usually contain both topological and geometrical information, and are necessarily considered on a manifold instead of Euclidean space for appropriate data parameterization and analysis. In this study, we propose a novel tree-structured data parameterization, called the Topology-Attribute matrix (T-A matrix), so that the data clustering task can be conducted on a matrix manifold. We incorporate the structure constraints embedded in the data into the non-negative matrix factorization method to determine meta-trees from the T-A matrix, and the signature vector of each single tree can then be extracted by meta-tree decomposition. The meta-tree space turns out to be a cone space, in which we explore the distance metric and implement the clustering algorithm based on concepts such as the Fréchet mean. Finally, the T-A matrix based clustering (TAMBAC) framework is evaluated and compared using both simulated data and real retinal images to illustrate its efficiency and accuracy.
Affiliation(s)
- Na Lu
  - State Key Laboratory for Manufacturing Systems Engineering, Systems Engineering Institute, Xi’an Jiaotong University, Xi’an, Shaanxi, China, 710049
- Hongyu Miao
  - Department of Biostatistics, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA, 77030
21
Gavryushkin A, Drummond AJ. The space of ultrametric phylogenetic trees. J Theor Biol 2016; 403:197-208. [PMID: 27188249 DOI: 10.1016/j.jtbi.2016.05.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 09/28/2015] [Revised: 03/17/2016] [Accepted: 05/01/2016] [Indexed: 10/21/2022]
Abstract
The reliability of a phylogenetic inference method from genomic sequence data is ensured by its statistical consistency. Bayesian inference methods produce a sample of phylogenetic trees from the posterior distribution given sequence data. Hence the question of statistical consistency of such methods is equivalent to the consistency of the summary of the sample. More generally, statistical consistency is ensured by the tree space used to analyse the sample. In this paper, we consider two standard parameterisations of phylogenetic time-trees used in evolutionary models: inter-coalescent interval lengths and absolute times of divergence events. For each of these parameterisations we introduce a natural metric space on ultrametric phylogenetic trees. We compare the introduced spaces with existing models of tree space and formulate several formal requirements that a metric space on phylogenetic trees must possess in order to be a satisfactory space for statistical analysis, and justify them. We show that only a few known constructions of the space of phylogenetic trees satisfy these requirements. However, our results suggest that these basic requirements are not enough to distinguish between the two metric spaces we introduce and that the choice between metric spaces requires additional properties to be considered. In particular, the summary tree minimising the square distance to the trees from the sample might be different for different parameterisations. This suggests that further fundamental insight is needed into the problem of statistical consistency of phylogenetic inference methods.
Affiliation(s)
- Alex Gavryushkin: Centre for Computational Evolution, The University of Auckland, New Zealand.
- Alexei J Drummond: Centre for Computational Evolution, The University of Auckland, New Zealand.
|
22
|
Bendich P, Marron JS, Miller E, Pieloch A, Skwerer S. Persistent Homology Analysis of Brain Artery Trees. Ann Appl Stat 2016; 10:198-218. [PMID: 27642379 PMCID: PMC5026243 DOI: 10.1214/15-aoas886] [Citation(s) in RCA: 60] [Impact Index Per Article: 7.5] [Indexed: 11/19/2022]
Abstract
New representations of tree-structured data objects, using ideas from topological data analysis, enable improved statistical analyses of a population of brain artery trees. A number of representations of each data tree arise from persistence diagrams that quantify branching and looping of vessels at multiple scales. Novel approaches to the statistical analysis, through various summaries of the persistence diagrams, lead to heightened correlations with covariates such as age and sex, relative to earlier analyses of this data set. The correlation with age continues to be significant even after controlling for correlations from earlier significant summaries.
Affiliation(s)
- Paul Bendich: Department of Mathematics, Duke University, Durham, North Carolina 27708, USA
- J. S. Marron: Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, North Carolina 27599, USA
- Ezra Miller: Department of Mathematics, Duke University, Durham, North Carolina 27708, USA
- Alex Pieloch: Department of Mathematics, Duke University, Durham, North Carolina 27708, USA
- Sean Skwerer: Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut 06510, USA
|
23
|
|
24
|
Abstract
Motivation: The construction of statistics for summarizing posterior samples returned by a Bayesian phylogenetic study has so far been hindered by the poor geometric insights available into the space of phylogenetic trees. Ad hoc methods such as the derivation of a consensus tree make up for the ill-definition of the usual concepts of posterior mean, while bootstrap methods mitigate the absence of a sound concept of variance. Although they yield satisfactory results with sufficiently concentrated posterior distributions, such methods fall short of providing a faithful summary of posterior distributions if the data do not offer compelling evidence for a single topology. Results: Building upon previous work of Billera et al., summary statistics such as sample mean, median and variance are defined as the Fréchet mean, geometric median and variance, respectively. Their computation is enabled by recently published works and embeds an algorithm for computing shortest paths in the space of trees. In a study of the phylogeny of a set of plants, where several tree topologies occur in the posterior sample, the posterior mean correctly balances the contributions from the different topologies, where a consensus tree would be biased. Comparisons of the posterior mean, median and consensus trees with the ground truth using simulated data also reveal the benefits of a sound averaging method when reconstructing phylogenetic trees. Availability and implementation: We provide two independent implementations of the algorithm for computing Fréchet means, geometric medians and variances in the space of phylogenetic trees. TFBayes: https://github.com/pbenner/tfbayes, TrAP: https://github.com/bacak/TrAP. Contact: philipp.benner@mis.mpg.de
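For orientation, the summary statistics named in this abstract are usually written as below. This is a sketch of the standard textbook definitions, not a transcription of the paper: the symbols $d$ (geodesic distance in the tree space of Billera et al.), $\mathcal{T}$ (the tree space), and $T_1, \dots, T_n$ (the posterior sample) are notation assumed here, and the $1/n$ normalisation of the variance is one common convention among several.

```latex
% Fréchet mean: minimiser of the sum of squared geodesic distances
\hat{\mu} \;=\; \operatorname*{arg\,min}_{x \in \mathcal{T}} \; \sum_{i=1}^{n} d(x, T_i)^2

% Geometric median: minimiser of the sum of (unsquared) geodesic distances
\hat{m} \;=\; \operatorname*{arg\,min}_{x \in \mathcal{T}} \; \sum_{i=1}^{n} d(x, T_i)

% Fréchet (sample) variance: the attained minimum, normalised by n (assumed convention)
\hat{\sigma}^2 \;=\; \frac{1}{n} \sum_{i=1}^{n} d(\hat{\mu}, T_i)^2
```

Both minimisation problems depend on being able to evaluate $d$, which is why the abstract notes that the computation embeds a shortest-path algorithm.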
Affiliation(s)
- Philipp Benner: Max-Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany and Isthmus SARL, 75002 Paris, France
- Miroslav Bačák: Max-Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany and Isthmus SARL, 75002 Paris, France
- Pierre-Yves Bourguignon: Max-Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany and Isthmus SARL, 75002 Paris, France
|
25
|
Quantification and Visualization of Variation in Anatomical Trees. Association for Women in Mathematics Series 2015. [DOI: 10.1007/978-3-319-16348-2_5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/26/2023]
|
26
|
Huckemann S, Mattingly J, Miller E, Nolen J. Sticky central limit theorems at isolated hyperbolic planar singularities. Electron J Probab 2015. [DOI: 10.1214/ejp.v20-3887] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
|
27
|
Nye TMW. An Algorithm for Constructing Principal Geodesics in Phylogenetic Treespace. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:304-315. [PMID: 26355778 DOI: 10.1109/tcbb.2014.2309599] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 06/05/2023]
Abstract
Most phylogenetic analyses result in a sample of trees, but summarizing and visualizing these samples can be challenging. Consensus trees often provide limited information about a sample, and so methods such as consensus networks, clustering and multidimensional scaling have been developed and applied to tree samples. This paper describes a stochastic algorithm for constructing a principal geodesic, or line through treespace, which is analogous to the first principal component in standard principal components analysis. A principal geodesic summarizes the most variable features of a sample of trees, in terms of both tree topology and branch lengths, and it can be visualized as an animation of smoothly changing trees. The algorithm performs a stochastic search through parameter space for a geodesic which minimizes the sum of squared projected distances of the data points. This procedure aims to identify the globally optimal principal geodesic, though convergence to locally optimal geodesics is possible. The methodology is illustrated by constructing principal geodesics for experimental and simulated data sets, demonstrating the insight into samples of trees that can be gained and how the method improves on a previously published approach. A Java package called GeoPhytter for constructing and visualizing principal geodesics is freely available from www.ncl.ac.uk/ ntmwn/geophytter.
|
28
|
Feragen A, Lo P, de Bruijne M, Nielsen M, Lauze F. Toward a theory of statistical tree-shape analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 2013; 35:2008-2021. [PMID: 23267202 DOI: 10.1109/tpami.2012.265] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 06/01/2023]
Abstract
To develop statistical methods for shapes with a tree structure, we construct a shape space framework for tree-shapes and study metrics on the shape space. This shape space has singularities which correspond to topological transitions in the represented trees. We study two closely related metrics on the shape space, TED and QED. QED is a quotient Euclidean distance arising naturally from the shape space formulation, while TED is the classical tree edit distance. Using Gromov's metric geometry, we gain new insight into the geometries defined by TED and QED. We show that the new metric QED has the geometric properties needed for statistical analysis: geodesics always exist and are generically locally unique. Following this, we can also show the existence and generic local uniqueness of average trees for QED. TED, while having some algorithmic advantages, does not share these geometric properties. Along with the theoretical framework we provide experimental proof-of-concept results on synthetic data trees as well as small airway trees from pulmonary CT scans. In this way, we illustrate that our framework has promising theoretical and qualitative properties necessary to build a theory of statistical tree-shape analysis.
Affiliation(s)
- Aasa Feragen: eScience Center, Department of Computer Science, University of Copenhagen, Universitetsparken 5, 2011 Copenhagen, Denmark.
|
29
|
Matsen FA, Evans SN. Edge principal components and squash clustering: using the special structure of phylogenetic placement data for sample comparison. PLoS One 2013; 8:e56859. [PMID: 23505415 PMCID: PMC3594297 DOI: 10.1371/journal.pone.0056859] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.2] [Received: 08/10/2012] [Accepted: 01/16/2013] [Indexed: 01/30/2023] Open
Abstract
Principal components analysis (PCA) and hierarchical clustering are two of the most heavily used techniques for analyzing the differences between nucleic acid sequence samples taken from a given environment. They have led to many insights regarding the structure of microbial communities. We have developed two new complementary methods that leverage how this microbial community data sits on a phylogenetic tree. Edge principal components analysis enables the detection of important differences between samples that contain closely related taxa. Each principal component axis is a collection of signed weights on the edges of the phylogenetic tree, and these weights are easily visualized by a suitable thickening and coloring of the edges. Squash clustering outputs a (rooted) clustering tree in which each internal node corresponds to an appropriate "average" of the original samples at the leaves below the node. Moreover, the length of an edge is a suitably defined distance between the averaged samples associated with the two incident nodes, rather than the less interpretable average of distances produced by UPGMA, the most widely used hierarchical clustering method in this context. We present these methods and illustrate their use with data from the human microbiome.
Affiliation(s)
- Frederick A Matsen: Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America.
|
30
|
Feragen A, Owen M, Petersen J, Wille MMW, Thomsen LH, Dirksen A, de Bruijne M. Tree-space statistics and approximations for large-scale analysis of anatomical trees. Information Processing in Medical Imaging: Proceedings of the ... Conference 2013; 23:74-85. [PMID: 24683959 DOI: 10.1007/978-3-642-38868-2_7] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Indexed: 01/01/2023]
Abstract
Statistical analysis of anatomical trees is hard to perform due to differences in the topological structure of the trees. In this paper we define statistical properties of leaf-labeled anatomical trees with geometric edge attributes by considering the anatomical trees as points in the geometric space of leaf-labeled trees. This tree-space is a geodesic metric space where any two trees are connected by a unique shortest path, which corresponds to a tree deformation. However, tree-space is not a manifold, and the usual strategy of performing statistical analysis in a tangent space and projecting onto tree-space is not available. Using tree-space and its shortest paths, a variety of statistical properties, such as mean, principal component, hypothesis testing and linear discriminant analysis can be defined. For some of these properties it is still an open problem how to compute them; others (like the mean) can be computed, but efficient alternatives are helpful in speeding up algorithms that use means iteratively, like hypothesis testing. In this paper, we take advantage of a very large dataset (N = 8016) to obtain computable approximations, under the assumption that the data trees parametrize the relevant parts of tree-space well. Using the developed approximate statistics, we illustrate how the structure and geometry of airway trees vary across a population and show that airway trees with Chronic Obstructive Pulmonary Disease come from a different distribution in tree-space than healthy ones. Software is available from http://image.diku.dk/aasa/software.php.
|
31
|
|
32
|
Ponciano JM, Burleigh JG, Braun EL, Taper ML. Assessing parameter identifiability in phylogenetic models using data cloning. Syst Biol 2012; 61:955-72. [PMID: 22649181 PMCID: PMC3478565 DOI: 10.1093/sysbio/sys055] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 11/17/2011] [Revised: 02/02/2012] [Accepted: 05/25/2012] [Indexed: 11/14/2022] Open
Abstract
The success of model-based methods in phylogenetics has motivated much research aimed at generating new, biologically informative models. This new computer-intensive approach to phylogenetics demands validation studies and sound measures of performance. To date there has been little practical guidance available as to when and why the parameters in a particular model can be identified reliably. Here, we illustrate how Data Cloning (DC), a recently developed methodology to compute the maximum likelihood estimates along with their asymptotic variance, can be used to diagnose structural parameter nonidentifiability (NI) and distinguish it from other parameter estimability problems, including when parameters are structurally identifiable, but are not estimable in a given data set (INE), and when parameters are identifiable, and estimable, but only weakly so (WE). The application of the DC theorem uses well-known and widely used Bayesian computational techniques. With the DC approach, practitioners can use Bayesian phylogenetics software to diagnose nonidentifiability. Theoreticians and practitioners alike now have a powerful, yet simple tool to detect nonidentifiability while investigating complex modeling scenarios, where getting closed-form expressions in a probabilistic study is complicated. Furthermore, here we also show how DC can be used as a tool to examine and eliminate the influence of the priors, in particular if the process of prior elicitation is not straightforward. Finally, when applied to phylogenetic inference, DC can be used to study at least two important statistical questions: assessing identifiability of discrete parameters, like the tree topology, and developing efficient sampling methods for computationally expensive posterior densities.
|
33
|
Aydın B, Pataki G, Wang H, Ladha A, Bullitt E, Marron JS. New Approaches to Principal Component Analysis for Trees. Statistics in Biosciences 2012. [DOI: 10.1007/s12561-012-9055-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
|