1
|
Berling L, Collienne L, Gavryushkin A. Estimating the mean in the space of ranked phylogenetic trees. Bioinformatics 2024; 40:btae514. [PMID: 39177090 PMCID: PMC11364146 DOI: 10.1093/bioinformatics/btae514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2023] [Revised: 05/16/2024] [Accepted: 08/21/2024] [Indexed: 08/24/2024] Open
Abstract
MOTIVATION Reconstructing evolutionary histories of biological entities, such as genes, cells, organisms, populations, and species, from phenotypic and molecular sequencing data is central to many biological, palaeontological, and biomedical disciplines. Typically, due to uncertainties and incompleteness in data, the true evolutionary history (phylogeny) is challenging to estimate. Statistical modelling approaches address this problem by introducing and studying probability distributions over all possible evolutionary histories, but can also introduce uncertainties due to misspecification. In practice, computational methods are deployed to learn those distributions typically by sampling them. This approach, however, is fundamentally challenging as it requires designing and implementing various statistical methods over a space of phylogenetic trees (or treespace). Although the problem of developing statistics over a treespace has received substantial attention in the literature and numerous breakthroughs have been made, it remains largely unsolved. The challenge of solving this problem is 2-fold: a treespace has nontrivial often counter-intuitive geometry implying that much of classical Euclidean statistics does not immediately apply; many parametrizations of treespace with promising statistical properties are computationally hard, so they cannot be used in data analyses. As a result, there is no single conventional method for estimating even the most fundamental statistics over any treespace, such as mean and variance, and various heuristics are used in practice. Despite the existence of numerous tree summary methods to approximate means of probability distributions over a treespace based on its geometry, and the theoretical promise of this idea, none of the attempts resulted in a practical method for summarizing tree samples. RESULTS In this paper, we present a tree summary method along with useful properties of our chosen treespace while focusing on its impact on phylogenetic analyses of real datasets. We perform an extensive benchmark study and demonstrate that our method outperforms currently most popular methods with respect to a number of important 'quality' statistics. Further, we apply our method to three empirical datasets ranging from cancer evolution to linguistics and find novel insights into corresponding evolutionary problems in all of them. We hence conclude that this treespace is a promising candidate to serve as a foundation for developing statistics over phylogenetic trees analytically, as well as new computational tools for evolutionary data analyses. AVAILABILITY AND IMPLEMENTATION An implementation is available at https://github.com/bioDS/Centroid-Code.
Collapse
Affiliation(s)
- Lars Berling
- Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
| | - Lena Collienne
- Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
| | - Alex Gavryushkin
- Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
| |
Collapse
|
2
|
Collienne L, Whidden C, Gavryushkin A. Ranked Subtree Prune and Regraft. Bull Math Biol 2024; 86:24. [PMID: 38294587 PMCID: PMC10830682 DOI: 10.1007/s11538-023-01244-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2023] [Accepted: 12/06/2023] [Indexed: 02/01/2024]
Abstract
Phylogenetic trees are a mathematical formalisation of evolutionary histories between organisms, species, genes, cancer cells, etc. For many applications, e.g. when analysing virus transmission trees or cancer evolution, (phylogenetic) time trees are of interest, where branch lengths represent times. Computational methods for reconstructing time trees from (typically molecular) sequence data, for example Bayesian phylogenetic inference using Markov Chain Monte Carlo (MCMC) methods, rely on algorithms that sample the treespace. They employ tree rearrangement operations such as [Formula: see text] (Subtree Prune and Regraft) and [Formula: see text] (Nearest Neighbour Interchange) or, in the case of time tree inference, versions of these that take times of internal nodes into account. While the classic [Formula: see text] tree rearrangement is well-studied, its variants for time trees are less understood, limiting comparative analysis for time tree methods. In this paper we consider a modification of the classical [Formula: see text] rearrangement on the space of ranked phylogenetic trees, which are trees equipped with a ranking of all internal nodes. This modification results in two novel treespaces, which we propose to study. We begin this study by discussing algorithmic properties of these treespaces, focusing on those relating to the complexity of computing distances under the ranked [Formula: see text] operations as well as similarities and differences to known tree rearrangement based treespaces. Surprisingly, we show the counterintuitive result that adding leaves to trees can actually decrease their ranked [Formula: see text] distance, which may have an impact on the results of time tree sampling algorithms given uncertain "rogue taxa".
Collapse
Affiliation(s)
- Lena Collienne
- Biological Data Science Laboratory, School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand.
| | - Chris Whidden
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
| | - Alex Gavryushkin
- Biological Data Science Laboratory, School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
- Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand
| |
Collapse
|
3
|
Penn MJ, Scheidwasser N, Penn J, Donnelly CA, Duchêne DA, Bhatt S. Leaping through Tree Space: Continuous Phylogenetic Inference for Rooted and Unrooted Trees. Genome Biol Evol 2023; 15:evad213. [PMID: 38085949 PMCID: PMC10745275 DOI: 10.1093/gbe/evad213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/16/2023] [Indexed: 12/24/2023] Open
Abstract
Phylogenetics is now fundamental in life sciences, providing insights into the earliest branches of life and the origins and spread of epidemics. However, finding suitable phylogenies from the vast space of possible trees remains challenging. To address this problem, for the first time, we perform both tree exploration and inference in a continuous space where the computation of gradients is possible. This continuous relaxation allows for major leaps across tree space in both rooted and unrooted trees, and is less susceptible to convergence to local minima. Our approach outperforms the current best methods for inference on unrooted trees and, in simulation, accurately infers the tree and root in ultrametric cases. The approach is effective in cases of empirical data with negligible amounts of data, which we demonstrate on the phylogeny of jawed vertebrates. Indeed, only a few genes with an ultrametric signal were generally sufficient for resolving the major lineages of vertebrates. Optimization is possible via automatic differentiation and our method presents an effective way forward for exploring the most difficult, data-deficient phylogenetic questions.
Collapse
Affiliation(s)
- Matthew J Penn
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Neil Scheidwasser
- Section of Epidemiology, University of Copenhagen, Copenhagen, Denmark
| | - Joseph Penn
- Department of Physics, University of Oxford, Oxford, United Kingdom
| | - Christl A Donnelly
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Pandemic Sciences Institute, University of Oxford, Oxford, United Kingdom
- Department of Infectious Disease Epidemiology, MRC Centre for Global Infectious Disease Analysis, School of Public Health, Faculty of Medicine, Imperial College London, London, United Kingdom
| | - David A Duchêne
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Samir Bhatt
- Section of Epidemiology, University of Copenhagen, Copenhagen, Denmark
- Department of Infectious Disease Epidemiology, MRC Centre for Global Infectious Disease Analysis, School of Public Health, Faculty of Medicine, Imperial College London, London, United Kingdom
| |
Collapse
|
4
|
Moravec JC, Lanfear R, Spector DL, Diermeier SD, Gavryushkin A. Testing for Phylogenetic Signal in Single-Cell RNA-Seq Data. J Comput Biol 2023; 30:518-537. [PMID: 36475926 PMCID: PMC10125402 DOI: 10.1089/cmb.2022.0357] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
Phylogenetic methods are emerging as a useful tool to understand cancer evolutionary dynamics, including tumor structure, heterogeneity, and progression. Most currently used approaches utilize either bulk whole genome sequencing or single-cell DNA sequencing and are based on calling copy number alterations and single nucleotide variants (SNVs). Single-cell RNA sequencing (scRNA-seq) is commonly applied to explore differential gene expression of cancer cells throughout tumor progression. The method exacerbates the single-cell sequencing problem of low yield per cell with uneven expression levels. This accounts for low and uneven sequencing coverage and makes SNV detection and phylogenetic analysis challenging. In this article, we demonstrate for the first time that scRNA-seq data contain sufficient evolutionary signal and can also be utilized in phylogenetic analyses. We explore and compare results of such analyses based on both expression levels and SNVs called from scRNA-seq data. Both techniques are shown to be useful for reconstructing phylogenetic relationships between cells, reflecting the clonal composition of a tumor. Both standardized expression values and SNVs appear to be equally capable of reconstructing a similar pattern of phylogenetic relationship. This pattern is stable even when phylogenetic uncertainty is taken in account. Our results open up a new direction of somatic phylogenetics based on scRNA-seq data. Further research is required to refine and improve these approaches to capture the full picture of somatic evolutionary dynamics in cancer.
Collapse
Affiliation(s)
- Jiří C. Moravec
- Department of Computer Science, University of Otago, Dunedin, New Zealand
- School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
| | - Robert Lanfear
- Division of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australia
| | | | | | - Alex Gavryushkin
- School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
| |
Collapse
|
5
|
Douglas J, Jiménez-Silva CL, Bouckaert R. OUP accepted manuscript. Syst Biol 2022; 71:901-916. [PMID: 35176772 PMCID: PMC9248896 DOI: 10.1093/sysbio/syac010] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2021] [Revised: 02/01/2022] [Accepted: 02/08/2022] [Indexed: 11/16/2022] Open
Abstract
As genomic sequence data become increasingly available, inferring the phylogeny of the
species as that of concatenated genomic data can be enticing. However, this approach makes
for a biased estimator of branch lengths and substitution rates and an inconsistent
estimator of tree topology. Bayesian multispecies coalescent (MSC) methods address these
issues. This is achieved by constraining a set of gene trees within a species tree and
jointly inferring both under a Bayesian framework. However, this approach comes at the
cost of increased computational demand. Here, we introduce StarBeast3—a software package
for efficient Bayesian inference under the MSC model via Markov chain Monte Carlo. We gain
efficiency by introducing cutting-edge proposal kernels and adaptive operators, and
StarBeast3 is particularly efficient when a relaxed clock model is applied. Furthermore,
gene-tree inference is parallelized, allowing the software to scale with the size of the
problem. We validated our software and benchmarked its performance using three real and
two synthetic data sets. Our results indicate that StarBeast3 is up to one-and-a-half
orders of magnitude faster than StarBeast2, and therefore more than two orders faster than
*BEAST, depending on the data set and on the parameter, and can achieve convergence on
large data sets with hundreds of genes. StarBeast3 is open-source and is easy to set up
with a friendly graphical user interface. [Adaptive; Bayesian inference; BEAST 2;
effective population sizes; high performance; multispecies coalescent; parallelization;
phylogenetics.]
Collapse
Affiliation(s)
- Jordan Douglas
- School of Computer Science, University of Auckland, 9 Symonds
Street Level 1 Student Commons, Auckland 1010, New Zealand
- Correspondence to be sent to: School of Computer Science,
University of Auckland, 9 Symonds Street Level 1 Student Commons, Auckland 1010, New
Zealand; E-mail:
| | - Cinthy L Jiménez-Silva
- School of Computer Science, University of Auckland, 9 Symonds
Street Level 1 Student Commons, Auckland 1010, New Zealand
| | - Remco Bouckaert
- School of Computer Science, University of Auckland, 9 Symonds
Street Level 1 Student Commons, Auckland 1010, New Zealand
| |
Collapse
|
6
|
Collienne L, Elmes K, Fischer M, Bryant D, Gavryushkin A. Discrete coalescent trees. J Math Biol 2021; 83:60. [PMID: 34739608 PMCID: PMC8571255 DOI: 10.1007/s00285-021-01685-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2021] [Revised: 09/11/2021] [Accepted: 10/13/2021] [Indexed: 11/29/2022]
Abstract
In many phylogenetic applications, such as cancer and virus evolution, time trees, evolutionary histories where speciation events are timed, are inferred. Of particular interest are clock-like trees, where all leaves are sampled at the same time and have equal distance to the root. One popular approach to model clock-like trees is coalescent theory, which is used in various tree inference software packages. Methodologically, phylogenetic inference methods require a tree space over which the inference is performed, and the geometry of this space plays an important role in statistical and computational aspects of tree inference algorithms. It has recently been shown that coalescent tree spaces possess a unique geometry, different from that of classical phylogenetic tree spaces. Here we introduce and study a space of discrete coalescent trees. They assume that time is discrete, which is natural in many computational applications. This tree space is a generalisation of the previously studied ranked nearest neighbour interchange space, and is built upon tree-rearrangement operations. We generalise existing results about ranked trees, including an algorithm for computing distances in polynomial time, and in particular provide new results for both the space of discrete coalescent trees and the space of ranked trees. We establish several geometrical properties of these spaces and show how these properties impact various algorithms used in phylogenetic analyses. Our tree space is a discretisation of a previously introduced time tree space, called t-space, and hence our results can be used to approximate solutions to various open problems in t-space.
Collapse
Affiliation(s)
- Lena Collienne
- Department of Computer Science, University of Otago, Dunedin, New Zealand
| | - Kieran Elmes
- Department of Computer Science, University of Otago, Dunedin, New Zealand
| | - Mareike Fischer
- Institute of Mathematics and Computer Science, University of Greifswald, Greifswald, Germany
| | - David Bryant
- Department of Mathematics and Statistics, University of Otago, Dunedin, New Zealand
| | - Alex Gavryushkin
- Department of Computer Science, University of Otago, Dunedin, New Zealand. .,School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand.
| |
Collapse
|