1
|
Samyak R, Palacios JA. Statistical summaries of unlabelled evolutionary trees. Biometrika 2024; 111:171-193. [PMID: 38352626 PMCID: PMC10861027 DOI: 10.1093/biomet/asad025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2021] [Indexed: 02/16/2024] Open
Abstract
Rooted and ranked phylogenetic trees are mathematical objects that are useful in modelling hierarchical data and evolutionary relationships with applications to many fields such as evolutionary biology and genetic epidemiology. Bayesian phylogenetic inference usually explores the posterior distribution of trees via Markov chain Monte Carlo methods. However, assessing uncertainty and summarizing distributions remains challenging for these types of structures. While labelled phylogenetic trees have been extensively studied, relatively less literature exists for unlabelled trees that are increasingly useful, for example when one seeks to summarize samples of trees obtained with different methods, or from different samples and environments, and wishes to assess the stability and generalizability of these summaries. In our paper, we exploit recently proposed distance metrics of unlabelled ranked binary trees and unlabelled ranked genealogies, or trees equipped with branch lengths, to define the Fréchet mean, variance and interquartile sets as summaries of these tree distributions. We provide an efficient combinatorial optimization algorithm for computing the Fréchet mean of a sample or of distributions on unlabelled ranked tree shapes and unlabelled ranked genealogies. We show the applicability of our summary statistics for studying popular tree distributions and for comparing the SARS-CoV-2 evolutionary trees across different locations during the COVID-19 epidemic in 2020. Our current implementations are publicly available at https://github.com/RSamyak/fmatrix.
Collapse
Affiliation(s)
- Rajanala Samyak
- Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, California 94305, U.S.A
| | - Julia A Palacios
- Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, California 94305, U.S.A
| |
Collapse
|
2
|
Dilber E, Terhorst J. Robust detection of natural selection using a probabilistic model of tree imbalance. Genetics 2022; 220:6511494. [PMID: 35100408 PMCID: PMC8893258 DOI: 10.1093/genetics/iyac009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2021] [Accepted: 12/16/2021] [Indexed: 01/21/2023] Open
Abstract
Neutrality tests such as Tajima's D and Fay and Wu's H are standard implements in the population genetics toolbox. One of their most common uses is to scan the genome for signals of natural selection. However, it is well understood that D and H are confounded by other evolutionary forces-in particular, population expansion-that may be unrelated to selection. Because they are not model-based, it is not clear how to deconfound these tests in a principled way. In this article, we derive new likelihood-based methods for detecting natural selection, which are robust to fluctuations in effective population size. At the core of our method is a novel probabilistic model of tree imbalance, which generalizes Kingman's coalescent to allow certain aberrant tree topologies to arise more frequently than is expected under neutrality. We derive a frequency spectrum-based estimator that can be used in place of D, and also extend to the case where genealogies are first estimated. We benchmark our methods on real and simulated data, and provide an open source software implementation.
Collapse
Affiliation(s)
- Enes Dilber
- Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Jonathan Terhorst
- Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA,Corresponding author: Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA.
| |
Collapse
|
3
|
Abstract
Phylogenetic networks represent evolutionary history of species and can record natural reticulate evolutionary processes such as horizontal gene transfer and gene recombination. This makes phylogenetic networks a more comprehensive representation of evolutionary history compared to phylogenetic trees. Stochastic processes for generating random trees or networks are important tools in evolutionary analysis, especially in phylogeny reconstruction where they can be utilized for validation or serve as priors for Bayesian methods. However, as more network generators are developed, there is a lack of discussion or comparison for different generators. To bridge this gap, we compare a set of phylogenetic network generators by profiling topological summary statistics of the generated networks over the number of reticulations and comparing the topological profiles.
Collapse
Affiliation(s)
- Remie Janssen
- Delft University of Technology, Delft Institute of Applied Mathematics, Mekelweg 4, 2628 CD, Delft, The Netherlands
| | - Pengyu Liu
- Simon Fraser University, Department of Mathematics, 8888 University Drive, Burnaby, BC V5A 1S6, Canada
| |
Collapse
|
4
|
Assessing the performance of methods for copy number aberration detection from single-cell DNA sequencing data. PLoS Comput Biol 2020; 16:e1008012. [PMID: 32658894 PMCID: PMC7377518 DOI: 10.1371/journal.pcbi.1008012] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2019] [Revised: 07/23/2020] [Accepted: 06/03/2020] [Indexed: 12/22/2022] Open
Abstract
Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. We benchmarked three widely used methods–Ginkgo, HMMcopy, and CopyNumber–on simulated as well as real datasets. To facilitate this, we developed a novel simulator of single-cell genome evolution in the presence of CNAs. Furthermore, to assess performance on empirical data where the ground truth is unknown, we introduce a phylogeny-based measure for identifying potentially erroneous inferences. While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, our findings show that even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient. Copy number aberrations, or CNAs, refer to evolutionary events that act on cancer genomes by deleting segments of the genomes or introducing new copies of existing segments. These events have been implicated in various types of cancer; consequently, their accurate detection could shed light on the initiation and progression of tumor, as well as on the development of potential targeted therapeutics. Single-cell DNA sequencing technologies are now producing the type of data that would allow such detection at the resolution of individual cells. However, to achieve this detection task, methods have to implement several steps of “data wrangling” and dealing with technical artifacts. In this work, we benchmarked three widely used methods for CNA detection from single-cell DNA data, namely Ginkgo, HMMcopy, and CopyNumber. To accomplish this study, we developed a novel simulator and devised a phylogeny-based measure of potentially erroneous CNA calls. We find that none of these methods has high accuracy, and all of them can be computationally very demanding. These findings call for the development of more accurate and more efficient methods for CNA detection from single-cell DNA data.
Collapse
|
5
|
Zafar H, Navin N, Chen K, Nakhleh L. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome Res 2019; 29:1847-1859. [PMID: 31628257 PMCID: PMC6836738 DOI: 10.1101/gr.243121.118] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2018] [Accepted: 07/10/2019] [Indexed: 12/12/2022]
Abstract
Accumulation and selection of somatic mutations in a Darwinian framework result in intra-tumor heterogeneity (ITH) that poses significant challenges to the diagnosis and clinical therapy of cancer. Identification of the tumor cell populations (clones) and reconstruction of their evolutionary relationship can elucidate this heterogeneity. Recently developed single-cell DNA sequencing (SCS) technologies promise to resolve ITH to a single-cell level. However, technical errors in SCS data sets, including false-positives (FP) and false-negatives (FN) due to allelic dropout, and cell doublets, significantly complicate these tasks. Here, we propose a nonparametric Bayesian method that reconstructs the clonal populations as clusters of single cells, genotypes of each clone, and the evolutionary relationship between the clones. It employs a tree-structured Chinese restaurant process as the prior on the number and composition of clonal populations. The evolution of the clonal populations is modeled by a clonal phylogeny and a finite-site model of evolution to account for potential mutation recurrence and losses. We probabilistically account for FP and FN errors, and cell doublets are modeled by employing a Beta-binomial distribution. We develop a Gibbs sampling algorithm comprising partial reversible-jump and partial Metropolis-Hastings updates to explore the joint posterior space of all parameters. The performance of our method on synthetic and experimental data sets suggests that joint reconstruction of tumor clones and clonal phylogeny under a finite-site model of evolution leads to more accurate inferences. Our method is the first to enable this joint reconstruction in a fully Bayesian framework, thus providing measures of support of the inferences it makes.
Collapse
Affiliation(s)
- Hamim Zafar
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA
| | - Nicholas Navin
- Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA
| | - Ken Chen
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA
| | - Luay Nakhleh
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| |
Collapse
|
6
|
The Probabilities of Trees and Cladograms under Ford's α-Model. ScientificWorldJournal 2018; 2018:1916094. [PMID: 29849509 PMCID: PMC5932459 DOI: 10.1155/2018/1916094] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2018] [Accepted: 03/08/2018] [Indexed: 11/17/2022] Open
Abstract
Ford's α-model is one of the most popular random parametric models of bifurcating phylogenetic tree growth, having as specific instances both the uniform and the Yule models. Its general properties have been used to study the behavior of phylogenetic tree shape indices under the probability distribution it defines. But the explicit formulas provided by Ford for the probabilities of unlabeled trees and phylogenetic trees fail in some cases. In this paper we give correct explicit formulas for these probabilities.
Collapse
|
7
|
Maliet O, Gascuel F, Lambert A. Ranked Tree Shapes, Nonrandom Extinctions, and the Loss of Phylogenetic Diversity. Syst Biol 2018; 67:1025-1040. [DOI: 10.1093/sysbio/syy030] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2016] [Accepted: 04/08/2018] [Indexed: 11/13/2022] Open
Affiliation(s)
- Odile Maliet
- Institut de Biologie de l’École Normale Supérieure (IBENS), École Normale Supérieure, CNRS, INSERM, PSL Research University, Paris, France
- ED 227, Sorbonne Universités, Paris, France
| | - Fanny Gascuel
- Institut de Biologie de l’École Normale Supérieure (IBENS), École Normale Supérieure, CNRS, INSERM, PSL Research University, Paris, France
- ED 227, Sorbonne Universités, Paris, France
- Center for Interdisciplinary Research in Biology (CIRB), Collège de France, CNRS, INSERM, PSL Research University, Paris, France
| | - Amaury Lambert
- Center for Interdisciplinary Research in Biology (CIRB), Collège de France, CNRS, INSERM, PSL Research University, Paris, France
- Laboratoire Probabilités, Statistique et Modélisation (LPSM), Sorbonne Université, CNRS, Paris, France
| |
Collapse
|
8
|
Zhan X, Ukkusuri SV, Rao PSC. Dynamics of functional failures and recovery in complex road networks. Phys Rev E 2017; 96:052301. [PMID: 29347691 DOI: 10.1103/physreve.96.052301] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2017] [Indexed: 06/07/2023]
Abstract
We propose a new framework for modeling the evolution of functional failures and recoveries in complex networks, with traffic congestion on road networks as the case study. Differently from conventional approaches, we transform the evolution of functional states into an equivalent dynamic structural process: dual-vertex splitting and coalescing embedded within the original network structure. The proposed model successfully explains traffic congestion and recovery patterns at the city scale based on high-resolution data from two megacities. Numerical analysis shows that certain network structural attributes can amplify or suppress cascading functional failures. Our approach represents a new general framework to model functional failures and recoveries in flow-based networks and allows understanding of the interplay between structure and function for flow-induced failure propagation and recovery.
Collapse
Affiliation(s)
- Xianyuan Zhan
- Lyles School of Civil Engineering, Purdue University, West Lafayette, Indiana 47907, USA
| | - Satish V Ukkusuri
- Lyles School of Civil Engineering, Purdue University, West Lafayette, Indiana 47907, USA
| | - P Suresh C Rao
- Lyles School of Civil Engineering and Agronomy Department, Purdue University, West Lafayette, Indiana 47907, USA
| |
Collapse
|
9
|
Sainudiin R, Welch D. The transmission process: A combinatorial stochastic process for the evolution of transmission trees over networks. J Theor Biol 2016; 410:137-170. [PMID: 27519948 DOI: 10.1016/j.jtbi.2016.07.038] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2016] [Revised: 07/22/2016] [Accepted: 07/22/2016] [Indexed: 10/21/2022]
Abstract
We derive a combinatorial stochastic process for the evolution of the transmission tree over the infected vertices of a host contact network in a susceptible-infected (SI) model of an epidemic. Models of transmission trees are crucial to understanding the evolution of pathogen populations. We provide an explicit description of the transmission process on the product state space of (rooted planar ranked labelled) binary transmission trees and labelled host contact networks with SI-tags as a discrete-state continuous-time Markov chain. We give the exact probability of any transmission tree when the host contact network is a complete, star or path network - three illustrative examples. We then develop a biparametric Beta-splitting model that directly generates transmission trees with exact probabilities as a function of the model parameters, but without explicitly modelling the underlying contact network, and show that for specific values of the parameters we can recover the exact probabilities for our three example networks through the Markov chain construction that explicitly models the underlying contact network. We use the maximum likelihood estimator (MLE) to consistently infer the two parameters driving the transmission process based on observations of the transmission trees and use the exact MLE to characterize equivalence classes over the space of contact networks with a single initial infection. An exploratory simulation study of the MLEs from transmission trees sampled from three other deterministic and four random families of classical contact networks is conducted to shed light on the relation between the MLEs of these families with some implications for statistical inference along with pointers to further extensions of our models. The insights developed here are also applicable to the simplest models of "meme" evolution in online social media networks through transmission events that can be distilled from observable actions such as "likes", "mentions", "retweets" and "+1s" along with any concomitant comments.
Collapse
Affiliation(s)
- Raazesh Sainudiin
- Laboratory for Mathematical Statistical Experiments, Christchurch Centre and Biomathematics Research Centre, School of Mathematics and Statistics, University of Canterbury, Private Bag 4800, Christchurch 8041, New Zealand.
| | - David Welch
- Computational Evolution Group and Department of Computer Science, University of Auckland, Private Bag 92019, Auckland 1142, New Zealand.
| |
Collapse
|