1
Xie T, Yuan M, Deng M, Zhang C. Improving Tree Probability Estimation with Stochastic Optimization and Variance Reduction. ARXIV 2024:arXiv:2409.05282v1. [PMID: 39314503 PMCID: PMC11419179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/25/2024]
Abstract
Probability estimation of tree topologies is one of the fundamental tasks in phylogenetic inference. The recently proposed subsplit Bayesian networks (SBNs) provide a powerful probabilistic graphical model for tree topology probability estimation by properly leveraging the hierarchical structure of phylogenetic trees. However, the expectation maximization (EM) method currently used for learning SBN parameters does not scale up to large data sets. In this paper, we introduce several computationally efficient methods for training SBNs and show that variance reduction could be the key for better performance. Furthermore, we also introduce the variance reduction technique to improve the optimization of SBN parameters for variational Bayesian phylogenetic inference (VBPI). Extensive synthetic and real data experiments demonstrate that our methods outperform previous baseline methods on the tasks of tree topology probability estimation as well as Bayesian phylogenetic inference using SBNs.
Affiliation(s)
- Tianyu Xie
- School of Mathematical Sciences, Peking University, Beijing, 100871, China
| | - Musu Yuan
- Center for Quantitative Biology, Peking University, Beijing, 100871, China
| | - Minghua Deng
- Center for Quantitative Biology, School of Mathematical Sciences, and Center for Statistical Science, Peking University, Beijing, 100871, China
| | - Cheng Zhang
- School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, 100871, China
2
Berling L, Collienne L, Gavryushkin A. Estimating the mean in the space of ranked phylogenetic trees. Bioinformatics 2024; 40:btae514. [PMID: 39177090 PMCID: PMC11364146 DOI: 10.1093/bioinformatics/btae514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Revised: 05/16/2024] [Accepted: 08/21/2024] [Indexed: 08/24/2024] Open
Abstract
MOTIVATION: Reconstructing evolutionary histories of biological entities, such as genes, cells, organisms, populations, and species, from phenotypic and molecular sequencing data is central to many biological, palaeontological, and biomedical disciplines. Typically, due to uncertainties and incompleteness in data, the true evolutionary history (phylogeny) is challenging to estimate. Statistical modelling approaches address this problem by introducing and studying probability distributions over all possible evolutionary histories, but can also introduce uncertainties due to misspecification. In practice, computational methods are deployed to learn those distributions, typically by sampling them. This approach, however, is fundamentally challenging as it requires designing and implementing various statistical methods over a space of phylogenetic trees (or treespace). Although the problem of developing statistics over a treespace has received substantial attention in the literature and numerous breakthroughs have been made, it remains largely unsolved. The challenge of solving this problem is two-fold: a treespace has nontrivial, often counter-intuitive geometry, implying that much of classical Euclidean statistics does not immediately apply; many parametrizations of treespace with promising statistical properties are computationally hard, so they cannot be used in data analyses. As a result, there is no single conventional method for estimating even the most fundamental statistics over any treespace, such as mean and variance, and various heuristics are used in practice. Despite the existence of numerous tree summary methods to approximate means of probability distributions over a treespace based on its geometry, and the theoretical promise of this idea, none of the attempts resulted in a practical method for summarizing tree samples.
RESULTS: In this paper, we present a tree summary method along with useful properties of our chosen treespace while focusing on its impact on phylogenetic analyses of real datasets. We perform an extensive benchmark study and demonstrate that our method outperforms the currently most popular methods with respect to a number of important 'quality' statistics. Further, we apply our method to three empirical datasets ranging from cancer evolution to linguistics and find novel insights into corresponding evolutionary problems in all of them. We hence conclude that this treespace is a promising candidate to serve as a foundation for developing statistics over phylogenetic trees analytically, as well as new computational tools for evolutionary data analyses.
AVAILABILITY AND IMPLEMENTATION: An implementation is available at https://github.com/bioDS/Centroid-Code.
Affiliation(s)
- Lars Berling
- Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
| | - Lena Collienne
- Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
| | - Alex Gavryushkin
- Biological Data Science Lab, School of Mathematics and Statistics, University of Canterbury, Christchurch 8041, New Zealand
3
Magee A, Karcher M, Matsen FA, Minin VM. How Trustworthy Is Your Tree? Bayesian Phylogenetic Effective Sample Size Through the Lens of Monte Carlo Error. BAYESIAN ANALYSIS 2024; 19:565-593. [PMID: 38665694 PMCID: PMC11042687 DOI: 10.1214/22-ba1339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/28/2024]
Abstract
Bayesian inference is a popular and widely-used approach to infer phylogenies (evolutionary trees). However, despite decades of widespread application, it remains difficult to judge how well a given Bayesian Markov chain Monte Carlo (MCMC) run explores the space of phylogenetic trees. In this paper, we investigate the Monte Carlo error of phylogenies, focusing on high-dimensional summaries of the posterior distribution, including variability in estimated edge/branch (known in phylogenetics as "split") probabilities and tree probabilities, and variability in the estimated summary tree. Specifically, we ask if there is any measure of effective sample size (ESS) applicable to phylogenetic trees which is capable of capturing the Monte Carlo error of these three summary measures. We find that there are some ESS measures capable of capturing the error inherent in using MCMC samples to approximate the posterior distributions on phylogenies. We term these tree ESS measures, and identify a set of three which are useful in practice for assessing the Monte Carlo error. Lastly, we present visualization tools that can improve comparisons between multiple independent MCMC runs by accounting for the Monte Carlo error present in each chain. Our results indicate that common post-MCMC workflows are insufficient to capture the inherent Monte Carlo error of the tree, and highlight the need for both within-chain mixing and between-chain convergence assessments.
Affiliation(s)
- Andrew Magee
- Department of Biology, University of Washington, Seattle, WA, 98195, USA
| | - Michael Karcher
- Department of Mathematics and Computer Science, Muhlenberg College, Allentown, PA, 18104, USA
| | - Frederick A. Matsen
- Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Departments of Genome Sciences and Statistics, University of Washington, Seattle, WA, 98109, USA
| | - Volodymyr M. Minin
- Department of Statistics, University of California, Irvine, Irvine, CA, 92697, USA
4
Zou Y, Zhang Z, Zeng Y, Hu H, Hao Y, Huang S, Li B. Common Methods for Phylogenetic Tree Construction and Their Implementation in R. Bioengineering (Basel) 2024; 11:480. [PMID: 38790347 PMCID: PMC11117635 DOI: 10.3390/bioengineering11050480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Revised: 05/04/2024] [Accepted: 05/07/2024] [Indexed: 05/26/2024] Open
Abstract
A phylogenetic tree can reflect the evolutionary relationships between species or gene families, and they play a critical role in modern biological research. In this review, we summarize common methods for constructing phylogenetic trees, including distance methods, maximum parsimony, maximum likelihood, Bayesian inference, and tree-integration methods (supermatrix and supertree). Here we discuss the advantages, shortcomings, and applications of each method and offer relevant codes to construct phylogenetic trees from molecular data using packages and algorithms in R. This review aims to provide comprehensive guidance and reference for researchers seeking to construct phylogenetic trees while also promoting further development and innovation in this field. By offering a clear and concise overview of the different methods available, we hope to enable researchers to select the most appropriate approach for their specific research questions and datasets.
Affiliation(s)
- Yue Zou
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
| | - Zixuan Zhang
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
| | - Yujie Zeng
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
| | - Hanyue Hu
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
| | - Youjin Hao
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
| | - Sheng Huang
- Animal Nutrition Institute, Chongqing Academy of Animal Science, Chongqing 402460, China
| | - Bo Li
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China; (Y.Z.); (Z.Z.); (Y.Z.); (H.H.); (Y.H.)
5
Collienne L, Whidden C, Gavryushkin A. Ranked Subtree Prune and Regraft. Bull Math Biol 2024; 86:24. [PMID: 38294587 PMCID: PMC10830682 DOI: 10.1007/s11538-023-01244-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 12/06/2023] [Indexed: 02/01/2024]
Abstract
Phylogenetic trees are a mathematical formalisation of evolutionary histories between organisms, species, genes, cancer cells, etc. For many applications, e.g. when analysing virus transmission trees or cancer evolution, (phylogenetic) time trees are of interest, where branch lengths represent times. Computational methods for reconstructing time trees from (typically molecular) sequence data, for example Bayesian phylogenetic inference using Markov Chain Monte Carlo (MCMC) methods, rely on algorithms that sample the treespace. They employ tree rearrangement operations such as SPR (Subtree Prune and Regraft) and NNI (Nearest Neighbour Interchange) or, in the case of time tree inference, versions of these that take times of internal nodes into account. While the classic SPR tree rearrangement is well-studied, its variants for time trees are less understood, limiting comparative analysis for time tree methods. In this paper we consider a modification of the classical SPR rearrangement on the space of ranked phylogenetic trees, which are trees equipped with a ranking of all internal nodes. This modification results in two novel treespaces, which we propose to study. We begin this study by discussing algorithmic properties of these treespaces, focusing on those relating to the complexity of computing distances under the ranked SPR operations as well as similarities and differences to known tree rearrangement based treespaces. Surprisingly, we show the counterintuitive result that adding leaves to trees can actually decrease their ranked SPR distance, which may have an impact on the results of time tree sampling algorithms given uncertain "rogue taxa".
Affiliation(s)
- Lena Collienne
- Biological Data Science Laboratory, School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand.
| | - Chris Whidden
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
| | - Alex Gavryushkin
- Biological Data Science Laboratory, School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
- Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand
6
Penn MJ, Scheidwasser N, Penn J, Donnelly CA, Duchêne DA, Bhatt S. Leaping through Tree Space: Continuous Phylogenetic Inference for Rooted and Unrooted Trees. Genome Biol Evol 2023; 15:evad213. [PMID: 38085949 PMCID: PMC10745275 DOI: 10.1093/gbe/evad213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/16/2023] [Indexed: 12/24/2023] Open
Abstract
Phylogenetics is now fundamental in life sciences, providing insights into the earliest branches of life and the origins and spread of epidemics. However, finding suitable phylogenies from the vast space of possible trees remains challenging. To address this problem, for the first time, we perform both tree exploration and inference in a continuous space where the computation of gradients is possible. This continuous relaxation allows for major leaps across tree space in both rooted and unrooted trees, and is less susceptible to convergence to local minima. Our approach outperforms the current best methods for inference on unrooted trees and, in simulation, accurately infers the tree and root in ultrametric cases. The approach is effective in cases of empirical data with negligible amounts of data, which we demonstrate on the phylogeny of jawed vertebrates. Indeed, only a few genes with an ultrametric signal were generally sufficient for resolving the major lineages of vertebrates. Optimization is possible via automatic differentiation and our method presents an effective way forward for exploring the most difficult, data-deficient phylogenetic questions.
Affiliation(s)
- Matthew J Penn
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Neil Scheidwasser
- Section of Epidemiology, University of Copenhagen, Copenhagen, Denmark
| | - Joseph Penn
- Department of Physics, University of Oxford, Oxford, United Kingdom
| | - Christl A Donnelly
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Pandemic Sciences Institute, University of Oxford, Oxford, United Kingdom
- Department of Infectious Disease Epidemiology, MRC Centre for Global Infectious Disease Analysis, School of Public Health, Faculty of Medicine, Imperial College London, London, United Kingdom
| | - David A Duchêne
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Samir Bhatt
- Section of Epidemiology, University of Copenhagen, Copenhagen, Denmark
- Department of Infectious Disease Epidemiology, MRC Centre for Global Infectious Disease Analysis, School of Public Health, Faculty of Medicine, Imperial College London, London, United Kingdom
7
Dumm W, Barker M, Howard-Snyder W, DeWitt III WS, Matsen IV FA. Representing and extending ensembles of parsimonious evolutionary histories with a directed acyclic graph. J Math Biol 2023; 87:75. [PMID: 37878119 PMCID: PMC10600060 DOI: 10.1007/s00285-023-02006-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Revised: 09/12/2023] [Accepted: 09/26/2023] [Indexed: 10/26/2023]
Abstract
In many situations, it would be useful to know not just the best phylogenetic tree for a given data set, but the collection of high-quality trees. This goal is typically addressed using Bayesian techniques; however, current Bayesian methods do not scale to large data sets. Furthermore, for large data sets with relatively low signal one cannot even store every good tree individually, especially when the trees are required to be bifurcating. In this paper, we develop a novel object called the "history subpartition directed acyclic graph" (or "history sDAG" for short) that compactly represents an ensemble of trees with labels (e.g. ancestral sequences) mapped onto the internal nodes. The history sDAG can be built efficiently and can also be efficiently trimmed to only represent maximally parsimonious trees. We show that the history sDAG allows us to find many additional equally parsimonious trees, extending combinatorially beyond the ensemble used to construct it. We argue that this object could be useful as the "skeleton" of a more complete uncertainty quantification.
Affiliation(s)
- Will Dumm
- Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Howard Hughes Medical Institute, Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
| | - Mary Barker
- Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Howard Hughes Medical Institute, Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
| | - William Howard-Snyder
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
| | - William S DeWitt III
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA
| | - Frederick A Matsen IV
- Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA.
- Howard Hughes Medical Institute, Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA.
- Department of Genome Sciences, University of Washington, Seattle, Washington, USA.
- Department of Statistics, University of Washington, Seattle, Washington, USA.
8
Jun SH, Nasif H, Jennings-Shaffer C, Rich DH, Kooperberg A, Fourment M, Zhang C, Suchard MA, Matsen FA. A topology-marginal composite likelihood via a generalized phylogenetic pruning algorithm. Algorithms Mol Biol 2023; 18:10. [PMID: 37525243 PMCID: PMC10391877 DOI: 10.1186/s13015-023-00235-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Accepted: 07/03/2023] [Indexed: 08/02/2023] Open
Abstract
Bayesian phylogenetics is a computationally challenging inferential problem. Classical methods are based on random-walk Markov chain Monte Carlo (MCMC), where random proposals are made on the tree parameter and the continuous parameters simultaneously. Variational phylogenetics is a promising alternative to MCMC, in which one fits an approximating distribution to the unnormalized phylogenetic posterior. Previous work fit this variational approximation using stochastic gradient descent, which is the canonical way of fitting general variational approximations. However, phylogenetic trees are special structures, giving opportunities for efficient computation. In this paper we describe a new algorithm that directly generalizes the Felsenstein pruning algorithm (a.k.a. sum-product algorithm) to compute a composite-like likelihood by marginalizing out ancestral states and subtrees simultaneously. We show the utility of this algorithm by rapidly making point estimates for branch lengths of a multi-tree phylogenetic model. These estimates accord with a long MCMC run and with estimates obtained using a variational method, but are much faster to obtain. Thus, although generalized pruning does not lead to a variational algorithm as such, we believe that it will form a useful starting point for variational inference.
Affiliation(s)
- Seong-Hwan Jun
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, USA
| | - Hassan Nasif
- Department of Statistics, University of Washington, Seattle, USA
| | - Chris Jennings-Shaffer
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA USA
| | - David H Rich
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA USA
| | - Anna Kooperberg
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA USA
| | - Mathieu Fourment
- Australian Institute for Microbiology and Infection, University of Technology Sydney, Ultimo, NSW Australia
| | - Cheng Zhang
- School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, China
| | - Marc A Suchard
- Department of Human Genetics, University of California, Los Angeles, USA
- Department of Computational Medicine, University of California, Los Angeles, USA
- Department of Biostatistics, University of California, Los Angeles, USA
| | - Frederick A Matsen
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA USA
- Department of Genome Sciences, University of Washington, Seattle, USA
- Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, Washington USA
- Computational Biology Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Mail stop: S2-140, Seattle, WA 98109-1024 USA
9
Khodaei M, Owen M, Beerli P. Geodesics to characterize the phylogenetic landscape. PLoS One 2023; 18:e0287350. [PMID: 37352194 PMCID: PMC10289362 DOI: 10.1371/journal.pone.0287350] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 06/04/2023] [Indexed: 06/25/2023] Open
Abstract
Phylogenetic trees are fundamental for understanding evolutionary history. However, finding maximum likelihood trees is challenging due to the complexity of the likelihood landscape and the size of tree space. Based on the Billera-Holmes-Vogtmann (BHV) distance between trees, we describe a method to generate intermediate trees on the shortest path between two trees, called pathtrees. These pathtrees give a structured way to generate and visualize part of treespace. They allow investigating intermediate regions between trees of interest, exploring locally optimal trees in topological clusters of treespace, and potentially finding trees of high likelihood unexplored by tree search algorithms. We compared our approach against other tree search tools (Paup*, RAxML, and RevBayes) using the highest likelihood trees and number of new topologies found, and validated the accuracy of the generated treespace. We assess our method using two datasets. The first consists of 23 primate species (CytB, 1141 bp), leading to well-resolved relationships. The second is a dataset of 182 milksnakes (CytB, 1117 bp), containing many similar sequences and complex relationships among individuals. Our method visualizes the treespace using log likelihood as a fitness function. It finds similarly optimal trees as heuristic methods and presents the likelihood landscape at different scales. It found relevant trees that were not found with MCMC methods. The validation measures indicated that our method performed well mapping treespace into lower dimensions. Our method complements heuristic search analyses, and the visualization allows the inspection of likelihood terraces and exploration of treespace areas not visited by heuristic searches.
Affiliation(s)
- Marzieh Khodaei
- Department of Scientific Computing, Florida State University, Tallahassee, FL, United States of America
| | - Megan Owen
- Department of Mathematics, Lehman College and Graduate Center, CUNY, NY, NY, United States of America
| | - Peter Beerli
- Department of Scientific Computing, Florida State University, Tallahassee, FL, United States of America
10
Macaulay M, Darling A, Fourment M. Fidelity of hyperbolic space for Bayesian phylogenetic inference. PLoS Comput Biol 2023; 19:e1011084. [PMID: 37099595 PMCID: PMC10166537 DOI: 10.1371/journal.pcbi.1011084] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/23/2022] [Revised: 05/08/2023] [Accepted: 04/08/2023] [Indexed: 04/27/2023] Open
Abstract
Bayesian inference for phylogenetics is a gold standard for computing distributions of phylogenies. However, Bayesian phylogenetics faces the challenging computational problem of moving throughout the high-dimensional space of trees. Fortunately, hyperbolic space offers a low dimensional representation of tree-like data. In this paper, we embed genomic sequences as points in hyperbolic space and perform hyperbolic Markov Chain Monte Carlo for Bayesian inference in this space. The posterior probability of an embedding is computed by decoding a neighbour-joining tree from the embedding locations of the sequences. We empirically demonstrate the fidelity of this method on eight data sets. We systematically investigated the effect of embedding dimension and hyperbolic curvature on the performance in these data sets. The sampled posterior distribution recovers the splits and branch lengths to a high degree over a range of curvatures and dimensions. We systematically investigated the effects of the embedding space's curvature and dimension on the Markov Chain's performance, demonstrating the suitability of hyperbolic space for phylogenetic inference.
Affiliation(s)
- Matthew Macaulay
- University of Technology Sydney, Australian Institute for Microbiology & Infection, Sydney, Australia
| | | | - Mathieu Fourment
- University of Technology Sydney, Australian Institute for Microbiology & Infection, Sydney, Australia
11
Zhang C. Learnable Topological Features for Phylogenetic Inference via Graph Neural Networks. ARXIV 2023:arXiv:2302.08840v1. [PMID: 36824431 PMCID: PMC9949155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2023]
Abstract
Structural information of phylogenetic tree topologies plays an important role in phylogenetic inference. However, finding appropriate topological structures for specific phylogenetic inference tasks often requires significant design effort and domain expertise. In this paper, we propose a novel structural representation method for phylogenetic inference based on learnable topological features. By combining the raw node features that minimize the Dirichlet energy with modern graph representation learning techniques, our learnable topological features can provide efficient structural information of phylogenetic trees that automatically adapts to different downstream tasks without requiring domain expertise. We demonstrate the effectiveness and efficiency of our method on a simulated data tree probability estimation task and a benchmark of challenging real data variational Bayesian phylogenetic inference problems.
Affiliation(s)
- Cheng Zhang
- School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, China
12
Xiang C, Gao F, Jakovlić I, Lei H, Hu Y, Zhang H, Zou H, Wang G, Zhang D. Using PhyloSuite for molecular phylogeny and tree-based analyses. IMETA 2023; 2:e87. [PMID: 38868339 PMCID: PMC10989932 DOI: 10.1002/imt2.87] [Citation(s) in RCA: 75] [Impact Index Per Article: 75.0] [Received: 11/18/2022] [Revised: 01/04/2023] [Accepted: 01/15/2023] [Indexed: 06/14/2024]
Abstract
Phylogenetic analysis has entered the genomics (multilocus) era. For less experienced researchers, conquering the large number of software programs required for a multilocus-based phylogenetic reconstruction can be somewhat daunting and time-consuming. PhyloSuite, a software package with a user-friendly GUI, was designed to make this process more accessible by integrating multiple software programs needed for multilocus and single-gene phylogenies and further streamlining the whole process. In this protocol, we aim to explain how to conduct each step of the phylogenetic pipeline and tree-based analyses in PhyloSuite. We also present a new version of PhyloSuite (v1.2.3), wherein we fixed some bugs, made some optimizations, and introduced some new functions, including a number of tree-based analyses, such as signal-to-noise calculation, saturation analysis, spurious species identification, etc. The step-by-step protocol includes background information (i.e., what the step does), reasons (i.e., why to do the step), and operations (i.e., how to do it). This protocol will help researchers quick-start their way through the multilocus phylogenetic analysis, especially those interested in conducting organelle-based analyses.
Affiliation(s)
- Chuan‐Yu Xiang
- State Key Laboratory of Grassland Agro‐Ecosystems, and College of Ecology, Lanzhou University, Lanzhou, China
| | - Fangluan Gao
- Institute of Plant Virology, Fujian Agriculture and Forestry University, Fuzhou, China
| | - Ivan Jakovlić
- State Key Laboratory of Grassland Agro‐Ecosystems, and College of Ecology, Lanzhou University, Lanzhou, China
| | - Hong‐Peng Lei
- State Key Laboratory of Grassland Agro‐Ecosystems, and College of Ecology, Lanzhou University, Lanzhou, China
| | - Ye Hu
- State Key Laboratory of Grassland Agro‐Ecosystems, and College of Ecology, Lanzhou University, Lanzhou, China
| | - Hong Zhang
- State Key Laboratory of Grassland Agro‐Ecosystems, and College of Ecology, Lanzhou University, Lanzhou, China
| | - Hong Zou
- Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
| | - Gui‐Tang Wang
- Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
| | - Dong Zhang
- State Key Laboratory of Grassland Agro‐Ecosystems, and College of Ecology, Lanzhou University, Lanzhou, China
13
Chao E, Chato C, Vender R, Olabode AS, Ferreira RC, Poon AFY. Molecular source attribution. PLoS Comput Biol 2022; 18:e1010649. [PMID: 36395093 PMCID: PMC9671344 DOI: 10.1371/journal.pcbi.1010649] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022] Open
Affiliation(s)
- Elisa Chao
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
- Connor Chato
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
- Reid Vender
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
  - School of Medicine, Queen’s University, Kingston, Ontario, Canada
- Abayomi S. Olabode
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
- Roux-Cil Ferreira
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
- Art F. Y. Poon
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
14
Hassler GW, Magee A, Zhang Z, Baele G, Lemey P, Ji X, Fourment M, Suchard MA. Data integration in Bayesian phylogenetics. Annual Review of Statistics and Its Application 2022; 10:353-377. [PMID: 38774036 PMCID: PMC11108065 DOI: 10.1146/annurev-statistics-033021-112532] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/24/2024]
Abstract
Researchers studying the evolution of viral pathogens and other organisms increasingly encounter and use large and complex data sets from multiple different sources. Statistical research in Bayesian phylogenetics has risen to this challenge. Researchers use phylogenetics not only to reconstruct the evolutionary history of a group of organisms, but also to understand the processes that guide its evolution and spread through space and time. To this end, it is now the norm to integrate numerous sources of data. For example, epidemiologists studying the spread of a virus through a region incorporate data including genetic sequences (e.g. DNA), time, location (both continuous and discrete) and environmental covariates (e.g. social connectivity between regions) into a coherent statistical model. Evolutionary biologists routinely do the same with genetic sequences, location, time, fossil and modern phenotypes, and ecological covariates. These complex, hierarchical models readily accommodate both discrete and continuous data and have enormous combined discrete/continuous parameter spaces including, at a minimum, phylogenetic tree topologies and branch lengths. The increased size and complexity of these statistical models have spurred advances in computational methods to make them tractable. We discuss both the modeling and computational advances below, as well as unsolved problems and areas of active research.
Affiliation(s)
- Gabriel W Hassler
  - Department of Computational Medicine, University of California, Los Angeles, USA, 90095
- Andrew Magee
  - Department of Biostatistics, University of California, Los Angeles, USA, 90095
- Zhenyu Zhang
  - Department of Biostatistics, University of California, Los Angeles, USA, 90095
- Guy Baele
  - Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium, 3000
- Philippe Lemey
  - Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium, 3000
- Xiang Ji
  - Department of Mathematics, Tulane University, New Orleans, USA, 70118
- Mathieu Fourment
  - Australian Institute for Microbiology and Infection, University of Technology Sydney, Ultimo NSW, Australia, 2007
- Marc A Suchard
  - Department of Computational Medicine, University of California, Los Angeles, USA, 90095
  - Department of Biostatistics, University of California, Los Angeles, USA, 90095
  - Department of Human Genetics, University of California, Los Angeles, USA, 90095
15
Convergence Rates of Attractive-Repulsive MCMC Algorithms. Methodol Comput Appl Probab 2022. [DOI: 10.1007/s11009-021-09909-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
16
Smith MR. Robust Analysis of Phylogenetic Tree Space. Syst Biol 2022; 71:1255-1270. [PMID: 34963003 PMCID: PMC9366458 DOI: 10.1093/sysbio/syab100] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/02/2020] [Revised: 12/03/2021] [Accepted: 12/23/2021] [Indexed: 11/13/2022] Open
Abstract
Phylogenetic analyses often produce large numbers of trees. Mapping trees' distribution in "tree space" can illuminate the behavior and performance of search strategies, reveal distinct clusters of optimal trees, and expose differences between different data sources or phylogenetic methods, but the high-dimensional spaces defined by metric distances are necessarily distorted when represented in fewer dimensions. Here, I explore the consequences of this transformation in phylogenetic search results from 128 morphological data sets, using stratigraphic congruence (a complementary aspect of tree similarity) to evaluate the utility of low-dimensional mappings. I find that phylogenetic similarities between cladograms are most accurately depicted in tree spaces derived from information-theoretic tree distances or the quartet distance. Robinson-Foulds tree spaces exhibit prominent distortions and often fail to group trees according to phylogenetic similarity, whereas the strong influence of tree shape on the Kendall-Colijn distance makes its tree space unsuitable for many purposes. Distances mapped into two or even three dimensions often display little correspondence with true distances, which can lead to profound misrepresentation of clustering structure. Without explicit testing, one cannot be confident that a tree space mapping faithfully represents the true distribution of trees, nor that visually evident structure is valid. My recommendations for tree space validation and visualization are implemented in a new graphical user interface in the "TreeDist" R package. [Multidimensional scaling; phylogenetic software; tree distance metrics; treespace projections.]
Affiliation(s)
- Martin R Smith
  - Department of Earth Sciences, Durham University, Durham, UK
17
Abstract
The ongoing global pandemic has sharply increased the amount of data available to researchers in epidemiology and public health. Unfortunately, few existing analysis tools are capable of exploiting all of the information contained in a pandemic-scale data set, resulting in missed opportunities for improved surveillance and contact tracing. In this paper, we develop the variational Bayesian skyline (VBSKY), a method for fitting Bayesian phylodynamic models to very large pathogen genetic data sets. By combining recent advances in phylodynamic modeling, scalable Bayesian inference and differentiable programming, along with a few tailored heuristics, VBSKY is capable of analyzing thousands of genomes in a few minutes, providing accurate estimates of epidemiologically relevant quantities such as the effective reproduction number and overall sampling effort through time. We illustrate the utility of our method by performing a rapid analysis of a large number of SARS-CoV-2 genomes, and demonstrate that the resulting estimates closely track those derived from alternative sources of public health data.
Affiliation(s)
- Caleb Ki
  - Department of Statistics, University of Michigan, Ann Arbor, MI, USA
- Jonathan Terhorst
  - Department of Statistics, University of Michigan, Ann Arbor, MI, USA
18
Cappello L, Kim J, Liu S, Palacios JA. Statistical Challenges in Tracking the Evolution of SARS-CoV-2. Stat Sci 2022; 37:162-182. [PMID: 36034090 PMCID: PMC9409356 DOI: 10.1214/22-sts853] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
Abstract
Genomic surveillance of SARS-CoV-2 has been instrumental in tracking the spread and evolution of the virus during the pandemic. The availability of SARS-CoV-2 molecular sequences isolated from infected individuals, coupled with phylodynamic methods, has provided insights into the origin of the virus, its evolutionary rate, the timing of introductions, the patterns of transmission, and the rise of novel variants that have spread through populations. Despite enormous global efforts of governments, laboratories, and researchers to collect and sequence molecular data, many challenges remain in analyzing and interpreting the data collected. Here, we describe the models and methods currently used to monitor the spread of SARS-CoV-2, discuss long-standing and new statistical challenges, and propose a method for tracking the rise of novel variants during the epidemic.
Affiliation(s)
- Lorenzo Cappello
  - Departments of Economics and Business, Universitat Pompeu Fabra, 08005, Spain
- Jaehee Kim
  - Department of Computational Biology, Cornell University, Ithaca, New York 14853, USA
- Sifan Liu
  - Department of Statistics, Stanford University, Stanford, California 94305, USA
- Julia A Palacios
  - Departments of Statistics and Biomedical Data Sciences, Stanford University, Stanford, California 94305, USA
19
Fabreti LG, Höhna S. Convergence assessment for Bayesian phylogenetic analysis using MCMC simulation. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13727] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/29/2022]
Affiliation(s)
- Luiza Guimarães Fabreti
  - GeoBio‐Center LMU, Ludwig‐Maximilians‐Universität München, Munich, Germany
  - Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig‐Maximilians‐Universität München, Munich, Germany
- Sebastian Höhna
  - GeoBio‐Center LMU, Ludwig‐Maximilians‐Universität München, Munich, Germany
  - Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig‐Maximilians‐Universität München, Munich, Germany
20
Richards A, Kubatko L. Bayesian-Weighted Triplet and Quartet Methods for Species Tree Inference. Bull Math Biol 2021; 83:93. [PMID: 34297209 DOI: 10.1007/s11538-021-00918-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/13/2020] [Accepted: 06/03/2021] [Indexed: 11/26/2022]
Abstract
Inference of the evolutionary histories of species, commonly represented by a species tree, is complicated by the divergent evolutionary history of different parts of the genome. Different loci on the genome can have different histories from the underlying species tree (and each other) due to processes such as incomplete lineage sorting (ILS), gene duplication and loss, and horizontal gene transfer. The multispecies coalescent is a commonly used model for performing inference on species and gene trees in the presence of ILS. This paper introduces Lily-T and Lily-Q, two new methods for species tree inference under the multispecies coalescent. We then compare them to two frequently used methods, SVDQuartets and ASTRAL, using simulated and empirical data. Both methods generally showed improvement over SVDQuartets, and Lily-Q was superior to Lily-T for most simulation settings. The comparison to ASTRAL was more mixed: Lily-Q tended to be better than ASTRAL when the length of recombination-free loci was short, when the coalescent population parameter [Formula: see text] was small, or when the internal branch lengths were longer.
Affiliation(s)
- Andrew Richards
  - Department of Statistics, The Ohio State University, Columbus, USA
- Laura Kubatko
  - Department of Statistics, The Ohio State University, Columbus, USA
  - Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, USA
21
Harrington SM, Wishingrad V, Thomson RC. Properties of Markov Chain Monte Carlo Performance across Many Empirical Alignments. Mol Biol Evol 2021; 38:1627-1640. [PMID: 33185685 PMCID: PMC8042746 DOI: 10.1093/molbev/msaa295] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022] Open
Abstract
Nearly all current Bayesian phylogenetic applications rely on Markov chain Monte Carlo (MCMC) methods to approximate the posterior distribution for trees and other parameters of the model. These approximations are only reliable if Markov chains adequately converge and sample from the joint posterior distribution. Although several studies of phylogenetic MCMC convergence exist, these have focused on simulated data sets or select empirical examples. Therefore, much that is considered common knowledge about MCMC in empirical systems derives from a relatively small family of analyses under ideal conditions. To address this, we present an overview of commonly applied phylogenetic MCMC diagnostics and an assessment of patterns of these diagnostics across more than 18,000 empirical analyses. Many analyses appeared to perform well, and failures in convergence were most likely to be detected using the average standard deviation of split frequencies, a diagnostic that compares topologies among independent chains. Different diagnostics yielded different information about failed convergence, demonstrating that multiple diagnostics must be employed to reliably detect problems. The number of taxa and average branch lengths in analyses have clear impacts on MCMC performance, with more taxa and shorter branches leading to more difficult convergence. We show that the usage of models that include both Γ-distributed among-site rate variation and a proportion of invariable sites is not broadly problematic for MCMC convergence but is also unnecessary. Changes to heating and the usage of model-averaged substitution models can both offer improved convergence in some cases, but neither is a panacea.
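The average standard deviation of split frequencies named in this abstract can be illustrated with a minimal Python sketch. It assumes each independent run has already been summarized as a mapping from a split identifier to its sampled frequency; the function name `asdsf`, the `min_freq` cutoff, and the use of the sample standard deviation are assumptions of this sketch, not details taken from the paper.

```python
from math import sqrt

def asdsf(split_freqs, min_freq=0.1):
    """Average standard deviation of split frequencies across runs.

    split_freqs: list of dicts, one per independent run, mapping a split
    identifier (any hashable value) to its sampled frequency in [0, 1].
    min_freq: hypothetical cutoff; a split is counted only if at least
    one run samples it at this frequency or higher.
    """
    if len(split_freqs) < 2:
        raise ValueError("need at least two independent runs")
    all_splits = set().union(*split_freqs)
    deviations = []
    for split in all_splits:
        # A split never sampled by a run has frequency 0 in that run.
        freqs = [run.get(split, 0.0) for run in split_freqs]
        if max(freqs) < min_freq:
            continue
        mean = sum(freqs) / len(freqs)
        # Sample variance across runs (assumed convention for this sketch).
        var = sum((f - mean) ** 2 for f in freqs) / (len(freqs) - 1)
        deviations.append(sqrt(var))
    return sum(deviations) / len(deviations) if deviations else 0.0
```

Values near zero indicate that the runs agree on how often each split is sampled; larger values suggest the chains have not converged on the same distribution of topologies.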
Affiliation(s)
- Van Wishingrad
  - School of Life Sciences, University of Hawai'i, Honolulu, HI
22
Porto DS, Almeida EAB, Pennell MW. Investigating Morphological Complexes Using Informational Dissonance and Bayes Factors: A Case Study in Corbiculate Bees. Syst Biol 2021; 70:295-306. [PMID: 32722788 PMCID: PMC7882150 DOI: 10.1093/sysbio/syaa059] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/22/2019] [Revised: 07/16/2020] [Accepted: 07/17/2020] [Indexed: 11/22/2022] Open
Abstract
It is widely recognized that different regions of a genome often have different evolutionary histories and that ignoring this variation when estimating phylogenies can be misleading. However, the extent to which this is also true for morphological data is still largely unknown. Discordance among morphological traits might plausibly arise due to either variable convergent selection pressures or else phenomena such as hemiplasy. Here, we investigate patterns of discordance among 282 morphological characters, which we scored for 50 bee species particularly targeting corbiculate bees, a group that includes the well-known eusocial honeybees and bumblebees. As a starting point for selecting the most meaningful partitions in the data, we grouped characters as morphological modules, highly integrated trait complexes that as a result of developmental constraints or coordinated selection we expect to share an evolutionary history and trajectory. In order to assess conflict and coherence across and within these morphological modules, we used recently developed approaches for computing Bayesian phylogenetic information allied with model comparisons using Bayes factors. We found that despite considerable conflict among morphological complexes, accounting for among-character and among-partition rate variation with individual gamma distributions, rate multipliers, and linked branch lengths can lead to coherent phylogenetic inference using morphological data. We suggest that evaluating information content and dissonance among partitions is a useful step in estimating phylogenies from morphological data, just as it is with molecular data. Furthermore, we argue that adopting emerging approaches for investigating dissonance in genomic datasets may provide new insights into the integration and evolution of anatomical complexes. [Apidae; entropy; morphological modules; phenotypic integration; phylogenetic information.].
Affiliation(s)
- Diego S Porto
  - Laboratório de Biologia Comparada e Abelhas (LBCA), Departamento de Biologia, Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto (FFCLRP), Universidade de São Paulo, 14040-901 Ribeirão Preto, SP, Brazil
  - Department of Zoology and Biodiversity Research Centre, University of British Columbia, Vancouver BC V6T 1Z4, Canada
  - Department of Biological Sciences, Virginia Polytechnic Institute and State University, 926 West Campus Drive, Blacksburg, VA 24061 USA
- Eduardo A B Almeida
  - Laboratório de Biologia Comparada e Abelhas (LBCA), Departamento de Biologia, Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto (FFCLRP), Universidade de São Paulo, 14040-901 Ribeirão Preto, SP, Brazil
- Matthew W Pennell
  - Department of Zoology and Biodiversity Research Centre, University of British Columbia, Vancouver BC V6T 1Z4, Canada
23
Meyer X. Adaptive Tree Proposals for Bayesian Phylogenetic Inference. Syst Biol 2021; 70:1015-1032. [PMID: 33515248 PMCID: PMC8357345 DOI: 10.1093/sysbio/syab004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/02/2019] [Revised: 01/07/2021] [Accepted: 01/17/2021] [Indexed: 11/14/2022] Open
Abstract
Bayesian inference of phylogeny with MCMC plays a key role in the study of evolution. Yet, this method still suffers from a practical challenge identified more than two decades ago: designing tree topology proposals that efficiently sample tree spaces. In this article, I introduce the concept of adaptive tree proposals for unrooted topologies, that is, tree proposals that adapt to the posterior distribution as it is estimated. I use this concept to elaborate two adaptive variants of existing proposals and an adaptive proposal based on a novel design philosophy in which the structure of the proposal is informed by the posterior distribution of trees. I investigate the performance of these proposals by first presenting a metric that captures the performance of each proposal within a mixture of proposals. Using this metric, I compare the performance of the adaptive proposals to the performance of standard and parsimony-guided proposals on 11 empirical datasets. Using adaptive proposals led to consistent performance gains and resulted in up to 18-fold increases in mixing efficiency and 6-fold increases in convergence rate without increasing the computational cost of these analyses.
Affiliation(s)
- X Meyer
  - Department of Integrative Biology, University of California, Berkeley, California 94720, USA
24
Müller NF, Bouckaert RR. Adaptive Metropolis-coupled MCMC for BEAST 2. PeerJ 2020; 8:e9473. [PMID: 32995072 PMCID: PMC7501786 DOI: 10.7717/peerj.9473] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 01/15/2020] [Accepted: 06/12/2020] [Indexed: 11/25/2022] Open
Abstract
With ever more complex models used to study evolutionary patterns, approaches that facilitate efficient inference under such models are needed. Metropolis-coupled Markov chain Monte Carlo (MCMC) has long been used to speed up phylogenetic analyses and to make use of multi-core CPUs. Metropolis-coupled MCMC essentially runs multiple MCMC chains in parallel. All chains are heated except for one cold chain that explores the posterior probability space like a regular MCMC chain. This heating allows chains to make bigger jumps in phylogenetic state space. The heated chains can then be used to propose new states for other chains, including the cold chain. One of the practical challenges of using this approach is to find optimal temperatures of the heated chains to efficiently explore state spaces. Here we provide an adaptive Metropolis-coupled MCMC scheme for Bayesian phylogenetics, where the temperature difference between heated chains is automatically tuned to achieve a target acceptance probability of states being exchanged between individual chains. We first show the validity of this approach by comparing inferences of adaptive Metropolis-coupled MCMC to MCMC on several datasets. We then explore where Metropolis-coupled MCMC provides benefits over MCMC. We implemented this adaptive Metropolis-coupled MCMC approach as an open source package, licensed under GPL 3.0, for the Bayesian phylogenetics software BEAST 2, available from https://github.com/nicfel/CoupledMCMC.
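The tuning idea described in this abstract can be sketched on a toy problem. The Python below is not the BEAST 2 implementation: the one-dimensional bimodal target, the heating scheme `beta_i = 1 / (1 + i * delta)`, the target swap acceptance of 0.234, and the Robbins-Monro style update of `delta` are all assumptions chosen for illustration.

```python
import math
import random

def log_target(x):
    # Toy bimodal density standing in for a phylogenetic posterior
    # (computed with log-sum-exp to avoid underflow).
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def adaptive_mc3(n_iter=5000, n_chains=4, target_accept=0.234, seed=1):
    rng = random.Random(seed)
    delta = 0.1                     # temperature spacing, tuned during the run
    states = [0.0] * n_chains       # chain 0 is the cold chain
    cold_samples = []
    for it in range(1, n_iter + 1):
        # Assumed heating scheme: inverse temperature falls with chain index.
        betas = [1.0 / (1.0 + i * delta) for i in range(n_chains)]
        # Within-chain Metropolis update on each tempered target.
        for i in range(n_chains):
            prop = states[i] + rng.gauss(0.0, 1.0)
            log_ratio = betas[i] * (log_target(prop) - log_target(states[i]))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                states[i] = prop
        # Propose exchanging the states of a random adjacent pair of chains.
        i = rng.randrange(n_chains - 1)
        log_swap = (betas[i] - betas[i + 1]) * (
            log_target(states[i + 1]) - log_target(states[i])
        )
        accept_prob = math.exp(min(0.0, log_swap))
        if rng.random() < accept_prob:
            states[i], states[i + 1] = states[i + 1], states[i]
        # Widen the spacing when swaps are accepted too often, narrow it
        # when they are accepted too rarely; the step size decays as 1/it.
        delta = max(1e-3, delta + (accept_prob - target_accept) / it)
        cold_samples.append(states[0])
    return cold_samples, delta
```

Only the cold chain's samples are returned, since it is the one that targets the untempered distribution; the decaying step size is one way to make the adaptation vanish over time.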
Affiliation(s)
- Nicola F Müller
  - Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Remco R Bouckaert
  - School of Computer Science, University of Auckland, Auckland, New Zealand
  - Max Planck Institute for the Science of Human History, Jena, Germany
25
Zhang C, Huelsenbeck JP, Ronquist F. Using Parsimony-Guided Tree Proposals to Accelerate Convergence in Bayesian Phylogenetic Inference. Syst Biol 2020; 69:1016-1032. [PMID: 31985810 PMCID: PMC7440752 DOI: 10.1093/sysbio/syaa002] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 02/22/2018] [Revised: 01/15/2020] [Accepted: 01/17/2020] [Indexed: 12/18/2022] Open
Abstract
Sampling across tree space is one of the major challenges in Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) algorithms. Standard MCMC tree moves consider small random perturbations of the topology, and select from candidate trees at random or based on the distance between the old and new topologies. MCMC algorithms using such moves tend to get trapped in tree space, making them slow in finding the globally most probable trees (known as "convergence") and in estimating the correct proportions of the different types of them (known as "mixing"). Here, we introduce a new class of moves, which propose trees based on their parsimony scores. The proposal distribution derived from the parsimony scores is a quickly computable albeit rough approximation of the conditional posterior distribution over candidate trees. We demonstrate with simulations that parsimony-guided moves correctly sample the uniform distribution of topologies from the prior. We then evaluate their performance against standard moves using six challenging empirical data sets, for which we were able to obtain accurate reference estimates of the posterior using long MCMC runs, a mix of topology proposals, and Metropolis coupling. On these data sets, ranging in size from 357 to 934 taxa and from 1740 to 5681 sites, we find that single chains using parsimony-guided moves usually converge an order of magnitude faster than chains using standard moves. They also exhibit better mixing, that is, they cover the most probable trees more quickly. Our results show that tree moves based on quick and dirty estimates of the posterior probability can significantly outperform standard moves. Future research will have to show to what extent the performance of such moves can be improved further by finding better ways of approximating the posterior probability, taking the trade-off between accuracy and speed into account. [Bayesian phylogenetic inference; MCMC; parsimony; tree proposal.].
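As a rough illustration of proposing candidate trees according to their parsimony scores, the Python sketch below turns a list of scores into a proposal distribution. The exponential weighting and the `weight` parameter are assumptions of this sketch; the abstract does not state the functional form used in the paper.

```python
import math
import random

def proposal_probs(parsimony_scores, weight=0.5):
    """Map parsimony scores (lower is better) to proposal probabilities.

    Assumed form: probability proportional to exp(-weight * score).
    Scores are shifted by the minimum for numerical stability.
    """
    best = min(parsimony_scores)
    raw = [math.exp(-weight * (s - best)) for s in parsimony_scores]
    total = sum(raw)
    return [r / total for r in raw]

def choose_candidate(parsimony_scores, rng, weight=0.5):
    """Pick a candidate index and return it with its proposal probability."""
    probs = proposal_probs(parsimony_scores, weight)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs[idx]
```

Because such a proposal is not symmetric, a sampler using it would need the forward and reverse proposal probabilities in the Hastings ratio, which is why `choose_candidate` returns the probability along with the index.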
Affiliation(s)
- Chi Zhang
  - Key Laboratory of Vertebrate Evolution and Human Origins, Institute of Vertebrate Paleontology and Paleoanthropology, Chinese Academy of Sciences, 142 XizhimenWai Street, Beijing 100044, China
  - Center for Excellence in Life and Paleoenvironment, Chinese Academy of Sciences, 142 XizhimenWai Street, Beijing 100044, China
- John P Huelsenbeck
  - Department of Integrative Biology, University of California, Berkeley, CA 94720, USA
- Fredrik Ronquist
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Box 50007, SE-10405 Stockholm, Sweden
26
Fischer M, Francis A. The Space of Tree-Based Phylogenetic Networks. Bull Math Biol 2020; 82:70. [PMID: 32500263 DOI: 10.1007/s11538-020-00744-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/11/2019] [Accepted: 05/05/2020] [Indexed: 10/24/2022]
Abstract
Phylogenetic networks are generalizations of phylogenetic trees that allow the representation of reticulation events such as horizontal gene transfer or hybridization, and can also represent uncertainty in inference. A subclass of these, tree-based phylogenetic networks, has been introduced to capture the extent to which reticulate evolution nevertheless broadly follows tree-like patterns. Several important operations that change a general phylogenetic network have been developed in recent years and are important for allowing algorithms to move around spaces of networks, a vital ingredient in finding an optimal network given some biological data. A key such operation is the nearest neighbour interchange, or NNI. While it is already known that the space of unrooted phylogenetic networks is connected under NNI, it has been unclear whether this also holds for the subspace of tree-based networks. In this paper, we show that the space of unrooted tree-based phylogenetic networks is indeed connected under the NNI operation. We do so by explicitly showing how to get from one such network to another one without losing tree-basedness along the way. Moreover, we introduce some new concepts, for instance "shoat networks", and derive some interesting aspects concerning tree-basedness. Last, we use our results to derive an upper bound on the size of the space of tree-based networks.
Affiliation(s)
- Mareike Fischer
  - Institute of Mathematics and Computer Science, University of Greifswald, Greifswald, Germany
- Andrew Francis
  - Centre for Research in Mathematics and Data Science, Western Sydney University, Sydney, Australia
27
Fourment M, Magee AF, Whidden C, Bilge A, Matsen FA, Minin VN. 19 Dubious Ways to Compute the Marginal Likelihood of a Phylogenetic Tree Topology. Syst Biol 2020; 69:209-220. [PMID: 31504998 DOI: 10.1093/sysbio/syz046] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 11/28/2018] [Revised: 06/27/2019] [Accepted: 07/02/2019] [Indexed: 11/12/2022] Open
Abstract
The marginal likelihood of a model is a key quantity for assessing the evidence provided by the data in support of a model. The marginal likelihood is the normalizing constant for the posterior density, obtained by integrating the product of the likelihood and the prior with respect to model parameters. Thus, the computational burden of computing the marginal likelihood scales with the dimension of the parameter space. In phylogenetics, where we work with tree topologies that are high-dimensional models, standard approaches to computing marginal likelihoods are very slow. Here, we study methods to quickly compute the marginal likelihood of a single fixed tree topology. We benchmark the speed and accuracy of 19 different methods to compute the marginal likelihood of phylogenetic topologies on a suite of real data sets under the JC69 model. These methods include several new ones that we develop explicitly to solve this problem, as well as existing algorithms that we apply to phylogenetic models for the first time. Altogether, our results show that the accuracy of these methods varies widely, and that accuracy does not necessarily correlate with computational burden. Our newly developed methods are orders of magnitude faster than standard approaches, and in some cases, their accuracy rivals the best established estimators.
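The definition given in words in this abstract can be written out as follows. The symbols are assumptions introduced here for illustration: $D$ for the sequence data, $\tau$ for the fixed tree topology, and $\theta$ for the continuous parameters (for example, branch lengths).

```latex
% Marginal likelihood of a fixed topology: the likelihood times the prior,
% integrated over the continuous parameters.
p(D \mid \tau) = \int p(D \mid \tau, \theta)\, p(\theta \mid \tau)\, \mathrm{d}\theta

% It is the normalizing constant of the posterior density over \theta.
p(\theta \mid D, \tau) = \frac{p(D \mid \tau, \theta)\, p(\theta \mid \tau)}{p(D \mid \tau)}
```

The dimension of $\theta$ grows with the number of branches, which is why the cost of evaluating this integral scales with the size of the tree.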
Affiliation(s)
- Mathieu Fourment
  - University of Technology Sydney, ithree Institute, Ultimo NSW 2007, Australia
- Andrew F Magee
  - Department of Biology, University of Washington, Seattle, WA 98195, USA
- Chris Whidden
  - Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Arman Bilge
  - Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Vladimir N Minin
  - Department of Statistics, University of California, Irvine, CA 92697, USA
28
Whidden C, Claywell BC, Fisher T, Magee AF, Fourment M, Matsen FA. Systematic Exploration of the High Likelihood Set of Phylogenetic Tree Topologies. Syst Biol 2020; 69:280-293. [PMID: 31504997 DOI: 10.1093/sysbio/syz047] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/30/2018] [Revised: 05/29/2019] [Accepted: 04/09/2019] [Indexed: 11/12/2022] Open
Abstract
Bayesian Markov chain Monte Carlo explores tree space slowly, in part because it frequently returns to the same tree topology. An alternative strategy would be to explore tree space systematically, and never return to the same topology. In this article, we present an efficient parallelized method to map out the high likelihood set of phylogenetic tree topologies via systematic search, which we show to be a good approximation of the high posterior set of tree topologies on the data sets analyzed. Here, "likelihood" of a topology refers to the tree likelihood for the corresponding tree with optimized branch lengths. We call this method "phylogenetic topographer" (PT). The PT strategy is very simple: starting in a number of local topology maxima (obtained by hill-climbing from random starting points), explore out using local topology rearrangements, only continuing through topologies that are better than some likelihood threshold below the best observed topology. We show that the normalized topology likelihoods are a useful proxy for the Bayesian posterior probability of those topologies. By using a nonblocking hash table keyed on unique representations of tree topologies, we avoid visiting topologies more than once across all concurrent threads exploring tree space. We demonstrate that PT can be used directly to approximate a Bayesian consensus tree topology. When combined with an accurate means of evaluating per-topology marginal likelihoods, PT gives an alternative procedure for obtaining Bayesian posterior distributions on phylogenetic tree topologies.
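The search strategy described in this abstract can be sketched in a few lines of single-threaded Python. The function and parameter names are hypothetical, a plain `set` stands in for the nonblocking hash table, and the caller supplies `neighbors` (local rearrangements of a topology) and `log_lik` (likelihood with optimized branch lengths), neither of which is implemented here.

```python
def explore_high_likelihood_set(starts, neighbors, log_lik, threshold):
    """Systematically explore topologies, visiting each at most once.

    starts:    non-empty iterable of hashable topology keys (local maxima).
    neighbors: function mapping a topology to its local rearrangements.
    log_lik:   function mapping a topology to its log-likelihood.
    threshold: only continue through topologies within this many
               log-likelihood units of the best topology seen so far.
    """
    visited = set(starts)
    scores = {t: log_lik(t) for t in visited}
    best = max(scores.values())
    frontier = list(visited)
    while frontier:
        topo = frontier.pop()
        if scores[topo] < best - threshold:
            continue  # too poor to continue exploring through
        for nb in neighbors(topo):
            if nb in visited:
                continue  # never evaluate the same topology twice
            visited.add(nb)
            scores[nb] = log_lik(nb)
            best = max(best, scores[nb])
            frontier.append(nb)
    # Report the topologies that clear the final threshold.
    return {t: s for t, s in scores.items() if s >= best - threshold}
```

As a toy usage example, topologies can be modelled as integers on a line with neighbors `t - 1` and `t + 1`; a real application would key the set on a canonical representation of each tree topology.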
Affiliation(s)
- Chris Whidden
  - Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Thayer Fisher
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Andrew F Magee
  - Department of Biology, University of Washington, Seattle, WA 98195, USA
- Mathieu Fourment
  - ithree institute, University of Technology Sydney, Sydney, Australia
29
Dudas G, Bedford T. The ability of single genes vs full genomes to resolve time and space in outbreak analysis. BMC Evol Biol 2019; 19:232. [PMID: 31878875 PMCID: PMC6933756 DOI: 10.1186/s12862-019-1567-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 05/16/2019] [Accepted: 12/17/2019] [Indexed: 12/12/2022] Open
Abstract
Background: Inexpensive pathogen genome sequencing has had a transformative effect on the field of phylodynamics, where ever increasing volumes of data have promised real-time insight into outbreaks of infectious disease. As well as the sheer volume of pathogen isolates being sequenced, the sequencing of whole pathogen genomes, rather than select loci, has allowed phylogenetic analyses to be carried out at finer time scales, often approaching serial intervals for infections caused by rapidly evolving RNA viruses. Despite its utility, whole genome sequencing of pathogens has not been adopted universally and targeted sequencing of loci is common in some pathogen-specific fields. Results: In this study we highlighted the utility of sequencing whole genomes of pathogens by re-analysing a well-characterised collection of Ebola virus sequences in the form of complete viral genomes (≈19 kb long) or the rapidly evolving glycoprotein (GP, ≈2 kb long) gene. We have quantified changes in phylogenetic, temporal, and spatial inference resolution as a result of this reduction in data and compared these to theoretical expectations. Conclusions: We propose a simple intuitive metric for quantifying temporal resolution, i.e. the time scale over which sequence data might be informative of various processes, as a quick back-of-the-envelope calculation of statistical power available to molecular clock analyses.
Affiliation(s)
- Gytis Dudas
- Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, 98109, USA; Gothenburg Global Biodiversity Centre, Carl Skottsbergs gata 22B, Gothenburg, 413 19, Sweden
| | - Trevor Bedford
- Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, 98109, USA
|
30
|
Fourment M, Darling AE. Evaluating probabilistic programming and fast variational Bayesian inference in phylogenetics. PeerJ 2019; 7:e8272. [PMID: 31976168 PMCID: PMC6966998 DOI: 10.7717/peerj.8272] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 08/05/2019] [Accepted: 11/22/2019] [Indexed: 12/21/2022] Open
Abstract
Recent advances in statistical machine learning techniques have led to the creation of probabilistic programming frameworks. These frameworks enable probabilistic models to be rapidly prototyped and fit to data using scalable approximation methods such as variational inference. In this work, we explore the use of the Stan language for probabilistic programming in application to phylogenetic models. We show that many commonly used phylogenetic models including the general time reversible substitution model, rate heterogeneity among sites, and a range of coalescent models can be implemented using a probabilistic programming language. The posterior probability distributions obtained via the black box variational inference engine in Stan were compared to those obtained with reference implementations of Markov chain Monte Carlo (MCMC) for phylogenetic inference. We find that black box variational inference in Stan is less accurate than MCMC methods for phylogenetic models, but requires far less compute time. Finally, we evaluate a custom implementation of mean-field variational inference on the Jukes-Cantor substitution model and show that a specialized implementation of variational inference can be two orders of magnitude faster and more accurate than a general purpose probabilistic implementation.
Affiliation(s)
- Mathieu Fourment
- ithree Institute, University of Technology Sydney, Sydney, NSW, Australia
| | - Aaron E. Darling
- ithree Institute, University of Technology Sydney, Sydney, NSW, Australia
|
31
|
Palacios JA, Véber A, Cappello L, Wang Z, Wakeley J, Ramachandran S. Bayesian Estimation of Population Size Changes by Sampling Tajima's Trees. Genetics 2019; 213:967-986. [PMID: 31511299 PMCID: PMC6827370 DOI: 10.1534/genetics.119.302373] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/28/2019] [Accepted: 09/06/2019] [Indexed: 11/30/2022] Open
Abstract
The large state space of gene genealogies is a major hurdle for inference methods based on Kingman's coalescent. Here, we present a new Bayesian approach for inferring past population sizes, which relies on a lower-resolution coalescent process that we refer to as "Tajima's coalescent." Tajima's coalescent has a drastically smaller state space, and hence it is a computationally more efficient model, than the standard Kingman coalescent. We provide a new algorithm for efficient and exact likelihood calculations for data without recombination, which exploits a directed acyclic graph and a correspondingly tailored Markov Chain Monte Carlo method. We compare the performance of our Bayesian Estimation of population size changes by Sampling Tajima's Trees (BESTT) with a popular implementation of coalescent-based inference in BEAST using simulated and human data. We empirically demonstrate that BESTT can accurately infer effective population sizes, and it further provides an efficient alternative to the Kingman's coalescent. The algorithms described here are implemented in the R package phylodyn, which is available for download at https://github.com/JuliaPalacios/phylodyn.
Affiliation(s)
- Julia A Palacios
- Department of Statistics, Stanford University, California 94305
- Department of Biomedical Data Science, Stanford School of Medicine, California 94305
| | - Amandine Véber
- Centre de Mathématiques Appliquées, École Polytechnique 91128, Le Centre National de la Recherche Scientifique, Palaiseau, France 91767
| | | | - Zhangyuan Wang
- Department of Computer Science, Stanford University, California 94305
| | - John Wakeley
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts 02138
| | - Sohini Ramachandran
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island 02912
- Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island 02912
|
32
|
Biju VC, P R S, Vijayan S, Rajan VS, Sasi A, Janardhanan A, Nair AS. The Complete Chloroplast Genome of Trichopus zeylanicus, And Phylogenetic Analysis with Dioscoreales. THE PLANT GENOME 2019; 12:1-11. [PMID: 33016590 DOI: 10.3835/plantgenome2019.04.0032] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 04/26/2019] [Accepted: 09/25/2019] [Indexed: 06/11/2023]
Abstract
We present the first chloroplast genome from the genus Trichopus. Comparative analysis revealed that the IR regions are more conserved than the SC regions. Highly divergent sequence hot spots were identified, which could be used as molecular markers. Phylogenetic analysis gave insight into the evolutionary history of Trichopus zeylanicus. In this study, we determined the complete sequence of the chloroplast genome of an important, rare, and endangered medicinal plant, Trichopus zeylanicus. The analysis of the genome showed that the complete chloroplast genome of Trichopus zeylanicus is 153,497 bp in size, and has a quadripartite structure with a large single copy of 81,091 bp and a small single copy of 17,512 bp separated by inverted repeats of 27,447 bp. Sequence analysis revealed that the chloroplast genome encodes 112 unique genes, including 78 protein-coding genes, 30 tRNA genes, and four rRNA genes. We also identified 95 simple sequence repeats and 54 long repeats including 34 forward repeats, seven inverted repeats, nine palindromes, three reverse repeats, and one complementary repeat within the chloroplast genome of Trichopus zeylanicus. Whole chloroplast genome comparison with those of other Dioscoreales indicated that the inverted regions are more conserved than large single copy and small single copy regions. In the phylogenetic trees based on complete chloroplast genome and 78 shared chloroplast protein-coding genes in 15 monocot species, including 14 Dioscoreales, Trichopus zeylanicus formed a distinct clade. In summary, the first chloroplast genome from the genus Trichopus reported in this study gave a better insight into the phylogenetic relationships of different genera within the order Dioscoreales. Moreover, the present data will be a valuable chloroplast genomic resource for population genetics.
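The figures quoted in this abstract can be cross-checked with simple arithmetic. The short sketch below assumes, as is usual for a quadripartite chloroplast genome, that the inverted repeat occurs twice and that the quoted 27,447 bp is the length of one copy. The variable names are chosen here for illustration only.

```python
# Region lengths in bp, as quoted in the abstract.
large_single_copy = 81_091
small_single_copy = 17_512
inverted_repeat = 27_447  # assumed to be the length of ONE of the two copies

genome_size = large_single_copy + small_single_copy + 2 * inverted_repeat

# Gene and repeat tallies, as quoted in the abstract.
unique_genes = 78 + 30 + 4           # protein-coding + tRNA + rRNA
long_repeats = 34 + 7 + 9 + 3 + 1    # forward, inverted, palindromic, reverse, complementary

print(genome_size, unique_genes, long_repeats)
```

Under that assumption the three sums agree with the totals stated in the abstract (153,497 bp, 112 unique genes, and 54 long repeats).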
Affiliation(s)
| | - Shidhi P R
- Dep. of Computational Biology and Bioinformatics, Univ. of Kerala, Thiruvananthapuram, Kerala, India
| | - Sheethal Vijayan
- Dep. of Computational Biology and Bioinformatics, Univ. of Kerala, Thiruvananthapuram, Kerala, India
| | - Veena S Rajan
- Dep. of Computational Biology and Bioinformatics, Univ. of Kerala, Thiruvananthapuram, Kerala, India
| | - Anu Sasi
- Dep. of Computational Biology and Bioinformatics, Univ. of Kerala, Thiruvananthapuram, Kerala, India
| | - Akhil Janardhanan
- Dep. of Computational Biology and Bioinformatics, Univ. of Kerala, Thiruvananthapuram, Kerala, India
| | - Achuthsankar S Nair
- Dep. of Computational Biology and Bioinformatics, Univ. of Kerala, Thiruvananthapuram, Kerala, India
|
33
|
Vadakkemukadiyil Chellappan B, Pr S, Vijayan S, Rajan VS, Sasi A, Nair AS. High Quality Draft Genome of Arogyapacha (Trichopus zeylanicus), an Important Medicinal Plant Endemic to Western Ghats of India. G3 (BETHESDA, MD.) 2019; 9:2395-2404. [PMID: 31189529 PMCID: PMC6686938 DOI: 10.1534/g3.119.400164] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/11/2019] [Accepted: 06/05/2019] [Indexed: 12/21/2022]
Abstract
Arogyapacha, the local name of Trichopus zeylanicus, is a rare, indigenous medicinal plant of India. This plant is famous for its traditional use as an instant energy stimulant. So far, no genomic resource is available for this important plant and hence its metabolic pathways are poorly understood. Here, we report on a high-quality draft assembly of the approximately 713.4 Mb genome of T. zeylanicus, the first draft genome from the genus Trichopus. The assembly was generated in a hybrid approach using Illumina short reads and PacBio longer reads. The total assembly comprised 22601 scaffolds with an N50 value of 433.3 Kb. We predicted 34452 protein coding genes in the T. zeylanicus genome and found that a significant portion of these predicted genes were associated with various secondary metabolite biosynthetic pathways. Comparative genome analysis revealed extensive gene collinearity between T. zeylanicus and its closely related plant species. The present genome and annotation data provide an essential resource to speed up research on secondary metabolism, breeding and molecular evolution of T. zeylanicus.
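For readers unfamiliar with the N50 statistic quoted in this abstract: it is the length L such that scaffolds of length at least L together contain at least half of the assembled bases. The sketch below shows the conventional calculation. The function name and the example lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical scaffold lengths, for illustration only.
example = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example))
```

A higher N50 for the same total assembly size indicates that more of the genome sits in long scaffolds.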
Affiliation(s)
| | - Shidhi Pr
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
| | - Sheethal Vijayan
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
| | - Veena S Rajan
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
| | - Anu Sasi
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
| | - Achuthsankar S Nair
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
|
34
|
Russel PM, Brewer BJ, Klaere S, Bouckaert RR. Model Selection and Parameter Inference in Phylogenetics Using Nested Sampling. Syst Biol 2019; 68:219-233. [PMID: 29961836 DOI: 10.1093/sysbio/syy050] [Citation(s) in RCA: 69] [Impact Index Per Article: 13.8] [Received: 04/05/2016] [Accepted: 06/22/2018] [Indexed: 12/28/2022] Open
Abstract
Bayesian inference methods rely on numerical algorithms for both model selection and parameter inference. In general, these algorithms require a high computational effort to yield reliable estimates. One of the major challenges in phylogenetics is the estimation of the marginal likelihood. This quantity is commonly used for comparing different evolutionary models, but its calculation, even for simple models, incurs high computational cost. Another interesting challenge relates to the estimation of the posterior distribution. Often, long Markov chains are required to get sufficient samples to carry out parameter inference, especially for tree distributions. In general, these problems are addressed separately by using different procedures. Nested sampling (NS) is a Bayesian computation algorithm, which provides the means to estimate marginal likelihoods together with their uncertainties, and to sample from the posterior distribution at no extra cost. The methods currently used in phylogenetics for marginal likelihood estimation lack in practicality due to their dependence on many tuning parameters and their inability of most implementations to provide a direct way to calculate the uncertainties associated with the estimates, unlike NS. In this article, we introduce NS to phylogenetics. Its performance is analysed under different scenarios and compared to established methods. We conclude that NS is a competitive and attractive algorithm for phylogenetic inference. An implementation is available as a package for BEAST 2 under the LGPL licence, accessible at https://github.com/BEAST2-Dev/nested-sampling.
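To make the idea of nested sampling concrete, here is a minimal sketch on a one-dimensional toy problem. It is not the BEAST 2 package and not a phylogenetic model: the standard normal likelihood, the uniform prior on [-5, 5], the function names, and the run settings are all assumptions chosen so that the true answer is known (the marginal likelihood is very close to 0.1). Because this toy likelihood decreases with |x|, drawing a replacement point uniformly inside the current bound is an exact draw from the constrained prior.

```python
import math
import random


def log_likelihood(x):
    """Log density of a standard normal, used as a toy likelihood."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def nested_sampling_log_z(n_live=200, n_iter=2000, half_width=5.0, seed=1):
    """Estimate the log marginal likelihood under a Uniform(-w, w) prior."""
    rng = random.Random(seed)
    live = [rng.uniform(-half_width, half_width) for _ in range(n_live)]
    z = 0.0
    x_prev = 1.0  # prior mass still enclosed by the likelihood contour
    for i in range(1, n_iter + 1):
        # Lowest likelihood corresponds to the largest |x| in this toy problem.
        worst = max(live, key=abs)
        x_curr = math.exp(-i / n_live)  # expected shrinkage of the prior mass
        z += math.exp(log_likelihood(worst)) * (x_prev - x_curr)
        x_prev = x_curr
        # Replace the worst point with a prior draw of higher likelihood.
        bound = abs(worst)
        live[live.index(worst)] = rng.uniform(-bound, bound)
    # Add the contribution of the remaining live points.
    z += x_prev * sum(math.exp(log_likelihood(x)) for x in live) / n_live
    return math.log(z)


estimate = nested_sampling_log_z()
print(estimate, math.log(0.1))
```

The discarded points, weighted by their likelihood times the prior-mass shrinkage, can also be reused as posterior samples, which is the "no extra cost" property mentioned in the abstract. In real problems the constrained prior draw is the hard part and is usually done with a short MCMC run.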
Affiliation(s)
| | - Brendon J Brewer
- Department of Statistics, The University of Auckland, Auckland, New Zealand
| | - Steffen Klaere
- Department of Statistics, The University of Auckland, Auckland, New Zealand; School of Biological Sciences, University of Auckland, Auckland, New Zealand
| | - Remco R Bouckaert
- Center of Computational Evolution, University of Auckland, Auckland, New Zealand; Max Planck Institute for the Science of Human History, Jena, Germany
|
35
|
Garba MK, Nye TMW, Boys RJ. Probabilistic Distances Between Trees. Syst Biol 2018; 67:320-327. [PMID: 29029295 PMCID: PMC5837584 DOI: 10.1093/sysbio/syx080] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 02/24/2017] [Accepted: 09/18/2017] [Indexed: 12/03/2022] Open
Abstract
Most existing measures of distance between phylogenetic trees are based on the geometry or topology of the trees. Instead, we consider distance measures which are based on the underlying probability distributions on genetic sequence data induced by trees. Monte Carlo schemes are necessary to calculate these distances approximately, and we describe efficient sampling procedures. Key features of the distances are the ability to include substitution model parameters and to handle trees with different taxon sets in a principled way. We demonstrate some of the properties of these new distance measures and compare them to existing distances, in particular by applying multidimensional scaling to data sets previously reported as containing phylogenetic islands. [Metric; probability distribution; multidimensional scaling; information geometry.]
Affiliation(s)
- Maryam K Garba
- School of Mathematics & Statistics, Newcastle University, Newcastle upon Tyne NE1 7RU, UK; Department of Mathematical Sciences, Bayero University, Kano, Nigeria
| | - Tom M W Nye
- School of Mathematics & Statistics, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
| | - Richard J Boys
- School of Mathematics & Statistics, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
|
36
|
Whidden C, Matsen F. Calculating the Unrooted Subtree Prune-and-Regraft Distance. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:898-911. [PMID: 29994585 DOI: 10.1109/tcbb.2018.2802911] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 06/08/2023]
Abstract
The subtree prune-and-regraft (SPR) distance metric is a fundamental way of comparing evolutionary trees. It has wide-ranging applications, such as to study lateral genetic transfer, viral recombination, and Markov chain Monte Carlo phylogenetic inference. Although the rooted version of SPR distance can be computed relatively efficiently between rooted trees using fixed-parameter-tractable maximum agreement forest (MAF) algorithms, no MAF formulation is known for the unrooted case. Correspondingly, previous algorithms are unable to compute unrooted SPR distances larger than 7.
|
37
|
Copetti D, Búrquez A, Bustamante E, Charboneau JLM, Childs KL, Eguiarte LE, Lee S, Liu TL, McMahon MM, Whiteman NK, Wing RA, Wojciechowski MF, Sanderson MJ. Extensive gene tree discordance and hemiplasy shaped the genomes of North American columnar cacti. Proc Natl Acad Sci U S A 2017; 114:12003-12008. [PMID: 29078296 PMCID: PMC5692538 DOI: 10.1073/pnas.1706367114] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Indexed: 11/18/2022] Open
Abstract
Few clades of plants have proven as difficult to classify as cacti. One explanation may be an unusually high level of convergent and parallel evolution (homoplasy). To evaluate support for this phylogenetic hypothesis at the molecular level, we sequenced the genomes of four cacti in the especially problematic tribe Pachycereeae, which contains most of the large columnar cacti of Mexico and adjacent areas, including the iconic saguaro cactus (Carnegiea gigantea) of the Sonoran Desert. We assembled a high-coverage draft genome for saguaro and lower coverage genomes for three other genera of tribe Pachycereeae (Pachycereus, Lophocereus, and Stenocereus) and a more distant outgroup cactus, Pereskia. We used these to construct 4,436 orthologous gene alignments. Species tree inference consistently returned the same phylogeny, but gene tree discordance was high: 37% of gene trees having at least 90% bootstrap support conflicted with the species tree. Evidently, discordance is a product of long generation times and moderately large effective population sizes, leading to extensive incomplete lineage sorting (ILS). In the best supported gene trees, 58% of apparent homoplasy at amino sites in the species tree is due to gene tree-species tree discordance rather than parallel substitutions in the gene trees themselves, a phenomenon termed "hemiplasy." The high rate of genomic hemiplasy may contribute to apparent parallelisms in phenotypic traits, which could confound understanding of species relationships and character evolution in cacti.
Affiliation(s)
- Dario Copetti
- Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ 85721
- International Rice Research Institute, Los Baños, Laguna, Philippines
| | - Alberto Búrquez
- Instituto de Ecología, Unidad Hermosillo, Universidad Nacional Autónoma de México, Hermosillo, Sonora, Mexico
| | - Enriquena Bustamante
- Instituto de Ecología, Unidad Hermosillo, Universidad Nacional Autónoma de México, Hermosillo, Sonora, Mexico
| | - Joseph L M Charboneau
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721
| | - Kevin L Childs
- Department of Plant Biology, Michigan State University, East Lansing, MI 48824
| | - Luis E Eguiarte
- Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Ciudad de México, Mexico
| | - Seunghee Lee
- Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ 85721
| | - Tiffany L Liu
- Department of Plant Biology, Michigan State University, East Lansing, MI 48824
| | | | - Noah K Whiteman
- Department of Integrative Biology, University of California, Berkeley, CA 94720
| | - Rod A Wing
- Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ 85721
- International Rice Research Institute, Los Baños, Laguna, Philippines
| | | | - Michael J Sanderson
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721
|
38
|
The combinatorics of discrete time-trees: theory and open problems. J Math Biol 2017; 76:1101-1121. [PMID: 28756523 PMCID: PMC5829145 DOI: 10.1007/s00285-017-1167-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 12/29/2016] [Revised: 07/14/2017] [Indexed: 10/30/2022]
Abstract
A time-tree is a rooted phylogenetic tree such that all internal nodes are equipped with absolute divergence dates and all leaf nodes are equipped with sampling dates. Such time-trees have become a central object of study in phylogenetics but little is known about the parameter space of such objects. Here we introduce and study a hierarchy of discrete approximations of the space of time-trees from the graph-theoretic and algorithmic point of view. One of the basic and widely used phylogenetic graphs, the [Formula: see text] graph, is the roughest approximation and bottom level of our hierarchy. More refined approximations discretize the relative timing of evolutionary divergence and sampling dates. We study basic graph-theoretic questions for these graphs, including the size of neighborhoods, diameter upper and lower bounds, and the problem of finding shortest paths. We settle many of these questions by extending the concept of graph grammars introduced by Sleator, Tarjan, and Thurston to our graphs. Although time values greatly increase the number of possible trees, we show that 1-neighborhood sizes remain linear, allowing for efficient local exploration and construction of these graphs. We also obtain upper bounds on the r-neighborhood sizes of these graphs, including a smaller bound than was previously known for [Formula: see text]. Our results open up a number of possible directions for theoretical investigation of graph-theoretic and algorithmic properties of the time-tree graphs. We discuss the directions that are most valuable for phylogenetic applications and give a list of prominent open problems for those applications. In particular, we conjecture that the split theorem applies to shortest paths in time-tree graphs, a property not shared in the general [Formula: see text] case.
|
39
|
Bordewich M, Linz S, Semple C. Lost in space? Generalising subtree prune and regraft to spaces of phylogenetic networks. J Theor Biol 2017; 423:1-12. [PMID: 28414085 DOI: 10.1016/j.jtbi.2017.03.032] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 07/20/2016] [Revised: 03/18/2017] [Accepted: 03/20/2017] [Indexed: 10/19/2022]
Abstract
Over the last fifteen years, phylogenetic networks have become a popular tool to analyse relationships between species whose past includes reticulation events such as hybridisation or horizontal gene transfer. However, the space of phylogenetic networks is significantly larger than that of phylogenetic trees, and how to analyse and search this enlarged space remains a poorly understood problem. Inspired by the widely-used rooted subtree prune and regraft (rSPR) operation on rooted phylogenetic trees, we propose a new operation, called subnet prune and regraft (SNPR), that induces a metric on the space of all rooted phylogenetic networks on a fixed set of leaves. We show that the spaces of several popular classes of rooted phylogenetic networks (e.g. tree child, reticulation visible, and tree based) are connected under SNPR and that connectedness remains for the subclasses of these networks with a fixed number of reticulations. Lastly, we bound the distance between two rooted phylogenetic networks under the SNPR operation, show that it is computationally hard to compute this distance exactly, and analyse how the SNPR-distance between two such networks relates to the rSPR-distance between rooted phylogenetic trees that are embedded in these networks.
Affiliation(s)
- Magnus Bordewich
- School of Engineering and Computing Sciences, Durham University, Durham DH1 3LE, United Kingdom.
| | - Simone Linz
- Department of Computer Science, The University of Auckland, Private Bag 92019, Auckland 1142, New Zealand.
| | - Charles Semple
- School of Mathematics and Statistics, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand.
|
40
|
Ali RH, Bark M, Miró J, Muhammad SA, Sjöstrand J, Zubair SM, Abbas RM, Arvestad L. VMCMC: a graphical and statistical analysis tool for Markov chain Monte Carlo traces. BMC Bioinformatics 2017; 18:97. [PMID: 28187712 PMCID: PMC5301390 DOI: 10.1186/s12859-017-1505-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 11/07/2015] [Accepted: 01/28/2017] [Indexed: 12/21/2022] Open
Abstract
Background: MCMC-based methods are important for Bayesian inference of phylogeny and related parameters. Although computationally expensive, MCMC yields estimates of posterior distributions that are useful for estimating parameter values and are easy to use in subsequent analysis. There are, however, sometimes practical difficulties with MCMC, relating to convergence assessment and determining burn-in, especially in large-scale analyses. Currently, multiple software packages are required to perform, e.g., convergence and mixing assessment and interactive exploration of both continuous and tree parameters. Results: We have written software called VMCMC to simplify post-processing of MCMC traces with, for example, automatic burn-in estimation. VMCMC can be used both as a GUI-based application, supporting interactive exploration, and as a command-line tool suitable for automated pipelines. Conclusions: VMCMC is free software available under the New BSD License. Executable jar files, tutorial manual and source code can be downloaded from https://bitbucket.org/rhali/visualmcmc/. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1505-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Raja H Ali
- KTH Royal Institute of Technology, Swedish e-Science Research Centre, Science for Life Laboratory, School of Computer Science and Communication, Solna, SE-171 77, Sweden
| | - Mikael Bark
- KTH Royal Institute of Technology, School of Information and Communication Technology, Kista, SE-164 40, Sweden
| | - Jorge Miró
- KTH Royal Institute of Technology, School of Information and Communication Technology, Kista, SE-164 40, Sweden
| | - Sayyed A Muhammad
- KTH Royal Institute of Technology, Swedish e-Science Research Centre, Science for Life Laboratory, School of Computer Science and Communication, Solna, SE-171 77, Sweden
| | - Joel Sjöstrand
- Department of Numerical Analysis and Computer Science, Swedish e-Science Research Centre, Science for Life Laboratory, Stockholm University, Stockholm, SE-100 44, Sweden
| | - Syed M Zubair
- KTH Royal Institute of Technology, Laboratory for Communication Networks, School of Electrical Engineering, Stockholm, SE-100 44, Sweden; Department of Computer Science and Information Technology, University of Balochistan, Quetta, PK-87 300, Pakistan
| | - Raja M Abbas
- Department of Computer Science and Engineering, University of Gothenburg, Gothenburg, SE-411 37, Sweden
| | - Lars Arvestad
- Department of Numerical Analysis and Computer Science, Swedish e-Science Research Centre, Science for Life Laboratory, Stockholm University, Stockholm, SE-100 44, Sweden.
|
41
|
St. John K. Review Paper: The Shape of Phylogenetic Treespace. Syst Biol 2017; 66:e83-e94. [PMID: 28173538 PMCID: PMC5837343 DOI: 10.1093/sysbio/syw025] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 11/20/2015] [Revised: 12/16/2015] [Accepted: 03/22/2016] [Indexed: 11/23/2022] Open
Abstract
Trees are a canonical structure for representing evolutionary histories. Many popular criteria used to infer optimal trees are computationally hard, and the number of possible tree shapes grows super-exponentially in the number of taxa. The underlying structure of the spaces of trees yields rich insights that can improve the search for optimal trees, both in accuracy and in running time, and the analysis and visualization of results. We review the past work on analyzing and comparing trees by their shape as well as recent work that incorporates trees with weighted branch lengths.
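The super-exponential growth mentioned in this abstract can be made concrete with the standard counting formulas for fully resolved, leaf-labelled trees: (2n-5)!! unrooted and (2n-3)!! rooted topologies on n taxa. The sketch below evaluates them. The function names are chosen here for illustration, and the formulas count labelled binary topologies rather than unlabelled shapes.

```python
def num_unrooted_binary_topologies(n_taxa):
    """Number of unrooted binary leaf-labelled topologies: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def num_rooted_binary_topologies(n_taxa):
    """Number of rooted binary leaf-labelled topologies: (2n - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    # Rooting a tree is equivalent to adding one extra leaf to an unrooted tree.
    return num_unrooted_binary_topologies(n_taxa + 1)


for n in (5, 10, 20):
    print(n, num_unrooted_binary_topologies(n), num_rooted_binary_topologies(n))
```

Already at 10 taxa there are about two million unrooted topologies, which is why exhaustive enumeration is impractical for all but the smallest data sets.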
Affiliation(s)
- Katherine St. John
- Department of Mathematics and Computer Science, Lehman College, NY 10034, USA
|
42
|
Eaton DAR, Spriggs EL, Park B, Donoghue MJ. Misconceptions on Missing Data in RAD-seq Phylogenetics with a Deep-scale Example from Flowering Plants. Syst Biol 2016; 66:399-412. [DOI: 10.1093/sysbio/syw092] [Citation(s) in RCA: 72] [Impact Index Per Article: 9.0] [Received: 03/29/2016] [Accepted: 10/10/2016] [Indexed: 01/08/2023] Open
|
43
|
Lanfear R, Hua X, Warren DL. Estimating the Effective Sample Size of Tree Topologies from Bayesian Phylogenetic Analyses. Genome Biol Evol 2016; 8:2319-32. [PMID: 27435794 PMCID: PMC5010905 DOI: 10.1093/gbe/evw171] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Indexed: 11/13/2022] Open
Abstract
Bayesian phylogenetic analyses estimate posterior distributions of phylogenetic tree topologies and other parameters using Markov chain Monte Carlo (MCMC) methods. Before making inferences from these distributions, it is important to assess their adequacy. To this end, the effective sample size (ESS) estimates how many truly independent samples of a given parameter the output of the MCMC represents. The ESS of a parameter is frequently much lower than the number of samples taken from the MCMC because sequential samples from the chain can be non-independent due to autocorrelation. Typically, phylogeneticists use a rule of thumb that the ESS of all parameters should be greater than 200. However, we have no method to calculate an ESS of tree topology samples, despite the fact that the tree topology is often the parameter of primary interest and is almost always central to the estimation of other parameters. That is, we lack a method to determine whether we have adequately sampled one of the most important parameters in our analyses. In this study, we address this problem by developing methods to estimate the ESS for tree topologies. We combine these methods with two new diagnostic plots for assessing posterior samples of tree topologies, and compare their performance on simulated and empirical data sets. Combined, the methods we present provide new ways to assess the mixing and convergence of phylogenetic tree topologies in Bayesian MCMC analyses.
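As background for the rule of thumb cited in this abstract, the sketch below shows a conventional autocorrelation-based ESS estimate for a scalar trace: N divided by one plus twice the sum of positive-lag autocorrelations. It is not the authors' topology ESS method, which the abstract does not spell out. The truncation rule (stop at the first non-positive autocorrelation), the function names, and the AR(1) test traces are assumptions made for illustration.

```python
import random


def effective_sample_size(trace):
    """Estimate the ESS of a scalar MCMC trace from its autocorrelation."""
    n = len(trace)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(trace) / n
    centered = [x - mean for x in trace]
    variance = sum(c * c for c in centered) / n
    if variance == 0.0:
        raise ValueError("trace is constant")
    rho_sum = 0.0
    for lag in range(1, n):
        autocov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = autocov / variance
        if rho <= 0.0:
            break  # simple truncation: ignore noisy lags past the first sign change
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def ar1_trace(n, phi, seed=0):
    """Simulate an AR(1) series; phi near 1 mimics a slowly mixing chain."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


well_mixed = effective_sample_size(ar1_trace(2000, 0.0, seed=1))
sticky = effective_sample_size(ar1_trace(2000, 0.9, seed=1))
print(round(well_mixed), round(sticky))
```

Applying this to tree topologies requires first mapping each sampled topology to a number, for example a distance to a reference tree. That mapping is the part the paper addresses and is not attempted here.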
Affiliation(s)
- Robert Lanfear: Department of Biological Sciences, Macquarie University, Sydney, Australia; Ecology, Evolution, and Genetics, Australian National University, Canberra, Australia
- Xia Hua: Ecology, Evolution, and Genetics, Australian National University, Canberra, Australia
- Dan L Warren: Department of Biological Sciences, Macquarie University, Sydney, Australia; Ecology, Evolution, and Genetics, Australian National University, Canberra, Australia
|
44
|
Lewis PO, Chen MH, Kuo L, Lewis LA, Fučíková K, Neupane S, Wang YB, Shi D. Estimating Bayesian Phylogenetic Information Content. Syst Biol 2016; 65:1009-1023. [PMID: 27155008 PMCID: PMC5066063 DOI: 10.1093/sysbio/syw042] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 01/04/2016] [Revised: 04/15/2016] [Accepted: 05/01/2016] [Indexed: 11/13/2022]
Abstract
Measuring the phylogenetic information content of data has a long history in systematics. Here we explore a Bayesian approach to information content estimation. The entropy of the posterior distribution compared with the entropy of the prior distribution provides a natural way to measure information content. If the data have no information relevant to ranking tree topologies beyond the information supplied by the prior, the posterior and prior will be identical. Information in data discourages consideration of some hypotheses allowed by the prior, resulting in a posterior distribution that is more concentrated (has lower entropy) than the prior. We focus on measuring information about tree topology using marginal posterior distributions of tree topologies. We show that both the accuracy and the computational efficiency of topological information content estimation improve with use of the conditional clade distribution, which also allows topological information content to be partitioned by clade. We explore two important applications of our method: providing a compelling definition of saturation and detecting conflict among data partitions that can negatively affect analyses of concatenated data. [Bayesian; concatenation; conditional clade distribution; entropy; information; phylogenetics; saturation.].
Affiliation(s)
- Paul O Lewis: Department of Ecology and Evolutionary Biology, University of Connecticut, 75 N. Eagleville Road, Unit 3043, Storrs, CT 06269, USA
- Ming-Hui Chen: Department of Statistics, University of Connecticut, 215 Glenbrook Road, Unit 4120, Storrs, CT 06269, USA
- Lynn Kuo: Department of Statistics, University of Connecticut, 215 Glenbrook Road, Unit 4120, Storrs, CT 06269, USA
- Louise A Lewis: Department of Ecology and Evolutionary Biology, University of Connecticut, 75 N. Eagleville Road, Unit 3043, Storrs, CT 06269, USA
- Karolina Fučíková: Department of Ecology and Evolutionary Biology, University of Connecticut, 75 N. Eagleville Road, Unit 3043, Storrs, CT 06269, USA
- Suman Neupane: Department of Ecology and Evolutionary Biology, University of Connecticut, 75 N. Eagleville Road, Unit 3043, Storrs, CT 06269, USA
- Yu-Bo Wang: Department of Statistics, University of Connecticut, 215 Glenbrook Road, Unit 4120, Storrs, CT 06269, USA
- Daoyuan Shi: Department of Statistics, University of Connecticut, 215 Glenbrook Road, Unit 4120, Storrs, CT 06269, USA
|
45
|
Lewitus E, Morlon H. Characterizing and Comparing Phylogenies from their Laplacian Spectrum. Syst Biol 2015; 65:495-507. [PMID: 26658901 DOI: 10.1093/sysbio/syv116] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Received: 03/23/2015] [Accepted: 12/04/2015] [Indexed: 11/14/2022]
Abstract
Phylogenetic trees are central to many areas of biology, ranging from population genetics and epidemiology to microbiology, ecology, and macroevolution. The ability to summarize properties of trees, compare different trees, and identify distinct modes of division within trees is essential to all these research areas. But despite wide-ranging applications, there currently exists no common, comprehensive framework for such analyses. Here we present a graph-theoretical approach that provides such a framework. We show how to construct the spectral density profile of a phylogenetic tree from its Laplacian graph. Using ultrametric simulated trees as well as non-ultrametric empirical trees, we demonstrate that the spectral density successfully identifies various properties of the trees and clusters them into meaningful groups. Finally, we illustrate how the eigengap can identify modes of division within a given tree. As phylogenetic data continue to accumulate and to be integrated into various areas of the life sciences, we expect that this spectral graph-theoretical framework to phylogenetics will have powerful and long-lasting applications.
Affiliation(s)
- Eric Lewitus: Institut de Biologie (IBENS), École Normale Supérieure, Paris, France
- Helene Morlon: Institut de Biologie (IBENS), École Normale Supérieure, Paris, France
|
46
|
Zhang X, Wang Y, Wang J, Sun F. Protein-protein interactions among signaling pathways may become new therapeutic targets in liver cancer (Review). Oncol Rep 2015; 35:625-38. [PMID: 26717966 DOI: 10.3892/or.2015.4464] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 05/29/2015] [Accepted: 07/06/2015] [Indexed: 11/05/2022]
Abstract
Numerous signaling pathways have been shown to be dysregulated in liver cancer. In addition, some protein-protein interactions are prerequisite for the uncontrolled activation or inhibition of these signaling pathways. For instance, in the PI3K/AKT signaling pathway, protein AKT binds with a number of proteins such as mTOR, FOXO1 and MDM2 to play an oncogenic role in liver cancer. The aim of the present review was to focus on a series of important protein-protein interactions that can serve as potential therapeutic targets in liver cancer among certain important pro-carcinogenic signaling pathways. The strategies of how to investigate and analyze the protein-protein interactions are also included in this review. A survey of these protein interactions may provide alternative therapeutic targets in liver cancer.
Affiliation(s)
- Xiao Zhang: Department of Clinical Laboratory Medicine, Shanghai Tenth People's Hospital of Tongji University, Shanghai 200072, P.R. China
- Yulan Wang: Department of Clinical Laboratory Medicine, Shanghai Tenth People's Hospital of Tongji University, Shanghai 200072, P.R. China
- Jiayi Wang: Department of Clinical Laboratory Medicine, Shanghai Tenth People's Hospital of Tongji University, Shanghai 200072, P.R. China
- Fenyong Sun: Department of Clinical Laboratory Medicine, Shanghai Tenth People's Hospital of Tongji University, Shanghai 200072, P.R. China
|