1
Collienne L, Whidden C, Gavryushkin A. Ranked Subtree Prune and Regraft. Bull Math Biol 2024; 86:24. [PMID: 38294587 PMCID: PMC10830682 DOI: 10.1007/s11538-023-01244-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 12/06/2023] [Indexed: 02/01/2024]
Abstract
Phylogenetic trees are a mathematical formalisation of evolutionary histories between organisms, species, genes, cancer cells, etc. For many applications, e.g. when analysing virus transmission trees or cancer evolution, (phylogenetic) time trees are of interest, where branch lengths represent times. Computational methods for reconstructing time trees from (typically molecular) sequence data, for example Bayesian phylogenetic inference using Markov Chain Monte Carlo (MCMC) methods, rely on algorithms that sample the treespace. They employ tree rearrangement operations such as SPR (Subtree Prune and Regraft) and NNI (Nearest Neighbour Interchange) or, in the case of time tree inference, versions of these that take times of internal nodes into account. While the classic SPR tree rearrangement is well-studied, its variants for time trees are less understood, limiting comparative analysis for time tree methods. In this paper we consider a modification of the classical SPR rearrangement on the space of ranked phylogenetic trees, which are trees equipped with a ranking of all internal nodes. This modification results in two novel treespaces, which we propose to study. We begin this study by discussing algorithmic properties of these treespaces, focusing on those relating to the complexity of computing distances under the ranked SPR operations as well as similarities and differences to known tree rearrangement based treespaces. Surprisingly, we show the counterintuitive result that adding leaves to trees can actually decrease their ranked SPR distance, which may have an impact on the results of time tree sampling algorithms given uncertain "rogue taxa".
Affiliation(s)
- Lena Collienne
  - Biological Data Science Laboratory, School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
- Chris Whidden
  - Faculty of Computer Science, Dalhousie University, Halifax, Canada
- Alex Gavryushkin
  - Biological Data Science Laboratory, School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
  - Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand
2
Xuan R, Gao J, Lin Q, Yue W, Liu T, Hu S, Song G. Mitochondrial DNA Diversity of Mesocricetus auratus and Other Cricetinae Species among Cricetidae Family. Biochem Genet 2022; 60:1881-1894. [PMID: 35122557 PMCID: PMC8817650 DOI: 10.1007/s10528-022-10195-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/11/2021] [Accepted: 01/25/2022] [Indexed: 11/29/2022]
Abstract
Unique anatomical and physiological features have made hamster species desirable research models. Comparative genomics and phylogenetic analysis of the hamster family members to clarify their evolution and genetic relationship can provide a genetic basis for the comprehension of the variable research results obtained using different hamster models. The Syrian golden hamster (Mesocricetus auratus) is the most widely used species. In this study, we sequenced the complete mitochondrial genome (mitogenome) of M. auratus, compared it with the mitogenome of other Cricetinae subfamily species, and defined its phylogenetic position in the Cricetidae family. Our results show that the mitogenome organization, gene arrangement, base composition, and genetic analysis of the protein coding genes (PCGs) of M. auratus are similar to those observed in previous reports on Cricetinae species. Nonetheless, our analysis clarifies some striking differences of M. auratus relative to other subfamily members, namely distinct codon usage frequency of TAT (Tyr), AAT (Asn), and GAA (Glu) and the presence of the conserved sequence block 3 (CSB-3) in the control region of M. auratus mitogenome and other hamsters (not found in Arvicolinae). These results suggest the particularity of amino acid codon usage bias of M. auratus and special regulatory signals for the heavy strand replication in Cricetinae. Additionally, the Bayesian inference/maximum likelihood (BI/ML) tree shows that Cricetinae and Arvicolinae are sister taxa sharing a common ancestor, and Neotominae split prior to the split between Cricetinae and Arvicolinae. Our results support taxonomy revisions in Cricetulus kamensis and Cricetulus migratorius, and further revision is needed within the other two subfamilies. Among the hamster research models, Cricetulus griseus is the species with the highest sequence similarity and closest genetic relationship with M. auratus. Our results show mitochondrial DNA diversity of M. auratus and other Cricetinae species and provide a genetic basis for judgement of different hamster models, promoting the development and usage of hamsters with regional characteristics.
Affiliation(s)
- Ruijing Xuan
  - Laboratory Animal Center, Shanxi Medical University, Taiyuan, 030001, China
- Jiping Gao
  - Laboratory Animal Center, Shanxi Medical University, Taiyuan, 030001, China
- Qiang Lin
  - Key Laboratory of Genome Information and Sciences, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100029, China
- Wenbin Yue
  - College of Animal Science and Technology, Shanxi Agricultural University, Taigu, 030801, China
- Tianfu Liu
  - Laboratory Animal Center, Shanxi Medical University, Taiyuan, 030001, China
- Songnian Hu
  - Key Laboratory of Genome Information and Sciences, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100029, China
- Guohua Song
  - Laboratory Animal Center, Shanxi Medical University, Taiyuan, 030001, China
3
Fisher AA, Hassler GW, Ji X, Baele G, Suchard MA, Lemey P. Scalable Bayesian phylogenetics. Philos Trans R Soc Lond B Biol Sci 2022; 377:20210242. [PMID: 35989603 PMCID: PMC9393558 DOI: 10.1098/rstb.2021.0242] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/01/2023]
Abstract
Recent advances in Bayesian phylogenetics offer substantial computational savings to accommodate increased genomic sampling that challenges traditional inference methods. In this review, we begin with a brief summary of the Bayesian phylogenetic framework, and then conceptualize a variety of methods to improve posterior approximations via Markov chain Monte Carlo (MCMC) sampling. Specifically, we discuss methods to improve the speed of likelihood calculations, reduce MCMC burn-in, and generate better MCMC proposals. We apply several of these techniques to study the evolution of HIV virulence along a 1536-tip phylogeny and estimate the internal node heights of a 1000-tip SARS-CoV-2 phylogenetic tree in order to illustrate the speed-up of such analyses using current state-of-the-art approaches. We conclude our review with a discussion of promising alternatives to MCMC that approximate the phylogenetic posterior. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
Affiliation(s)
- Gabriel W. Hassler
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, University of California, Los Angeles, CA 90095, USA
- Xiang Ji
  - Department of Mathematics, School of Science and Engineering, Tulane University, New Orleans, LA 70118, USA
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, 3000 Leuven, Belgium
- Marc A. Suchard
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, University of California, Los Angeles, CA 90095, USA
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California, Los Angeles, CA 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, CA 90095, USA
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, 3000 Leuven, Belgium
4
An adjacent-swap Markov chain on coalescent trees. J Appl Probab 2022. [DOI: 10.1017/jpr.2022.15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
The standard coalescent is widely used in evolutionary biology and population genetics to model the ancestral history of a sample of molecular sequences as a rooted and ranked binary tree. In this paper we present a representation of the space of ranked trees as a space of constrained ordered matched pairs. We use this representation to define ergodic Markov chains on labeled and unlabeled ranked tree shapes analogously to transposition chains on the space of permutations. We show that an adjacent-swap chain on labeled and unlabeled ranked tree shapes has a mixing time at least of order $n^3$, and at most of order $n^{4}$. Bayesian inference methods rely on Markov chain Monte Carlo methods on the space of trees. Thus it is important to define good Markov chains which are easy to simulate and for which rates of convergence can be studied.
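The abstract defines its tree chains by analogy with transposition chains on permutations. As a rough illustration of that analogy only, the following is a minimal sketch of a lazy adjacent-transposition chain on permutations. It is not the paper's chain on ranked tree shapes, and the function names, the laziness probability of 1/2, and the seeding scheme are assumptions made for this example.

```python
import random


def adjacent_swap_step(perm, rng):
    """One step of a lazy adjacent-transposition chain on permutations.

    Illustrative analogue only: the chain in the abstract acts on ranked
    tree shapes, not on plain permutations.
    """
    new = list(perm)
    if len(new) < 2:
        return new  # nothing to swap
    i = rng.randrange(len(new) - 1)  # index of the left element of the pair
    if rng.random() < 0.5:           # lazy step: keep the state half the time
        new[i], new[i + 1] = new[i + 1], new[i]
    return new


def run_chain(perm, steps, seed=0):
    """Run the chain for `steps` steps from `perm` (hypothetical helper)."""
    rng = random.Random(seed)
    state = list(perm)
    for _ in range(steps):
        state = adjacent_swap_step(state, rng)
    return state


if __name__ == "__main__":
    print(run_chain([0, 1, 2, 3, 4], steps=1000, seed=1))
```

Each step touches only two neighbouring positions, which is what makes such chains easy to simulate. The price is slow mixing, which is why bounds such as those quoted in the abstract are of interest.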
5
Cappello L, Kim J, Liu S, Palacios JA. Statistical Challenges in Tracking the Evolution of SARS-CoV-2. Stat Sci 2022; 37:162-182. [PMID: 36034090 PMCID: PMC9409356 DOI: 10.1214/22-sts853] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
Abstract
Genomic surveillance of SARS-CoV-2 has been instrumental in tracking the spread and evolution of the virus during the pandemic. The availability of SARS-CoV-2 molecular sequences isolated from infected individuals, coupled with phylodynamic methods, have provided insights into the origin of the virus, its evolutionary rate, the timing of introductions, the patterns of transmission, and the rise of novel variants that have spread through populations. Despite enormous global efforts of governments, laboratories, and researchers to collect and sequence molecular data, many challenges remain in analyzing and interpreting the data collected. Here, we describe the models and methods currently used to monitor the spread of SARS-CoV-2, discuss long-standing and new statistical challenges, and propose a method for tracking the rise of novel variants during the epidemic.
Affiliation(s)
- Lorenzo Cappello
  - Lorenzo Cappello is Assistant Professor, Departments of Economics and Business, Universitat Pompeu Fabra, 08005, Spain
- Jaehee Kim
  - Jaehee Kim is Assistant Professor, Department of Computational Biology, Cornell University, Ithaca, New York 14853, USA
- Sifan Liu
  - Sifan Liu is a Ph.D. student, Department of Statistics, Stanford University, Stanford, California 94305, USA
- Julia A. Palacios
  - Julia A. Palacios is Assistant Professor, Departments of Statistics and Biomedical Data Sciences, Stanford University, Stanford, California 94305, USA
6
Didelot X, Siveroni I, Volz EM. Additive Uncorrelated Relaxed Clock Models for the Dating of Genomic Epidemiology Phylogenies. Mol Biol Evol 2021; 38:307-317. [PMID: 32722797 PMCID: PMC8480190 DOI: 10.1093/molbev/msaa193] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Indexed: 12/20/2022]
Abstract
Phylogenetic dating is one of the most powerful and commonly used methods of drawing epidemiological interpretations from pathogen genomic data. Building such trees requires considering a molecular clock model which represents the rate at which substitutions accumulate on genomes. When the molecular clock rate is constant throughout the tree then the clock is said to be strict, but this is often not an acceptable assumption. Alternatively, relaxed clock models consider variations in the clock rate, often based on a distribution of rates for each branch. However, we show here that the distributions of rates across branches in commonly used relaxed clock models are incompatible with the biological expectation that the sum of the numbers of substitutions on two neighboring branches should be distributed as the substitution number on a single branch of equivalent length. We call this expectation the additivity property. We further show how assumptions of commonly used relaxed clock models can lead to estimates of evolutionary rates and dates with low precision and biased confidence intervals. We therefore propose a new additive relaxed clock model where the additivity property is satisfied. We illustrate the use of our new additive relaxed clock model on a range of simulated and real data sets, and we show that using this new model leads to more accurate estimates of mean evolutionary rates and ancestral dates.
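The additivity property described above can be illustrated with a small moment calculation. The sketch below is an assumption-laden toy, not the parametrisation used in the paper. It looks at the branch "rate times length" quantity rather than at substitution counts, and compares two gamma-based constructions with hypothetical names `per_branch_rate_moments` and `additive_moments`. When one gamma rate is drawn per branch, splitting a branch in two changes the variance of the total. When the branch total is gamma-distributed with shape proportional to length, the total is unchanged, because independent gamma variables with a common scale sum to a gamma variable whose shape is the sum of the shapes.

```python
def per_branch_rate_moments(length, mean_rate, shape):
    """Mean and variance of rate * length when one rate per branch is drawn
    from Gamma(shape, scale=mean_rate / shape).

    Toy construction for illustration; the parameter names are assumptions.
    """
    mean = mean_rate * length
    var = (length ** 2) * (mean_rate ** 2) / shape
    return mean, var


def additive_moments(length, mean_rate, shape):
    """Mean and variance when the branch total itself is drawn from
    Gamma(shape * length, scale=mean_rate / shape)."""
    scale = mean_rate / shape
    mean = shape * length * scale
    var = shape * length * scale ** 2
    return mean, var


def sum_independent(a, b):
    """Moments of the sum of two independent variables given (mean, var) pairs."""
    return a[0] + b[0], a[1] + b[1]


if __name__ == "__main__":
    l1, l2, mean_rate, shape = 1.0, 3.0, 2.0, 4.0
    for name, f in [("per-branch rate", per_branch_rate_moments),
                    ("additive", additive_moments)]:
        split = sum_independent(f(l1, mean_rate, shape), f(l2, mean_rate, shape))
        whole = f(l1 + l2, mean_rate, shape)
        print(name, "two branches:", split, "one branch:", whole)
```

Matching the first two moments is only a necessary check. For the shape-proportional construction, the full distributions agree by the gamma summation property mentioned above.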
Affiliation(s)
- Xavier Didelot
  - School of Life Sciences, University of Warwick, Coventry, United Kingdom
  - Department of Statistics, University of Warwick, Coventry, United Kingdom
- Igor Siveroni
  - Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, London, United Kingdom
- Erik M Volz
  - Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, London, United Kingdom
7
Wang S, Wang L. Particle Gibbs sampling for Bayesian phylogenetic inference. Bioinformatics 2021; 37:642-649. [PMID: 33045053 DOI: 10.1093/bioinformatics/btaa867] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/16/2020] [Revised: 08/10/2020] [Accepted: 09/24/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION The combinatorial sequential Monte Carlo (CSMC) has been demonstrated to be an efficient complementary method to the standard Markov chain Monte Carlo (MCMC) for Bayesian phylogenetic tree inference using biological sequences. It is appealing to combine the CSMC and MCMC in the framework of the particle Gibbs (PG) sampler to jointly estimate the phylogenetic trees and evolutionary parameters. However, the Markov chain of the PG may mix poorly for high dimensional problems (e.g. phylogenetic trees). Some remedies, including the PG with ancestor sampling and the interacting particle MCMC, have been proposed to improve the PG. But they either cannot be applied to or remain inefficient for the combinatorial tree space. RESULTS We introduce a novel CSMC method by proposing a more efficient proposal distribution. It also can be combined into the PG sampler framework to infer parameters in the evolutionary model. The new algorithm can be easily parallelized by allocating samples over different computing cores. We validate that the developed CSMC can sample trees more efficiently in various PG samplers via numerical experiments. AVAILABILITY AND IMPLEMENTATION The implementation of our method and the data underlying this article are available at https://github.com/liangliangwangsfu/phyloPMCMC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shijia Wang
  - School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Nankai Qu 300071, China
- Liangliang Wang
  - Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
8
Gill MS, Lemey P, Suchard MA, Rambaut A, Baele G. Online Bayesian Phylodynamic Inference in BEAST with Application to Epidemic Reconstruction. Mol Biol Evol 2020; 37:1832-1842. [PMID: 32101295 PMCID: PMC7253210 DOI: 10.1093/molbev/msaa047] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 12/15/2022]
Abstract
Reconstructing pathogen dynamics from genetic data as they become available during an outbreak or epidemic represents an important statistical scenario in which observations arrive sequentially in time and one is interested in performing inference in an "online" fashion. Widely used Bayesian phylogenetic inference packages are not set up for this purpose, generally requiring one to recompute trees and evolutionary model parameters de novo when new data arrive. To accommodate increasing data flow in a Bayesian phylogenetic framework, we introduce a methodology to efficiently update the posterior distribution with newly available genetic data. Our procedure is implemented in the BEAST 1.10 software package, and relies on a distance-based measure to insert new taxa into the current estimate of the phylogeny and imputes plausible values for new model parameters to accommodate growing dimensionality. This augmentation creates informed starting values and re-uses optimally tuned transition kernels for posterior exploration of growing data sets, reducing the time necessary to converge to target posterior distributions. We apply our framework to data from the recent West African Ebola virus epidemic and demonstrate a considerable reduction in time required to obtain posterior estimates at different time points of the outbreak. Beyond epidemic monitoring, this framework easily finds other applications within the phylogenetics community, where changes in the data (in terms of alignment changes, sequence addition or removal) present common scenarios that can benefit from online inference.
Affiliation(s)
- Mandev S Gill
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Marc A Suchard
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA
  - Department of Biostatistics, School of Public Health, University of California, Los Angeles, CA
  - Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, CA
- Andrew Rambaut
  - Institute of Evolutionary Biology, University of Edinburgh, United Kingdom
  - Fogarty International Center, National Institutes of Health, Bethesda, MD
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
9
Oaks JR, Cobb KA, Minin VN, Leaché AD. Marginal Likelihoods in Phylogenetics: A Review of Methods and Applications. Syst Biol 2019; 68:681-697. [PMID: 30668834 PMCID: PMC6701458 DOI: 10.1093/sysbio/syz003] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/27/2018] [Revised: 01/14/2019] [Accepted: 01/15/2019] [Indexed: 11/29/2022]
Abstract
By providing a framework of accounting for the shared ancestry inherent to all life, phylogenetics is becoming the statistical foundation of biology. The importance of model choice continues to grow as phylogenetic models continue to increase in complexity to better capture micro- and macroevolutionary processes. In a Bayesian framework, the marginal likelihood is how data update our prior beliefs about models, which gives us an intuitive measure of comparing model fit that is grounded in probability theory. Given the rapid increase in the number and complexity of phylogenetic models, methods for approximating marginal likelihoods are increasingly important. Here, we try to provide an intuitive description of marginal likelihoods and why they are important in Bayesian model testing. We also categorize and review methods for estimating marginal likelihoods of phylogenetic models, highlighting several recent methods that provide well-behaved estimates. Furthermore, we review some empirical studies that demonstrate how marginal likelihoods can be used to learn about models of evolution from biological data. We discuss promising alternatives that can complement marginal likelihoods for Bayesian model choice, including posterior-predictive methods. Using simulations, we find one alternative method based on approximate-Bayesian computation to be biased. We conclude by discussing the challenges of Bayesian model choice and future directions that promise to improve the approximation of marginal likelihoods and Bayesian phylogenetics as a whole.
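For readers who want the quantity in symbols, the marginal likelihood and its role in updating beliefs about models can be written as follows. This is the standard textbook formulation rather than notation taken from the review; the symbols $D$ (data), $M$ (model), and $\theta$ (model parameters) are assumptions introduced here.

```latex
% Marginal likelihood: the likelihood averaged over the prior on parameters
p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta

% Posterior odds of two models = Bayes factor x prior odds
\frac{p(M_1 \mid D)}{p(M_2 \mid D)}
  = \frac{p(D \mid M_1)}{p(D \mid M_2)} \cdot \frac{p(M_1)}{p(M_2)}
```

The ratio of marginal likelihoods in the second line is the Bayes factor. The integral over $\theta$ is generally intractable for phylogenetic models, which is why approximation methods of the kind reviewed here are needed.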
Affiliation(s)
- Jamie R Oaks
  - Department of Biological Sciences and Museum of Natural History, Auburn University, Auburn, AL 36849, USA
  - Correspondence to be sent to: Department of Biological Sciences and Museum of Natural History, Auburn University, Auburn, AL 36849, USA; E-mail:
- Kerry A Cobb
  - Department of Biological Sciences and Museum of Natural History, Auburn University, Auburn, AL 36849, USA
- Vladimir N Minin
  - Department of Statistics, University of California, Irvine, CA 92697, USA
- Adam D Leaché
  - Department of Biology and Burke Museum of Natural History and Culture, University of Washington, Seattle, WA 98195, USA
10
Everitt RG, Culliford R, Medina-Aguayo F, Wilson DJ. Sequential Monte Carlo with transformations. Stat Comput 2019; 30:663-676. [PMID: 32116416 PMCID: PMC7026014 DOI: 10.1007/s11222-019-09903-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/17/2018] [Accepted: 09/03/2019] [Indexed: 06/10/2023]
Abstract
This paper examines methodology for performing Bayesian inference sequentially on a sequence of posteriors on spaces of different dimensions. For this, we use sequential Monte Carlo samplers, introducing the innovation of using deterministic transformations to move particles effectively between target distributions with different dimensions. This approach, combined with adaptive methods, yields an extremely flexible and general algorithm for Bayesian model comparison that is suitable for use in applications where the acceptance rate in reversible jump Markov chain Monte Carlo is low. We use this approach on model comparison for mixture models, and for inferring coalescent trees sequentially, as data arrives.
Affiliation(s)
- Richard Culliford
  - Department of Mathematics and Statistics, University of Reading, Reading, UK
- Daniel J. Wilson
  - Nuffield Department of Medicine, University of Oxford, Oxford, UK
11
Wang L, Wang S, Bouchard-Côté A. An Annealed Sequential Monte Carlo Method for Bayesian Phylogenetics. Syst Biol 2019; 69:155-183. [DOI: 10.1093/sysbio/syz028] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 06/01/2018] [Revised: 04/12/2019] [Accepted: 04/20/2019] [Indexed: 01/07/2023]
Abstract
We describe an “embarrassingly parallel” method for Bayesian phylogenetic inference, annealed Sequential Monte Carlo (SMC), based on recent advances in the SMC literature such as adaptive determination of annealing parameters. The algorithm provides an approximate posterior distribution over trees and evolutionary parameters as well as an unbiased estimator for the marginal likelihood. This unbiasedness property can be used for the purpose of testing the correctness of posterior simulation software. We evaluate the performance of phylogenetic annealed SMC by reviewing and comparing with other computational Bayesian phylogenetic methods, in particular, different marginal likelihood estimation methods. Unlike previous SMC methods in phylogenetics, our annealed method can utilize standard Markov chain Monte Carlo (MCMC) tree moves and hence benefit from the large inventory of such moves available in the literature. Consequently, the annealed SMC method should be relatively easy to incorporate into existing phylogenetic software packages based on MCMC algorithms. We illustrate our method using simulation studies and real data analysis.
Affiliation(s)
- Liangliang Wang
  - Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada
- Shijia Wang
  - Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada
- Alexandre Bouchard-Côté
  - Department of Statistics, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
12
Claywell BC, Dinh V, Fourment M, McCoy CO, Matsen IV FA. A Surrogate Function for One-Dimensional Phylogenetic Likelihoods. Mol Biol Evol 2019; 35:242-246. [PMID: 29029199 DOI: 10.1093/molbev/msx253] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
Phylogenetics has seen a steady increase in data set size and substitution model complexity, which require increasing amounts of computational power to compute likelihoods. This motivates strategies to approximate the likelihood functions for branch length optimization and Bayesian sampling. In this article, we develop an approximation to the 1D likelihood function as parametrized by a single branch length. Our method uses a four-parameter surrogate function abstracted from the simplest phylogenetic likelihood function, the binary symmetric model. We show that it offers a surrogate that can be fit over a variety of branch lengths, that it is applicable to a wide variety of models and trees, and that it can be used effectively as a proposal mechanism for Bayesian sampling. The method is implemented as a stand-alone open-source C library for calling from phylogenetics algorithms; it has proven essential for good performance of our online phylogenetic algorithm sts.
Affiliation(s)
- Brian C Claywell
  - Program in Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, WA
- Vu Dinh
  - Department of Mathematical Sciences, University of Delaware, Newark, DE
- Mathieu Fourment
  - ithree Institute, University of Technology Sydney, Ultimo, NSW, Australia
- Connor O McCoy
  - Program in Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, WA
- Frederick A Matsen IV
  - Program in Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, WA
13
Fourment M, Claywell BC, Dinh V, McCoy C, Matsen IV FA, Darling AE. Effective Online Bayesian Phylogenetics via Sequential Monte Carlo with Guided Proposals. Syst Biol 2018; 67:490-502. [PMID: 29186587 PMCID: PMC5920299 DOI: 10.1093/sysbio/syx090] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 06/02/2017] [Accepted: 11/20/2017] [Indexed: 11/14/2022]
Abstract
Modern infectious disease outbreak surveillance produces continuous streams of sequence data which require phylogenetic analysis as data arrives. Current software packages for Bayesian phylogenetic inference are unable to quickly incorporate new sequences as they become available, making them less useful for dynamically unfolding evolutionary stories. This limitation can be addressed by applying a class of Bayesian statistical inference algorithms called sequential Monte Carlo (SMC) to conduct online inference, wherein new data can be continuously incorporated to update the estimate of the posterior probability distribution. In this article, we describe and evaluate several different online phylogenetic sequential Monte Carlo (OPSMC) algorithms. We show that proposing new phylogenies with a density similar to the Bayesian prior suffers from poor performance, and we develop “guided” proposals that better match the proposal density to the posterior. Furthermore, we show that the simplest guided proposals can exhibit pathological behavior in some situations, leading to poor results, and that the situation can be resolved by heating the proposal density. The results demonstrate that relative to the widely used MCMC-based algorithm implemented in MrBayes, the total time required to compute a series of phylogenetic posteriors as sequences arrive can be significantly reduced by the use of OPSMC, without incurring a significant loss in accuracy.
Affiliation(s)
- Mathieu Fourment
  - ithree institute, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Vu Dinh
  - Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Connor McCoy
  - Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Aaron E Darling
  - ithree institute, University of Technology Sydney, Ultimo, NSW 2007, Australia
14
Baele G, Dellicour S, Suchard MA, Lemey P, Vrancken B. Recent advances in computational phylodynamics. Curr Opin Virol 2018; 31:24-32. [PMID: 30248578 DOI: 10.1016/j.coviro.2018.08.009] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 05/14/2018] [Revised: 08/16/2018] [Accepted: 08/20/2018] [Indexed: 01/02/2023]
Abstract
Time-stamped, trait-annotated phylogenetic trees built from virus genome data are increasingly used for outbreak investigation and monitoring ongoing epidemics. This routinely involves reconstructing the spatial and demographic processes from large data sets to help unveil the patterns and drivers of virus spread. Such phylodynamic inferences can however become quite time-consuming as the dimensions of the data increase, which has led to a myriad of approaches that aim to tackle this complexity. To elucidate the current state of the art in the field of phylodynamics, we discuss recent developments in Bayesian inference and accompanying software, highlight methods for improving computational efficiency and relevant visualisation tools. As an alternative to fully Bayesian approaches, we touch upon conditional software pipelines that compromise between statistical coherence and turn-around-time, and we highlight the available software packages. Finally, we outline future directions that may facilitate the large-scale tracking of epidemics in near real time.
Affiliation(s)
- Guy Baele
  - KU Leuven Department of Microbiology and Immunology, Rega Institute, Laboratory of Evolutionary and Computational Virology, Leuven, Belgium
- Simon Dellicour
  - KU Leuven Department of Microbiology and Immunology, Rega Institute, Laboratory of Evolutionary and Computational Virology, Leuven, Belgium
  - Spatial Epidemiology Lab (SpELL), Université Libre de Bruxelles, Bruxelles, Belgium
- Marc A Suchard
  - Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
  - Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Philippe Lemey
  - KU Leuven Department of Microbiology and Immunology, Rega Institute, Laboratory of Evolutionary and Computational Virology, Leuven, Belgium
- Bram Vrancken
  - KU Leuven Department of Microbiology and Immunology, Rega Institute, Laboratory of Evolutionary and Computational Virology, Leuven, Belgium