1
Magee AF, Holbrook AJ, Pekar JE, Caviedes-Solis IW, Matsen Iv FA, Baele G, Wertheim JO, Ji X, Lemey P, Suchard MA. Random-Effects Substitution Models for Phylogenetics via Scalable Gradient Approximations. Syst Biol 2024; 73:562-578. [PMID: 38712512 DOI: 10.1093/sysbio/syae019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 02/26/2024] [Accepted: 05/02/2024] [Indexed: 05/08/2024]
Abstract
Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random-effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.
Affiliation(s)
- Andrew F Magee
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA
- Andrew J Holbrook
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA
- Jonathan E Pekar
  - Bioinformatics and Systems Biology Graduate Program, University of California - San Diego, La Jolla, CA, USA
  - Department of Biomedical Informatics, University of California - San Diego, La Jolla, CA, USA
- Frederick A Matsen IV
  - Howard Hughes Medical Institute, Seattle, Washington, USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
  - Department of Statistics, University of Washington, Seattle, Washington, USA
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Joel O Wertheim
  - Department of Medicine, University of California - San Diego, La Jolla, CA, USA
- Xiang Ji
  - Department of Mathematics, Tulane University, New Orleans, LA, USA
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Marc A Suchard
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA
  - Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California - Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, University of California - Los Angeles, Los Angeles, CA, USA
2
Gangavarapu K, Ji X, Baele G, Fourment M, Lemey P, Matsen FA, Suchard MA. Many-core algorithms for high-dimensional gradients on phylogenetic trees. Bioinformatics 2024; 40:btae030. [PMID: 38243701 PMCID: PMC10868298 DOI: 10.1093/bioinformatics/btae030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 12/20/2023] [Accepted: 01/15/2024] [Indexed: 01/21/2024]
Abstract
MOTIVATION Advancements in high-throughput genomic sequencing are delivering genomic pathogen data at an unprecedented rate, positioning statistical phylogenetics as a critical tool to monitor infectious diseases globally. This rapid growth spurs the need for efficient inference techniques, such as Hamiltonian Monte Carlo (HMC) in a Bayesian framework, to estimate parameters of these phylogenetic models where the dimensions of the parameters increase with the number of sequences N. HMC requires repeated calculation of the gradient of the data log-likelihood with respect to (wrt) all branch-length-specific (BLS) parameters that traditionally takes O(N^2) operations using the standard pruning algorithm. A recent study proposes an approach to calculate this gradient in O(N), enabling researchers to take advantage of gradient-based samplers such as HMC. The CPU implementation of this approach makes the calculation of the gradient computationally tractable for nucleotide-based models but falls short in performance for larger state-space size models, such as Markov-modulated and codon models. Here, we describe novel massively parallel algorithms to calculate the gradient of the log-likelihood wrt all BLS parameters that take advantage of graphics processing units (GPUs) and result in many-fold higher speedups over previous CPU implementations.
RESULTS We benchmark these GPU algorithms on three computing systems using three evolutionary inference examples exploring complete genomes from 997 dengue viruses, 62 carnivore mitochondria and 49 yeasts, and observe a >128-fold speedup over the CPU implementation for codon-based models and >8-fold speedup for nucleotide-based models. As a practical demonstration, we also estimate the timing of the first introduction of West Nile virus into the continental United States under a codon model with a relaxed molecular clock from 104 full viral genomes, an inference task previously intractable.
AVAILABILITY AND IMPLEMENTATION We provide an implementation of our GPU algorithms in BEAGLE v4.0.0 (https://github.com/beagle-dev/beagle-lib), an open-source library for statistical phylogenetics that enables parallel calculations on multi-core CPUs and GPUs. We employ a BEAGLE-implementation using the Bayesian phylogenetics framework BEAST (https://github.com/beast-dev/beast-mcmc).
Affiliation(s)
- Karthik Gangavarapu
  - Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, Los Angeles, CA, United States
- Xiang Ji
  - Department of Mathematics, School of Science & Engineering, Tulane University, New Orleans, LA, United States
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Mathieu Fourment
  - Australian Institute for Microbiology and Infection, University of Technology Sydney, Ultimo, NSW, Australia
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Frederick A Matsen
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, United States
  - Department of Statistics, University of Washington, Seattle, WA, United States
  - Department of Genome Sciences, University of Washington, Seattle, WA, United States
  - Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, WA, United States
- Marc A Suchard
  - Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, Los Angeles, CA, United States
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California, Los Angeles, Los Angeles, CA, United States
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, United States
3
Bastide P, Didier G. The Cauchy Process on Phylogenies: A Tractable Model for Pulsed Evolution. Syst Biol 2023; 72:1296-1315. [PMID: 37603537 DOI: 10.1093/sysbio/syad053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Revised: 08/05/2023] [Accepted: 08/14/2023] [Indexed: 08/23/2023]
Abstract
Phylogenetic comparative methods use random processes, such as the Brownian Motion, to model the evolution of continuous traits on phylogenetic trees. Growing evidence for non-gradual evolution motivated the development of complex models, often based on Lévy processes. However, their statistical inference is computationally intensive and currently relies on approximations, high-dimensional sampling, or numerical integration. We consider here the Cauchy Process (CP), a particular pure-jump Lévy process in which the trait increment along each branch follows a centered Cauchy distribution with a dispersion proportional to its length. In this work, we derive an exact algorithm to compute both the joint probability density of the tip trait values of a phylogeny under a CP and the ancestral trait values and branch increments posterior densities in quadratic time. A simulation study shows that the CP generates patterns in comparative data that are distinct from any Gaussian process, and that restricted maximum likelihood parameter estimates and root trait reconstruction are unbiased and accurate for trees with 200 tips or less. The CP has only two parameters but is rich enough to capture complex-pulsed evolution. It can reconstruct posterior ancestral trait distributions that are multimodal, reflecting the uncertainty associated with the inference of the evolutionary history of a trait from extant taxa only. Applied on empirical datasets taken from the Evolutionary Ecology and Virology literature, the CP suggests nuanced scenarios for the body size evolution of Greater Antilles Lizards and for the geographical spread of the West Nile Virus epidemics in North America, both consistent with previous studies using more complex models. The method is efficiently implemented in C with an R interface in package cauphy, which is open source and freely available online.
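As a rough illustration of the process described in this abstract, the Python sketch below simulates trait values down a tree by adding, along each branch, a centered Cauchy increment whose scale is the branch length times a dispersion parameter. The tree encoding (`parent`, `branch_length`), the function name, and the parameter names are assumptions made for this example. It is a forward simulation only, not the exact likelihood algorithm of the paper and not the interface of the cauphy package.

```python
import math
import random


def simulate_cauchy_process(parent, branch_length, root_value=0.0,
                            dispersion=1.0, seed=None):
    """Simulate node trait values under a Cauchy process on a tree.

    parent[i] is the index of node i's parent (-1 for the root); nodes must
    be ordered so that every parent appears before its children.
    branch_length[i] is the length of the branch leading to node i.
    """
    rng = random.Random(seed)
    values = [0.0] * len(parent)
    for node, par in enumerate(parent):
        if par < 0:
            values[node] = root_value
            continue
        # Scale (dispersion) of the increment is proportional to branch length.
        scale = dispersion * branch_length[node]
        # Inverse-CDF draw from a standard Cauchy distribution.
        increment = scale * math.tan(math.pi * (rng.random() - 0.5))
        values[node] = values[par] + increment
    return values


# Hypothetical 3-tip tree: node 0 is the root, node 1 an internal node,
# nodes 2-4 are tips.
parent = [-1, 0, 0, 1, 1]
branch_length = [0.0, 1.0, 2.0, 0.5, 0.5]
traits = simulate_cauchy_process(parent, branch_length, seed=42)
```

Because Cauchy increments are heavy-tailed, repeated runs with different seeds will occasionally show one branch carrying a very large jump while the others barely move, which is the pulsed pattern the abstract refers to.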
Affiliation(s)
- Paul Bastide
  - IMAG, Université de Montpellier, CNRS, Montpellier, France
- Gilles Didier
  - IMAG, Université de Montpellier, CNRS, Montpellier, France
4
Ji X, Fisher AA, Su S, Thorne JL, Potter B, Lemey P, Baele G, Suchard MA. Scalable Bayesian Divergence Time Estimation With Ratio Transformations. Syst Biol 2023; 72:1136-1153. [PMID: 37458991 PMCID: PMC10636426 DOI: 10.1093/sysbio/syad039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2022] [Revised: 06/13/2023] [Accepted: 06/30/2023] [Indexed: 11/08/2023]
Abstract
Divergence time estimation is crucial to provide temporal signals for dating biologically important events from species divergence to viral transmissions in space and time. With the advent of high-throughput sequencing, recent Bayesian phylogenetic studies have analyzed hundreds to thousands of sequences. Such large-scale analyses challenge divergence time reconstruction by requiring inference on highly correlated internal node heights that often become computationally infeasible. To overcome this limitation, we explore a ratio transformation that maps the original $N-1$ internal node heights into a space of one height parameter and $N-2$ ratio parameters. To make the analyses scalable, we develop a collection of linear-time algorithms to compute the gradient and Jacobian-associated terms of the log-likelihood with respect to these ratios. We then apply Hamiltonian Monte Carlo sampling with the ratio transform in a Bayesian framework to learn the divergence times in 4 pathogenic viruses (West Nile virus, rabies virus, Lassa virus, and Ebola virus) and the coralline red algae. Our method both resolves a mixing issue in the West Nile virus example and improves inference efficiency by at least 5-fold for the Lassa and rabies virus examples as well as for the algae example. Our method now also makes it computationally feasible to incorporate mixed-effects molecular clock models for the Ebola virus example, confirms the findings from the original study, and reveals clearer multimodal distributions of the divergence times of some clades of interest.
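The idea of a ratio transformation can be illustrated in its simplest setting. The Python sketch below assumes that all tips are sampled at the same time (height 0), so that each non-root internal node height can be written as a fraction of its parent's height. The treatment of serially sampled tips in the paper is more involved and is not reproduced here. The function names and the tree encoding are hypothetical.

```python
def heights_to_ratios(parent, height):
    """Map internal node heights to (root height, ratios).

    parent[i] is the index of the parent internal node of internal node i
    (-1 for the root); parents must be listed before their children.
    Assumes contemporaneous tips, so ratio = height / parent height.
    """
    root_height = None
    ratios = {}
    for node, par in enumerate(parent):
        if par < 0:
            root_height = height[node]
        else:
            ratios[node] = height[node] / height[par]
    return root_height, ratios


def ratios_to_heights(parent, root_height, ratios):
    """Invert the transform by walking from the root toward the tips."""
    height = [0.0] * len(parent)
    for node, par in enumerate(parent):
        height[node] = root_height if par < 0 else ratios[node] * height[par]
    return height


# Hypothetical balanced 4-tip tree: N - 1 = 3 internal nodes,
# giving 1 height parameter and N - 2 = 2 ratio parameters.
parent = [-1, 0, 0]
height = [10.0, 4.0, 5.0]
root_height, ratios = heights_to_ratios(parent, height)
recovered = ratios_to_heights(parent, root_height, ratios)
```

In this simplified setting every ratio lies in (0, 1], so the constraint that a child node is younger than its parent is satisfied automatically by any choice of ratios, which is one reason such a parameterization is convenient for gradient-based samplers.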
Affiliation(s)
- Xiang Ji
  - Department of Mathematics, School of Science & Engineering, Tulane University, 6823 St. Charles Avenue, New Orleans, LA 70118, USA
- Alexander A Fisher
  - Department of Statistical Science, Duke University, 214 Old Chemistry, Durham, NC 27708, USA
- Shuo Su
  - MOE International Joint Collaborative Research Laboratory for Animal Health & Food Safety, Jiangsu Engineering Laboratory of Animal Immunology, Institute of Immunology, College of Veterinary Medicine, Nanjing Agricultural University, No. 1 Weigang, Xiaolingwei District, Nanjing, Jiangsu 210095, China
- Jeffrey L Thorne
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
  - Department of Statistics, North Carolina State University, Raleigh, NC, USA
  - Department of Biological Sciences, North Carolina State University, Ricks Hall, 1 Lampe Dr, Raleigh, NC 27607, USA
- Barney Potter
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, Herestraat 49, 3000 Leuven, Belgium
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, Herestraat 49, 3000 Leuven, Belgium
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, Herestraat 49, 3000 Leuven, Belgium
- Marc A Suchard
  - Department of Biomathematics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA
  - Department of Biostatistics, Fielding School of Public Health, University of California Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA 90095, USA
5
Pekar JE, Lytras S, Ghafari M, Magee AF, Parker E, Havens JL, Katzourakis A, Vasylyeva TI, Suchard MA, Hughes AC, Hughes J, Robertson DL, Dellicour S, Worobey M, Wertheim JO, Lemey P. The recency and geographical origins of the bat viruses ancestral to SARS-CoV and SARS-CoV-2. bioRxiv 2023:2023.07.12.548617. [PMID: 37502985 PMCID: PMC10369958 DOI: 10.1101/2023.07.12.548617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
The emergence of SARS-CoV in 2002 and SARS-CoV-2 in 2019 has led to increased sampling of related sarbecoviruses circulating primarily in horseshoe bats. These viruses undergo frequent recombination and exhibit spatial structuring across Asia. Employing recombination-aware phylogenetic inference on bat sarbecoviruses, we find that the closest-inferred bat virus ancestors of SARS-CoV and SARS-CoV-2 existed just ~1-3 years prior to their emergence in humans. Phylogeographic analyses examining the movement of related sarbecoviruses demonstrate that they traveled at similar rates to their horseshoe bat hosts and have been circulating for thousands of years in Asia. The closest-inferred bat virus ancestor of SARS-CoV likely circulated in western China, and that of SARS-CoV-2 likely circulated in a region comprising southwest China and northern Laos, both a substantial distance from where they emerged. This distance and recency indicate that the direct ancestors of SARS-CoV and SARS-CoV-2 could not have reached their respective sites of emergence via the bat reservoir alone. Our recombination-aware dating and phylogeographic analyses reveal a more accurate inference of evolutionary history than performing only whole-genome or single gene analyses. These results can guide future sampling efforts and demonstrate that viral genomic fragments extremely closely related to SARS-CoV and SARS-CoV-2 were circulating in horseshoe bats, confirming their importance as the reservoir species for SARS viruses.
Affiliation(s)
- Jonathan E Pekar
  - Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093, USA
  - These authors contributed equally
- Spyros Lytras
  - Medical Research Council-University of Glasgow Centre for Virus Research, Glasgow, UK
  - These authors contributed equally
- Mahan Ghafari
  - Department of Biology, University of Oxford, Oxford, UK
- Andrew F Magee
  - Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, USA
- Edyth Parker
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Jennifer L Havens
  - Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA 92093, USA
- Tetyana I Vasylyeva
  - Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
- Marc A Suchard
  - Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Biostatistics, Fielding School of Public Health, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, USA
- Alice C Hughes
  - School of Biological Sciences, University of Hong Kong, Hong Kong
  - China Biodiversity Green Development Foundation, Beijing, China
- Joseph Hughes
  - Medical Research Council-University of Glasgow Centre for Virus Research, Glasgow, UK
- David L Robertson
  - Medical Research Council-University of Glasgow Centre for Virus Research, Glasgow, UK
  - These authors jointly supervised the work
- Simon Dellicour
  - Spatial Epidemiology Lab (SpELL), Université Libre de Bruxelles, CP160/12, 50 av. FD Roosevelt, 1050, Bruxelles, Belgium
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, Laboratory for Clinical and Epidemiological Virology, KU Leuven, Leuven, Belgium
  - These authors jointly supervised the work
- Michael Worobey
  - Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA
  - These authors jointly supervised the work
- Joel O Wertheim
  - Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
  - These authors jointly supervised the work
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, Laboratory for Clinical and Epidemiological Virology, KU Leuven, Leuven, Belgium
  - These authors jointly supervised the work
6
Fourment M, Swanepoel CJ, Galloway JG, Ji X, Gangavarapu K, Suchard MA, Matsen IV FA. Automatic Differentiation is no Panacea for Phylogenetic Gradient Computation. Genome Biol Evol 2023; 15:evad099. [PMID: 37265233 PMCID: PMC10282121 DOI: 10.1093/gbe/evad099] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/23/2023] [Revised: 05/23/2023] [Accepted: 05/25/2023] [Indexed: 06/03/2023]
Abstract
Gradients of probabilistic model likelihoods with respect to their parameters are essential for modern computational statistics and machine learning. These calculations are readily available for arbitrary models via "automatic differentiation" implemented in general-purpose machine-learning libraries such as TensorFlow and PyTorch. Although these libraries are highly optimized, it is not clear if their general-purpose nature will limit their algorithmic complexity or implementation speed for the phylogenetic case compared to phylogenetics-specific code. In this paper, we compare six gradient implementations of the phylogenetic likelihood functions, in isolation and also as part of a variational inference procedure. We find that although automatic differentiation can scale approximately linearly in tree size, it is much slower than the carefully implemented gradient calculation for tree likelihood and ratio transformation operations. We conclude that a mixed approach combining phylogenetic libraries with machine learning libraries will provide the optimal combination of speed and model flexibility moving forward.
Affiliation(s)
- Mathieu Fourment
  - Australian Institute for Microbiology and Infection, University of Technology Sydney, Ultimo, NSW, Australia
- Christiaan J Swanepoel
  - Centre for Computational Evolution, The University of Auckland, Auckland, New Zealand
  - School of Computer Science, The University of Auckland, Auckland, New Zealand
- Jared G Galloway
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Xiang Ji
  - Department of Mathematics, Tulane University, New Orleans, Louisiana, USA
- Karthik Gangavarapu
  - Department of Human Genetics, University of California, Los Angeles, California, USA
- Marc A Suchard
  - Department of Human Genetics, University of California, Los Angeles, California, USA
  - Department of Computational Medicine, University of California, Los Angeles, California, USA
  - Department of Biostatistics, University of California, Los Angeles, California, USA
- Frederick A Matsen IV
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  - Department of Statistics, University of Washington, Seattle, Washington, USA
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
  - Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
7
Martin BS, Bradburd GS, Harmon LJ, Weber MG. Modeling the Evolution of Rates of Continuous Trait Evolution. Syst Biol 2022:6830631. [PMID: 36380474 DOI: 10.1093/sysbio/syac068] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/25/2022] [Indexed: 11/17/2022]
Abstract
Rates of phenotypic evolution vary markedly across the tree of life, from the accelerated evolution apparent in adaptive radiations to the remarkable evolutionary stasis exhibited by so-called "living fossils". Such rate variation has important consequences for large-scale evolutionary dynamics, generating vast disparities in phenotypic diversity across space, time, and taxa. Despite this, most methods for estimating trait evolution rates assume rates vary deterministically with respect to some variable of interest or change infrequently during a clade's history. These assumptions may cause underfitting of trait evolution models and mislead hypothesis testing. Here, we develop a new trait evolution model that allows rates to vary gradually and stochastically across a clade. Further, we extend this model to accommodate generally decreasing or increasing rates over time, allowing for flexible modeling of "early/late bursts" of trait evolution. We implement a Bayesian method, termed "evolving rates" (evorates for short), to efficiently fit this model to comparative data. Through simulation, we demonstrate that evorates can reliably infer both how and in which lineages trait evolution rates varied during a clade's history. We apply this method to body size evolution in cetaceans, recovering substantial support for an overall slowdown in body size evolution over time with recent bursts among some oceanic dolphins and relative stasis among beaked whales of the genus Mesoplodon. These results unify and expand on previous research, demonstrating the empirical utility of evorates.
Affiliation(s)
- B S Martin
  - Department of Plant Biology, Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI 48824, USA
- G S Bradburd
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
- L J Harmon
  - Department of Biological Sciences, Institute for Bioinformatics and Evolutionary Studies (IBEST), University of Idaho, Moscow, ID 83843, USA
- M G Weber
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
8
Fisher AA, Hassler GW, Ji X, Baele G, Suchard MA, Lemey P. Scalable Bayesian phylogenetics. Philos Trans R Soc Lond B Biol Sci 2022; 377:20210242. [PMID: 35989603 PMCID: PMC9393558 DOI: 10.1098/rstb.2021.0242] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/20/2021] [Accepted: 04/20/2022] [Indexed: 02/01/2023]
Abstract
Recent advances in Bayesian phylogenetics offer substantial computational savings to accommodate increased genomic sampling that challenges traditional inference methods. In this review, we begin with a brief summary of the Bayesian phylogenetic framework, and then conceptualize a variety of methods to improve posterior approximations via Markov chain Monte Carlo (MCMC) sampling. Specifically, we discuss methods to improve the speed of likelihood calculations, reduce MCMC burn-in, and generate better MCMC proposals. We apply several of these techniques to study the evolution of HIV virulence along a 1536-tip phylogeny and estimate the internal node heights of a 1000-tip SARS-CoV-2 phylogenetic tree in order to illustrate the speed-up of such analyses using current state-of-the-art approaches. We conclude our review with a discussion of promising alternatives to MCMC that approximate the phylogenetic posterior. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
Affiliation(s)
- Gabriel W. Hassler
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, University of California, Los Angeles, CA 90095, USA
- Xiang Ji
  - Department of Mathematics, School of Science and Engineering, Tulane University, New Orleans, LA 70118, USA
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, 3000 Leuven, Belgium
- Marc A. Suchard
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, University of California, Los Angeles, CA 90095, USA
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California, Los Angeles, CA 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, CA 90095, USA
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, 3000 Leuven, Belgium
9
Hassler GW, Gallone B, Aristide L, Allen WL, Tolkoff MR, Holbrook AJ, Baele G, Lemey P, Suchard MA. Principled, practical, flexible, fast: a new approach to phylogenetic factor analysis. Methods Ecol Evol 2022; 13:2181-2197. [PMID: 36908682 PMCID: PMC9997680 DOI: 10.1111/2041-210x.13920] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]
Abstract
Biological phenotypes are products of complex evolutionary processes in which selective forces influence multiple biological trait measurements in unknown ways. Phylogenetic comparative methods seek to disentangle these relationships across the evolutionary history of a group of organisms. Unfortunately, most existing methods fail to accommodate high-dimensional data with dozens or even thousands of observations per taxon. Phylogenetic factor analysis offers a solution to the challenge of dimensionality. However, scientists seeking to employ this modeling framework confront numerous modeling and implementation decisions, the details of which pose computational and replicability challenges. We develop new inference techniques that increase both the computational efficiency and modeling flexibility of phylogenetic factor analysis. To facilitate adoption of these new methods, we present a practical analysis plan that guides researchers through the web of complex modeling decisions. We codify this analysis plan in an automated pipeline that distills the potentially overwhelming array of decisions into a small handful of (typically binary) choices. We demonstrate the utility of these methods and analysis plan in four real-world problems of varying scales. Specifically, we study floral phenotype and pollination in columbines, domestication in industrial yeast, life history in mammals, and brain morphology in New World monkeys. General and impactful community employment of these methods requires a data scientific analysis plan that balances flexibility, speed and ease of use, while minimizing model and algorithm tuning. Even in the presence of non-trivial phylogenetic model constraints, we show that one may analytically address latent factor uncertainty in a way that (a) aids model flexibility, (b) accelerates computation (by as much as 500-fold) and (c) decreases required tuning.
These efforts coalesce to create an accessible Bayesian approach to high-dimensional phylogenetic comparative methods on large trees.
Affiliation(s)
- Gabriel W. Hassler
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, University of California, Los Angeles, United States
| | | | - Leandro Aristide
- Ecole Normale Superieure Paris Sciences et Lettres Research University, Institut de Biologie de l’Ecole Normale Superieure, Paris, France
| | - William L. Allen
- Department of Biosciences, Swansea University, Swansea, United Kingdom
| | - Max R. Tolkoff
- Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California, Los Angeles, United States
| | - Andrew J. Holbrook
- Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California, Los Angeles, United States
| | - Guy Baele
- Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium
| | - Philippe Lemey
- Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium
| | - Marc A. Suchard
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, University of California, Los Angeles, United States
- Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California, Los Angeles, United States
- Department of Human Genetics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, United States
|
10
|
Hassler GW, Magee A, Zhang Z, Baele G, Lemey P, Ji X, Fourment M, Suchard MA. Data integration in Bayesian phylogenetics. ANNUAL REVIEW OF STATISTICS AND ITS APPLICATION 2022; 10:353-377. [PMID: 38774036 PMCID: PMC11108065 DOI: 10.1146/annurev-statistics-033021-112532] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/24/2024]
Abstract
Researchers studying the evolution of viral pathogens and other organisms increasingly encounter and use large and complex data sets from multiple different sources. Statistical research in Bayesian phylogenetics has risen to this challenge. Researchers use phylogenetics not only to reconstruct the evolutionary history of a group of organisms, but also to understand the processes that guide its evolution and spread through space and time. To this end, it is now the norm to integrate numerous sources of data. For example, epidemiologists studying the spread of a virus through a region incorporate data including genetic sequences (e.g. DNA), time, location (both continuous and discrete) and environmental covariates (e.g. social connectivity between regions) into a coherent statistical model. Evolutionary biologists routinely do the same with genetic sequences, location, time, fossil and modern phenotypes, and ecological covariates. These complex, hierarchical models readily accommodate both discrete and continuous data and have enormous combined discrete/continuous parameter spaces including, at a minimum, phylogenetic tree topologies and branch lengths. The increased size and complexity of these statistical models have spurred advances in computational methods to make them tractable. We discuss both the modeling and computational advances below, as well as unsolved problems and areas of active research.
Affiliation(s)
- Gabriel W Hassler
- Department of Computational Medicine, University of California, Los Angeles, USA, 90095
| | - Andrew Magee
- Department of Biostatistics, University of California, Los Angeles, USA, 90095
| | - Zhenyu Zhang
- Department of Biostatistics, University of California, Los Angeles, USA, 90095
| | - Guy Baele
- Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium, 3000
| | - Philippe Lemey
- Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium, 3000
| | - Xiang Ji
- Department of Mathematics, Tulane University, New Orleans, USA, 70118
| | - Mathieu Fourment
- Australian Institute for Microbiology and Infection, University of Technology Sydney, Ultimo NSW, Australia, 2007
| | - Marc A Suchard
- Department of Computational Medicine, University of California, Los Angeles, USA, 90095
- Department of Biostatistics, University of California, Los Angeles, USA, 90095
- Department of Human Genetics, University of California, Los Angeles, USA, 90095
|
11
|
Grundler MC, Rabosky DL, Zapata F. Fast Likelihood Calculations for Automatic Identification of Macroevolutionary Rate Heterogeneity in Continuous and Discrete Traits. Syst Biol 2022; 71:1307-1318. [DOI: 10.1093/sysbio/syac035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2021] [Revised: 04/28/2022] [Accepted: 05/06/2022] [Indexed: 11/13/2022] Open
Abstract
Understanding phenotypic disparity across the tree of life requires identifying where and when evolutionary rates change on phylogeny. A primary methodological challenge in macroevolution is therefore to develop methods for accurate inference of among-lineage variation in rates of phenotypic evolution. Here, we describe a method for inferring among-lineage evolutionary rate heterogeneity in both continuous and discrete traits. The method assumes that the present-day distribution of a trait is shaped by a variable-rate process arising from a mixture of constant-rate processes and uses a single-pass tree traversal algorithm to estimate branch-specific evolutionary rates. By employing dynamic programming optimization techniques and approximate maximum likelihood estimators where appropriate, our method permits rapid exploration of the tempo and mode of phenotypic evolution. Simulations indicate that the method reconstructs rates of trait evolution with high accuracy. Application of the method to datasets on squamate reptile reproduction and turtle body size recovers patterns of rate heterogeneity identified by previous studies but with computational costs reduced by many orders of magnitude. Our results expand the set of tools available for detecting macroevolutionary rate heterogeneity and point to the utility of fast, approximate methods for studying large scale biodiversity dynamics.
Affiliation(s)
- Michael C Grundler
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095, USA
| | - Daniel L Rabosky
- Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
| | - Felipe Zapata
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095, USA
|
12
|
Holbrook AJ, Ji X, Suchard MA. From viral evolution to spatial contagion: a biologically modulated Hawkes model. Bioinformatics 2022; 38:1846-1856. [PMID: 35040956 PMCID: PMC8963291 DOI: 10.1093/bioinformatics/btac027] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/10/2021] [Revised: 12/11/2021] [Accepted: 01/12/2022] [Indexed: 02/04/2023] Open
Abstract
SUMMARY: Mutations sometimes increase contagiousness for evolving pathogens. During an epidemic, scientists use viral genome data to infer a shared evolutionary history and connect this history to geographic spread. We propose a model that directly relates a pathogen's evolution to its spatial contagion dynamics, effectively combining the two epidemiological paradigms of phylogenetic inference and self-exciting process modeling, and apply this phylogenetic Hawkes process to a Bayesian analysis of 23 421 viral cases from the 2014 to 2016 Ebola outbreak in West Africa. The proposed model is able to detect individual viruses with significantly elevated rates of spatiotemporal propagation for a subset of 1610 samples that provide genome data. Finally, to facilitate model application in big data settings, we develop massively parallel implementations for the gradient and Hessian of the log-likelihood and apply our high-performance computing framework within an adaptively pre-conditioned Hamiltonian Monte Carlo routine. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Andrew J Holbrook
- Department of Biostatistics, University of California, Los Angeles, CA 90095, USA
| | - Xiang Ji
- Department of Mathematics, Tulane University, New Orleans, LA 70118, USA
| | - Marc A Suchard
- Department of Biostatistics, University of California, Los Angeles, CA 90095, USA
- Department of Biomathematics
- Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
|
13
|
Bastide P, Ho LST, Baele G, Lemey P, Suchard MA. Efficient Bayesian inference of general Gaussian models on large phylogenetic trees. Ann Appl Stat 2021. [DOI: 10.1214/20-aoas1419] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Affiliation(s)
| | - Lam Si Tung Ho
- Department of Mathematics and Statistics, Dalhousie University
| | - Guy Baele
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven
| | - Philippe Lemey
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven
| | - Marc A. Suchard
- Departments of Biostatistics, Biomathematics, and Human Genetics, University of California, Los Angeles
|
14
|
Dellicour S, Lequime S, Vrancken B, Gill MS, Bastide P, Gangavarapu K, Matteson NL, Tan Y, du Plessis L, Fisher AA, Nelson MI, Gilbert M, Suchard MA, Andersen KG, Grubaugh ND, Pybus OG, Lemey P. Epidemiological hypothesis testing using a phylogeographic and phylodynamic framework. Nat Commun 2020; 11:5620. [PMID: 33159066 PMCID: PMC7648063 DOI: 10.1038/s41467-020-19122-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 03/11/2020] [Accepted: 09/30/2020] [Indexed: 01/05/2023] Open
Abstract
Computational analyses of pathogen genomes are increasingly used to unravel the dispersal history and transmission dynamics of epidemics. Here, we show how to go beyond historical reconstructions and use spatially-explicit phylogeographic and phylodynamic approaches to formally test epidemiological hypotheses. We illustrate our approach by focusing on the West Nile virus (WNV) spread in North America that has substantially impacted public, veterinary, and wildlife health. We apply an analytical workflow to a comprehensive WNV genome collection to test the impact of environmental factors on the dispersal of viral lineages and on viral population genetic diversity through time. We find that WNV lineages tend to disperse faster in areas with higher temperatures and we identify temporal variation in temperature as a main predictor of viral genetic diversity through time. By contrasting inference with simulation, we find no evidence for viral lineages to preferentially circulate within the same migratory bird flyway, suggesting a substantial role for non-migratory birds or mosquito dispersal along the longitudinal gradient.
Affiliation(s)
- Simon Dellicour
- Spatial Epidemiology Lab (SpELL), Université Libre de Bruxelles, CP160/12, 50 Avenue FD Roosevelt, 1050, Bruxelles, Belgium.
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Herestraat 49, 3000, Leuven, Belgium.
| | - Sebastian Lequime
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Herestraat 49, 3000, Leuven, Belgium
| | - Bram Vrancken
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Herestraat 49, 3000, Leuven, Belgium
| | - Mandev S Gill
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Herestraat 49, 3000, Leuven, Belgium
| | - Paul Bastide
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Herestraat 49, 3000, Leuven, Belgium
| | - Karthik Gangavarapu
- Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Nathaniel L Matteson
- Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Yi Tan
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Infectious Diseases Group, J. Craig Venter Institute, Rockville, MD, USA
| | | | - Alexander A Fisher
- Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
| | - Martha I Nelson
- Fogarty International Center, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Marius Gilbert
- Spatial Epidemiology Lab (SpELL), Université Libre de Bruxelles, CP160/12, 50 Avenue FD Roosevelt, 1050, Bruxelles, Belgium
| | - Marc A Suchard
- Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, USA
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
| | - Kristian G Andersen
- Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Scripps Research Translational Institute, La Jolla, CA, 92037, USA
| | - Nathan D Grubaugh
- Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, 06510, USA
| | | | - Philippe Lemey
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Herestraat 49, 3000, Leuven, Belgium
|