1
Brusselmans M, Carvalho LM, Hong SL, Gao J, Matsen FA IV, Rambaut A, Lemey P, Suchard MA, Dudas G, Baele G. On the importance of assessing topological convergence in Bayesian phylogenetic inference. Virus Evol 2024; 10:veae081. [PMID: 39534377 PMCID: PMC11556345 DOI: 10.1093/ve/veae081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2024] [Revised: 08/15/2024] [Accepted: 09/23/2024] [Indexed: 11/16/2024]
Abstract
Modern phylogenetics research is often performed within a Bayesian framework, using sampling algorithms such as Markov chain Monte Carlo (MCMC) to approximate the posterior distribution. These algorithms require careful evaluation of the quality of the generated samples. Within the field of phylogenetics, one frequently adopted diagnostic approach is to evaluate the effective sample size and to investigate trace graphs of the sampled parameters. A major limitation of these approaches is that they are developed for continuous parameters and therefore incompatible with a crucial parameter in these inferences: the tree topology. Several recent advancements have aimed at extending these diagnostics to topological space. In this reflection paper, we present two case studies, one on Ebola virus and one on HIV, illustrating how these topological diagnostics can contain information not found in standard diagnostics, and how decisions regarding which of these diagnostics to compute can impact inferences regarding MCMC convergence and mixing. Our results show the importance of running multiple replicate analyses and of carefully assessing topological convergence using the output of these replicate analyses. To this end, we illustrate different ways of assessing and visualizing the topological convergence of these replicates. Given the major importance of detecting convergence and mixing issues in Bayesian phylogenetic analyses, the lack of a unified approach to this problem warrants further action, especially now that additional tools are becoming available to researchers.
Affiliation(s)
- Marius Brusselmans
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Luiz Max Carvalho
  - School of Applied Mathematics, Getulio Vargas Foundation, Praia de Botafogo, 190, 22250-900 Rio de Janeiro, Brazil
- Samuel L. Hong
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Jiansi Gao
  - Computational Biology Program, Fred Hutchinson Cancer Center, Seattle, WA 98109, United States
- Frederick A Matsen IV
  - Howard Hughes Medical Institute, Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States
  - Department of Genome Sciences, University of Washington, Seattle, Washington, United States
  - Department of Statistics, University of Washington, Seattle, Washington, United States
- Andrew Rambaut
  - Institute of Ecology and Evolution, University of Edinburgh, Edinburgh EH9 3FL, United Kingdom
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Marc A Suchard
  - Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA 90095, USA
- Gytis Dudas
  - Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
2
Brusselmans M, Carvalho LM, Hong SL, Gao J, Matsen FA, Rambaut A, Lemey P, Suchard MA, Dudas G, Baele G. On the importance of assessing topological convergence in Bayesian phylogenetic inference. arXiv 2024; arXiv:2402.11657v2. [PMID: 39253641 PMCID: PMC11383445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/11/2024]
Abstract
Modern phylogenetics research is often performed within a Bayesian framework, using sampling algorithms such as Markov chain Monte Carlo (MCMC) to approximate the posterior distribution. These algorithms require careful evaluation of the quality of the generated samples. Within the field of phylogenetics, one frequently adopted diagnostic approach is to evaluate the effective sample size (ESS) and to investigate trace graphs of the sampled parameters. A major limitation of these approaches is that they are developed for continuous parameters and therefore incompatible with a crucial parameter in these inferences: the tree topology. Several recent advancements have aimed at extending these diagnostics to topological space. In this reflection paper, we present two case studies - one on Ebola virus and one on HIV - illustrating how these topological diagnostics can contain information not found in standard diagnostics, and how decisions regarding which of these diagnostics to compute can impact inferences regarding MCMC convergence and mixing. Our results show the importance of running multiple replicate analyses and of carefully assessing topological convergence using the output of these replicate analyses. To this end, we illustrate different ways of assessing and visualizing the topological convergence of these replicates. Given the major importance of detecting convergence and mixing issues in Bayesian phylogenetic analyses, the lack of a unified approach to this problem warrants further action, especially now that additional tools are becoming available to researchers.
Affiliation(s)
- Marius Brusselmans
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Luiz Max Carvalho
  - School of Applied Mathematics, Getulio Vargas Foundation, Praia de Botafogo, 190, 22250-900, Rio de Janeiro, Brazil
- Samuel L. Hong
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Jiansi Gao
  - Computational Biology Program, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA
- Frederick A. Matsen
  - Howard Hughes Medical Institute, Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
  - Department of Statistics, University of Washington, Seattle, Washington, USA
- Andrew Rambaut
  - Institute of Ecology and Evolution, University of Edinburgh, Edinburgh EH9 3FL, UK
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Marc A. Suchard
  - Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA 90095, USA
- Gytis Dudas
  - Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
3
Suvorov A, Schrider DR. Reliable estimation of tree branch lengths using deep neural networks. PLoS Comput Biol 2024; 20:e1012337. [PMID: 39102450 PMCID: PMC11326709 DOI: 10.1371/journal.pcbi.1012337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Revised: 08/15/2024] [Accepted: 07/18/2024] [Indexed: 08/07/2024]
Abstract
A phylogenetic tree represents hypothesized evolutionary history for a set of taxa. Besides the branching patterns (i.e., tree topology), phylogenies contain information about the evolutionary distances (i.e., branch lengths) between all taxa in the tree, which include extant taxa (external nodes) and their last common ancestors (internal nodes). During phylogenetic tree inference, the branch lengths are typically co-estimated along with other phylogenetic parameters during tree topology space exploration. There are well-known regions of the branch length parameter space where accurate estimation of phylogenetic trees is especially difficult. Several novel studies have recently demonstrated that machine learning approaches have the potential to help solve phylogenetic problems with greater accuracy and computational efficiency. In this study, as a proof of concept, we sought to explore the possibility of machine learning models to predict branch lengths. To that end, we designed several deep learning frameworks to estimate branch lengths on fixed tree topologies from multiple sequence alignments or their representations. Our results show that deep learning methods can exhibit superior performance in some difficult regions of branch length parameter space. For example, in contrast to maximum likelihood inference, which is typically used for estimating branch lengths, deep learning methods are more efficient and accurate. In general, we find that our neural networks achieve similar accuracy to a Bayesian approach and are the best-performing methods when inferring long branches that are associated with distantly related taxa. Together, our findings represent a next step toward accurate, fast, and reliable phylogenetic inference with machine learning approaches.
Affiliation(s)
- Anton Suvorov
  - Department of Biological Sciences, Virginia Tech, Blacksburg, Virginia, United States of America
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Daniel R Schrider
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
4
Khurana MP, Scheidwasser-Clow N, Penn MJ, Bhatt S, Duchêne DA. The Limits of the Constant-rate Birth-Death Prior for Phylogenetic Tree Topology Inference. Syst Biol 2024; 73:235-246. [PMID: 38153910 PMCID: PMC11129600 DOI: 10.1093/sysbio/syad075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Revised: 12/20/2023] [Accepted: 12/27/2023] [Indexed: 12/30/2023]
Abstract
Birth-death models are stochastic processes describing speciation and extinction through time and across taxa and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under the constant-rate birth-death (crBD) model tend to differ from empirical trees, for example, with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between the crBD model and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which the crBD model differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD model trees frequently differ topologically compared with empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used Bayesian methods and crBD priors with those that used other non-crBD priors and non-Bayesian methods (i.e., maximum likelihood methods), we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using the crBD model in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. 
Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios but also highlight that empirically observed phylogenetic imbalance is highly improbable under the crBD model, leading to systematic bias in data sets with limited information content.
Affiliation(s)
- Mark P Khurana
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
- Neil Scheidwasser-Clow
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
- Matthew J Penn
  - Department of Statistics, University of Oxford, OX1 3LB, Oxford, UK
- Samir Bhatt
  - Section of Epidemiology, Department of Public Health, University of Copenhagen, 1352 Copenhagen, Denmark
  - MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, SW7 2AZ, London, UK
- David A Duchêne
  - Centre for Evolutionary Hologenomics, University of Copenhagen, 1352 Copenhagen, Denmark
5
Agranat-Tamir L, Mathur S, Rosenberg NA. Enumeration of Rooted Binary Unlabeled Galled Trees. Bull Math Biol 2024; 86:45. [PMID: 38519704 PMCID: PMC10959814 DOI: 10.1007/s11538-024-01270-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Accepted: 02/15/2024] [Indexed: 03/25/2024]
Abstract
Rooted binary galled trees generalize rooted binary trees to allow a restricted class of cycles, known as galls. We build upon the Wedderburn-Etherington enumeration of rooted binary unlabeled trees with $n$ leaves to enumerate rooted binary unlabeled galled trees with $n$ leaves, also enumerating rooted binary unlabeled galled trees with $n$ leaves and $g$ galls, $0 \leqslant g \leqslant \lfloor \frac{n-1}{2} \rfloor$. The enumerations rely on a recursive decomposition that considers subtrees descended from the nodes of a gall, adopting a restriction on galls that amounts to considering only the rooted binary normal unlabeled galled trees in our enumeration. We write an implicit expression for the generating function encoding the numbers of trees for all $n$. We show that the number of rooted binary unlabeled galled trees grows with $0.0779\,(4.8230^{n})\,n^{-3/2}$, exceeding the growth $0.3188\,(2.4833^{n})\,n^{-3/2}$ of the number of rooted binary unlabeled trees without galls. However, the growth of the number of galled trees with only one gall has the same exponential order $2.4833$ as the number with no galls, exceeding it only in the subexponential term, $0.3910\,n^{1/2}$ compared to $0.3188\,n^{-3/2}$. For a fixed number of leaves $n$, the number of galls $g$ that produces the largest number of rooted binary unlabeled galled trees lies intermediate between the minimum of $g = 0$ and the maximum of $g = \lfloor \frac{n-1}{2} \rfloor$. We discuss implications in mathematical phylogenetics.
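As an illustration of the gall-free baseline named in this abstract, the Wedderburn-Etherington numbers (rooted binary unlabeled trees with n leaves) can be computed with the classical recurrence and compared against the quoted growth 0.3188 (2.4833^n) n^(-3/2). This is a minimal sketch and not the authors' code: the function names are hypothetical, and the galled-tree enumeration itself is not reproduced here.

```python
def wedderburn_etherington(n_max):
    """Return a list w where w[n] counts rooted binary unlabeled trees with n leaves."""
    w = [0, 1]  # w[0] is a placeholder; a single leaf is the only 1-leaf tree
    for n in range(2, n_max + 1):
        # Unordered pairs of root subtrees with different leaf counts i < n - i.
        total = sum(w[i] * w[n - i] for i in range(1, (n + 1) // 2))
        if n % 2 == 0:
            # Both root subtrees have n/2 leaves: unordered pairs with repetition.
            half = w[n // 2]
            total += half * (half + 1) // 2
        w.append(total)
    return w


def quoted_growth(n):
    """Asymptotic approximation quoted in the abstract (constants rounded to four digits)."""
    return 0.3188 * 2.4833 ** n * n ** -1.5


if __name__ == "__main__":
    w = wedderburn_etherington(100)
    print(w[1:11])
    for n in (10, 50, 100):
        # The ratio should approach 1 as n grows, up to rounding of the constants.
        print(n, w[n] / quoted_growth(n))
```

Because the constants are rounded, the ratio is only expected to be close to 1 for moderately large n, not exactly 1.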
Affiliation(s)
- Shaili Mathur
  - Department of Biology, Stanford University, Stanford, CA, 94305, USA
- Noah A Rosenberg
  - Department of Biology, Stanford University, Stanford, CA, 94305, USA
6
Gascon M, El-Mabrouk N. On the complexity of non-binary tree reconciliation with endosymbiotic gene transfer. Algorithms Mol Biol 2023; 18:9. [PMID: 37518001 PMCID: PMC10388533 DOI: 10.1186/s13015-023-00231-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Accepted: 06/10/2023] [Indexed: 08/01/2023]
Abstract
Reconciling a non-binary gene tree with a binary species tree can be done efficiently in the absence of horizontal gene transfers, but becomes NP-hard in the presence of gene transfers. Here, we focus on the special case of endosymbiotic gene transfers (EGT), i.e. transfers between the mitochondrial and nuclear genome of the same species. More precisely, given a multifurcated (non-binary) gene tree with leaves labeled 0 or 1 depending on whether the corresponding genes belong to the mitochondrial or nuclear genome of the corresponding species, we investigate the problem of inferring a most parsimonious Duplication, Loss and EGT (DLE) Reconciliation of any binary refinement of the tree. We present a general two-steps method: ignoring the 0-1 labeling of leaves, output a binary resolution minimizing the Duplication and Loss (DL) Reconciliation and then, for such resolution, assign a known number of 0s and 1s to the leaves in a way minimizing EGT events. While the first step corresponds to the well studied non-binary DL-Reconciliation problem, the complexity of the label assignment problem corresponding to the second step is unknown. We show that this problem is NP-complete, even when the tree is restricted to a single polytomy, and even if transfers can occur in only one direction. We present a general algorithm solving each polytomy separately, which is shown optimal for a unitary cost of operation, and a polynomial-time algorithm for solving a polytomy in the special case where genes are specific to a single genome (mitochondrial or nuclear) in all but one species. This work represents the first algorithmic study for reconciliation with endosymbiotic gene transfers in the case of a multifurcated gene tree.
Affiliation(s)
- Mathieu Gascon
  - Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, Canada
- Nadia El-Mabrouk
  - Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, Canada
7
Sgarbossa D, Lupo U, Bitbol AF. Generative power of a protein language model trained on multiple sequence alignments. eLife 2023; 12:e79854. [PMID: 36734516 PMCID: PMC10038667 DOI: 10.7554/elife.79854] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/28/2022] [Accepted: 02/02/2023] [Indexed: 02/04/2023]
Abstract
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
Affiliation(s)
- Damiano Sgarbossa
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
8
Distributions of cherries and pitchforks for the Ford model. Theor Popul Biol 2023; 149:27-38. [PMID: 36566944 DOI: 10.1016/j.tpb.2022.12.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/02/2022] [Revised: 12/11/2022] [Accepted: 12/13/2022] [Indexed: 12/24/2022]
Abstract
Distributional properties of tree shape statistics under random phylogenetic tree models play an important role in investigating the evolutionary forces underlying the observed phylogenies. In this paper, we study two subtree counting statistics, the number of cherries and that of pitchforks for the Ford model, the alpha model introduced by Daniel Ford. It is a one-parameter family of random phylogenetic tree models which includes the proportional to distinguishable arrangement (PDA) and the Yule models, two tree models commonly used in phylogenetics. Based on a non-uniform version of the extended Pólya urn models in which negative entries are permitted for their replacement matrices, we obtain the strong law of large numbers and the central limit theorem for the joint distribution of these two statistics for the Ford model. Furthermore, we derive a recursive formula for computing the exact joint distribution of these two statistics. This leads to exact formulas for their means and higher order asymptotic expansions of their second moments, which allows us to identify a critical parameter value for the correlation between these two statistics. That is, when the number of tree leaves is sufficiently large, they are negatively correlated for 0≤α≤1/2 and positively correlated for 1/2<α<1.
9
Two results about the Sackin and Colless indices for phylogenetic trees and their shapes. J Math Biol 2022; 85:69. [PMID: 36418585 DOI: 10.1007/s00285-022-01831-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2022] [Revised: 08/27/2022] [Accepted: 10/23/2022] [Indexed: 11/25/2022]
Abstract
The Sackin and Colless indices are two widely-used metrics for measuring the balance of trees and for testing evolutionary models in phylogenetics. This short paper contributes two results about the Sackin and Colless indices of trees. One result is the asymptotic analysis of the expected Sackin and Colless indices of tree shapes (which are full binary rooted unlabelled trees) under the uniform model where tree shapes are sampled with equal probability. Another is a short direct proof of the closed formula for the expected Sackin index of phylogenetic trees (which are full binary rooted trees with leaves being labelled with taxa) under the uniform model.
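For readers unfamiliar with the two indices, the sketch below computes them under their usual definitions: the Sackin index is the sum of leaf depths, and the Colless index is the sum, over internal nodes, of the absolute difference between the leaf counts of the two child subtrees. The nested-tuple tree encoding and the function name are assumptions made for illustration and are not taken from the paper.

```python
def balance_indices(tree):
    """Return (n_leaves, sackin, colless) for a full binary rooted tree.

    Assumed encoding: a leaf is any non-tuple value (e.g. a taxon label),
    and an internal node is a 2-tuple (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        return 1, 0, 0  # a single leaf has depth 0 and no internal nodes
    left, right = tree
    n_left, sackin_left, colless_left = balance_indices(left)
    n_right, sackin_right, colless_right = balance_indices(right)
    n = n_left + n_right
    # Adding a root above both subtrees pushes every leaf one level deeper.
    sackin = sackin_left + sackin_right + n
    # The new root contributes the imbalance between its two subtrees.
    colless = colless_left + colless_right + abs(n_left - n_right)
    return n, sackin, colless


if __name__ == "__main__":
    caterpillar = ((("a", "b"), "c"), "d")  # maximally unbalanced on 4 leaves
    balanced = (("a", "b"), ("c", "d"))     # fully balanced on 4 leaves
    print(balance_indices(caterpillar))
    print(balance_indices(balanced))
```

On four leaves, the caterpillar tree scores higher than the balanced tree on both indices, which is the sense in which they measure imbalance.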
10
Hayati M, Chindelevitch L, Aanensen D, Colijn C. Deep clustering of bacterial tree images. Philos Trans R Soc Lond B Biol Sci 2022; 377:20210231. [PMID: 35989604 PMCID: PMC9393560 DOI: 10.1098/rstb.2021.0231] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/22/2021] [Accepted: 05/17/2022] [Indexed: 01/25/2023]
Abstract
The field of genomic epidemiology is rapidly growing as many jurisdictions begin to deploy whole-genome sequencing (WGS) in their national or regional pathogen surveillance programmes. WGS data offer a rich view of the shared ancestry of a set of taxa, typically visualized with phylogenetic trees illustrating the clusters or subtypes present in a group of taxa, their relatedness and the extent of diversification within and between them. When methicillin-resistant Staphylococcus aureus (MRSA) arose and disseminated widely, phylogenetic trees of MRSA-containing types of S. aureus had a distinctive 'comet' shape, with a 'comet head' of recently adapted drug-resistant isolates in the context of a 'comet tail' that was predominantly drug-sensitive. Placing an S. aureus isolate in the context of such a 'comet' helped public health laboratories interpret local data within the broader setting of S. aureus evolution. In this work, we ask what other tree shapes, analogous to the MRSA comet, are present in bacterial WGS datasets. We extract trees from large bacterial genomic datasets, visualize them as images and cluster the images. We find nine major groups of tree images, including the 'comets', star-like phylogenies, 'barbell' phylogenies and other shapes, and comment on the evolutionary and epidemiological stories these shapes might illustrate. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
Affiliation(s)
- Maryam Hayati
  - School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada V5A 1S6
- Leonid Chindelevitch
  - Department of Infectious Disease Epidemiology, Imperial College, Praed Street, London W2 1NY, UK
- David Aanensen
  - Big Data Institute, University of Oxford, Old Road Campus, Oxford OX3 7LF, UK
- Caroline Colijn
  - Department of Mathematics, Simon Fraser University, Burnaby, British Columbia, Canada V5A 1S6
11
Smith MR. Robust Analysis of Phylogenetic Tree Space. Syst Biol 2022; 71:1255-1270. [PMID: 34963003 PMCID: PMC9366458 DOI: 10.1093/sysbio/syab100] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/02/2020] [Revised: 12/03/2021] [Accepted: 12/23/2021] [Indexed: 11/13/2022]
Abstract
Phylogenetic analyses often produce large numbers of trees. Mapping trees' distribution in "tree space" can illuminate the behavior and performance of search strategies, reveal distinct clusters of optimal trees, and expose differences between different data sources or phylogenetic methods, but the high-dimensional spaces defined by metric distances are necessarily distorted when represented in fewer dimensions. Here, I explore the consequences of this transformation in phylogenetic search results from 128 morphological data sets, using stratigraphic congruence, a complementary aspect of tree similarity, to evaluate the utility of low-dimensional mappings. I find that phylogenetic similarities between cladograms are most accurately depicted in tree spaces derived from information-theoretic tree distances or the quartet distance. Robinson-Foulds tree spaces exhibit prominent distortions and often fail to group trees according to phylogenetic similarity, whereas the strong influence of tree shape on the Kendall-Colijn distance makes its tree space unsuitable for many purposes. Distances mapped into two or even three dimensions often display little correspondence with true distances, which can lead to profound misrepresentation of clustering structure. Without explicit testing, one cannot be confident that a tree space mapping faithfully represents the true distribution of trees, nor that visually evident structure is valid. My recommendations for tree space validation and visualization are implemented in a new graphical user interface in the "TreeDist" R package. [Multidimensional scaling; phylogenetic software; tree distance metrics; treespace projections.]
Affiliation(s)
- Martin R Smith
  - Department of Earth Sciences, Durham University, Durham, UK
12
Voznica J, Zhukova A, Boskova V, Saulnier E, Lemoine F, Moslonka-Lefebvre M, Gascuel O. Deep learning from phylogenies to uncover the epidemiological dynamics of outbreaks. Nat Commun 2022; 13:3896. [PMID: 35794110 PMCID: PMC9258765 DOI: 10.1038/s41467-022-31511-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 03/30/2021] [Accepted: 06/21/2022] [Indexed: 12/03/2022]
Abstract
Widely applicable, accurate and fast inference methods in phylodynamics are needed to fully profit from the richness of genetic data in uncovering the dynamics of epidemics. Standard methods, including maximum-likelihood and Bayesian approaches, generally rely on complex mathematical formulae and approximations, and do not scale with dataset size. We develop a likelihood-free, simulation-based approach, which combines deep learning with (1) a large set of summary statistics measured on phylogenies or (2) a complete and compact representation of trees, which avoids potential limitations of summary statistics and applies to any phylodynamics model. Our method enables both model selection and estimation of epidemiological parameters from very large phylogenies. We demonstrate its speed and accuracy on simulated data, where it performs better than the state-of-the-art methods. To illustrate its applicability, we assess the dynamics induced by superspreading individuals in an HIV dataset of men-having-sex-with-men in Zurich. Our tool PhyloDeep is available on github.com/evolbioinfo/phylodeep.
Affiliation(s)
- J Voznica
  - Institut Pasteur, Université Paris Cité, Unité Bioinformatique Evolutive, Paris, France
  - Université de Paris, Paris, France
  - Institut de Biologie de l'École Normale Supérieure, Ecole Normale Supérieure, CNRS, INSERM, Université Paris Sciences et Lettres, Paris, France
- A Zhukova
  - Institut Pasteur, Université Paris Cité, Unité Bioinformatique Evolutive, Paris, France
  - Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, France
  - Institut Pasteur, Université Paris Cité, Epidemiology and Modelling of Antibiotic Evasion, Paris, France
  - Université Paris-Saclay, UVSQ, Inserm, CESP, Villejuif, France
- V Boskova
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna, Austria
- E Saulnier
  - Institut Pasteur, Université Paris Cité, Unité Bioinformatique Evolutive, Paris, France
- F Lemoine
  - Institut Pasteur, Université Paris Cité, Unité Bioinformatique Evolutive, Paris, France
  - Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, France
- M Moslonka-Lefebvre
  - Institut Pasteur, Université Paris Cité, Unité Bioinformatique Evolutive, Paris, France
- O Gascuel
  - Institut Pasteur, Université Paris Cité, Unité Bioinformatique Evolutive, Paris, France
  - Institut de Systématique, Evolution, Biodiversité (UMR 7205 - CNRS, Muséum National d'Histoire Naturelle, SU, EPHE, UA), Paris, France
13
Cappello L, Kim J, Liu S, Palacios JA. Statistical Challenges in Tracking the Evolution of SARS-CoV-2. Stat Sci 2022; 37:162-182. [PMID: 36034090 PMCID: PMC9409356 DOI: 10.1214/22-sts853] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
Abstract
Genomic surveillance of SARS-CoV-2 has been instrumental in tracking the spread and evolution of the virus during the pandemic. The availability of SARS-CoV-2 molecular sequences isolated from infected individuals, coupled with phylodynamic methods, has provided insights into the origin of the virus, its evolutionary rate, the timing of introductions, the patterns of transmission, and the rise of novel variants that have spread through populations. Despite enormous global efforts of governments, laboratories, and researchers to collect and sequence molecular data, many challenges remain in analyzing and interpreting the data collected. Here, we describe the models and methods currently used to monitor the spread of SARS-CoV-2, discuss long-standing and new statistical challenges, and propose a method for tracking the rise of novel variants during the epidemic.
Affiliation(s)
- Lorenzo Cappello
- Departments of Economics and Business, Universitat Pompeu Fabra, 08005, Spain
| | - Jaehee Kim
- Department of Computational Biology, Cornell University, Ithaca, New York 14853, USA
| | - Sifan Liu
- Department of Statistics, Stanford University, Stanford, California 94305, USA
| | - Julia A Palacios
- Departments of Statistics and Biomedical Data Sciences, Stanford University, Stanford, California 94305, USA
| |
|
14
|
Lynch AR, Arp NL, Zhou AS, Weaver BA, Burkard ME. Quantifying chromosomal instability from intratumoral karyotype diversity using agent-based modeling and Bayesian inference. eLife 2022; 11:e69799. [PMID: 35380536 PMCID: PMC9054132 DOI: 10.7554/elife.69799] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 04/27/2021] [Accepted: 04/01/2022] [Indexed: 12/03/2022] Open
Abstract
Chromosomal instability (CIN), the persistent gain or loss of chromosomes through abnormal mitotic segregation, is a hallmark of cancer that drives aneuploidy. Intrinsic chromosome mis-segregation rate, a measure of CIN, can inform prognosis and is a promising biomarker for response to anti-microtubule agents. However, existing methodologies to measure this rate are labor intensive, indirect, and confounded by selection against aneuploid cells, which reduces observable diversity. We developed a framework to measure CIN, accounting for karyotype selection, using simulations with various levels of CIN and models of selection. To identify the model parameters that best fit karyotype data from single-cell sequencing, we used approximate Bayesian computation to infer mis-segregation rates and karyotype selection. Experimental validation confirmed the extensive chromosome mis-segregation rates caused by the chemotherapy paclitaxel (18.5 ± 0.5/division). Extending this approach to clinical samples revealed that inferred rates fell within direct observations of cancer cell lines. This work provides the necessary framework to quantify CIN in human tumors and develop it as a predictive biomarker.
Affiliation(s)
- Andrew R Lynch
- Carbone Cancer Center, University of Wisconsin-Madison, Madison, United States
- McArdle Laboratory for Cancer Research, University of Wisconsin-Madison, Madison, United States
| | - Nicholas L Arp
- Carbone Cancer Center, University of Wisconsin-Madison, Madison, United States
| | - Amber S Zhou
- Carbone Cancer Center, University of Wisconsin-Madison, Madison, United States
- McArdle Laboratory for Cancer Research, University of Wisconsin-Madison, Madison, United States
| | - Beth A Weaver
- Carbone Cancer Center, University of Wisconsin-Madison, Madison, United States
- McArdle Laboratory for Cancer Research, University of Wisconsin-Madison, Madison, United States
- Department of Cell and Regenerative Biology, University of Wisconsin, Madison, United States
| | - Mark E Burkard
- Carbone Cancer Center, University of Wisconsin-Madison, Madison, United States
- McArdle Laboratory for Cancer Research, University of Wisconsin-Madison, Madison, United States
- Division of Hematology Medical Oncology and Palliative Care, Department of Medicine, University of Wisconsin, Madison, United States
| |
|
15
|
OUP accepted manuscript. Syst Biol 2022; 71:1378-1390. [DOI: 10.1093/sysbio/syac008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/11/2020] [Revised: 02/05/2022] [Accepted: 02/08/2022] [Indexed: 11/12/2022] Open
|
16
|
Chindelevitch L, Hayati M, Poon AFY, Colijn C. Network science inspires novel tree shape statistics. PLoS One 2021; 16:e0259877. [PMID: 34941890 PMCID: PMC8699983 DOI: 10.1371/journal.pone.0259877] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/11/2021] [Accepted: 10/28/2021] [Indexed: 11/18/2022] Open
Abstract
The shape of phylogenetic trees can be used to gain evolutionary insights. A tree’s shape specifies the connectivity of a tree, while its branch lengths reflect either the time or genetic distance between branching events; well-known measures of tree shape include the Colless and Sackin imbalance, which describe the asymmetry of a tree. In other contexts, network science has become an important paradigm for describing structural features of networks and using them to understand complex systems, ranging from protein interactions to social systems. Network science is thus a potential source of many novel ways to characterize tree shape, as trees are also networks. Here, we tailor tools from network science, including diameter, average path length, and betweenness, closeness, and eigenvector centrality, to summarize phylogenetic tree shapes. We thereby propose tree shape summaries that are complementary to both asymmetry and the frequencies of small configurations. These new statistics can be computed in linear time and scale well to describe the shapes of large trees. We apply these statistics, alongside some conventional tree statistics, to phylogenetic trees from three very different viruses (HIV, dengue fever and measles), from the same virus in different epidemiological scenarios (influenza A and HIV) and from simulation models known to produce trees with different shapes. Using mutual information and supervised learning algorithms, we find that the statistics adapted from network science perform as well as or better than conventional statistics. We describe their distributions and prove some basic results about their extreme values in a tree. We conclude that network science-based tree shape summaries are a promising addition to the toolkit of tree shape features. All our shape summaries, as well as functions to select the most discriminating ones for two sets of trees, are freely available as an R package at http://github.com/Leonardini/treeCentrality.
Affiliation(s)
- Leonid Chindelevitch
- MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, United Kingdom
| | - Maryam Hayati
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
| | - Art F. Y. Poon
- Department of Pathology & Laboratory Medicine, University of Western Ontario, London, ON, Canada
| | - Caroline Colijn
- Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
| |
|
17
|
Gene flow in phylogenomics: Sequence capture resolves species limits and biogeography of Afromontane forest endemic frogs from the Cameroon Highlands. Mol Phylogenet Evol 2021; 163:107258. [PMID: 34252546 DOI: 10.1016/j.ympev.2021.107258] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/08/2020] [Revised: 06/28/2021] [Accepted: 07/07/2021] [Indexed: 11/21/2022]
Abstract
Puddle frogs of the Phrynobatrachus steindachneri species complex are a useful group for investigating speciation and phylogeography in Afromontane forests of the Cameroon Volcanic Line, western Central Africa. The species complex is represented by six morphologically relatively cryptic mitochondrial DNA lineages, only two of which are distinguished at the species level - southern P. jimzimkusi and Lake Oku endemic P. njiomock, leaving the remaining four lineages identified as 'P. steindachneri'. In this study, the six mtDNA lineages are subjected to genomic sequence capture analyses and morphological examination to delimit species and to study biogeography. The nuclear DNA data (387 loci; 571,936 aligned base pairs) distinguished all six mtDNA lineages, but the topological pattern and divergence depths supported only four main clades: P. jimzimkusi, P. njiomock, and only two divergent evolutionary lineages within the four 'P. steindachneri' mtDNA lineages. One of the two lineages is herein described as a new species, P. amieti sp. nov. Reticulate evolution (hybridization) was detected within the species complex with morphologically intermediate hybrid individuals placed between the parental species in phylogenomic analyses, forming a ladder-like phylogenetic pattern. The presence of hybrids is undesirable in standard phylogenetic analyses but is essential and beneficial in the network multispecies coalescent. This latter approach provided insight into the reticulate evolutionary history of these endemic frogs. Introgressions likely occurred during the Middle and Late Pleistocene climatic oscillations, due to the cyclic connections (likely dominating during cold glacials) and separations (during warm interglacials) of montane forests. The genomic phylogeographic pattern supports the separation of the southern (Mt. Manengouba to Mt. Oku) and northern mountains at the onset of the Pleistocene. 
Further subdivisions occurred in the Early Pleistocene, separating populations from the northernmost (Tchabal Mbabo, Gotel Mts.) and middle mountains (Mt. Mbam, Mt. Oku, Mambilla Plateau), as well as the microendemic lineage restricted to Lake Oku (Mt. Oku). This unique model system is highly threatened as all the species within the complex have exhibited severe population declines in the past decade, placing them on the brink of extinction. In addition, Mount Oku is identified to be of particular conservation importance because it harbors three species of this complex. We, therefore, urge for conservation actions in the Cameroon Highlands to preserve their diversity before it is too late.
|
18
|
Adams RH, Blackmon H, DeGiorgio M. Of Traits and Trees: Probabilistic Distances under Continuous Trait Models for Dissecting the Interplay among Phylogeny, Model, and Data. Syst Biol 2021; 70:660-680. [PMID: 33587145 PMCID: PMC8208806 DOI: 10.1093/sysbio/syab009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2019] [Accepted: 02/01/2021] [Indexed: 12/03/2022] Open
Abstract
Stochastic models of character trait evolution have become a cornerstone of evolutionary biology in an array of contexts. While probabilistic models have been used extensively for statistical inference, they have largely been ignored for the purpose of measuring distances between phylogeny-aware models. Recent contributions to the problem of phylogenetic distance computation have highlighted the importance of explicitly considering evolutionary model parameters and their impacts on molecular sequence data when quantifying dissimilarity between trees. By comparing two phylogenies in terms of their induced probability distributions that are functions of many model parameters, these distances can be more informative than traditional approaches that rely strictly on differences in topology or branch lengths alone. Currently, however, these approaches are designed for comparing models of nucleotide substitution and gene tree distributions, and thus, are unable to address other classes of traits and associated models that may be of interest to evolutionary biologists. Here, we expand the principles of probabilistic phylogenetic distances to compute tree distances under models of continuous trait evolution along a phylogeny. By explicitly considering both the degree of relatedness among species and the evolutionary processes that collectively give rise to character traits, these distances provide a foundation for comparing models and their predictions, and for quantifying the impacts of assuming one phylogenetic background over another while studying the evolution of a particular trait. We demonstrate the properties of these approaches using theory, simulations, and several empirical data sets that highlight potential uses of probabilistic distances in many scenarios. 
We also introduce an open-source R package named PRDATR for easy application by the scientific community for computing phylogenetic distances under models of character trait evolution. [Brownian motion; comparative methods; phylogeny; quantitative traits.]
Affiliation(s)
- Richard H Adams
- Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Heath Blackmon
- Department of Biology, Texas A&M University, College Station, TX 77843, USA
| | - Michael DeGiorgio
- Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| |
|
19
|
Rosenberg NA. On the Colijn-Plazzotta numbering scheme for unlabeled binary rooted trees. Discrete Applied Mathematics 2021; 291:88-98. [PMID: 33364668 PMCID: PMC7751944 DOI: 10.1016/j.dam.2020.11.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Colijn & Plazzotta (Syst. Biol. 67:113-126, 2018) introduced a scheme for bijectively associating the unlabeled binary rooted trees with the positive integers. First, the rank 1 is associated with the 1-leaf tree. Proceeding recursively, ordered pair $(k_1, k_2)$, $k_1 \geq k_2 \geq 1$, is then associated with the tree whose left subtree has rank $k_1$ and whose right subtree has rank $k_2$. Following dictionary order on ordered pairs, the tree whose left and right subtrees have the ordered pair of ranks $(k_1, k_2)$ is assigned rank $k_1(k_1 - 1)/2 + 1 + k_2$. With this ranking, given a number of leaves $n$, we determine recursions for $a_n$, the smallest rank assigned to some tree with $n$ leaves, and $b_n$, the largest rank assigned to some tree with $n$ leaves. The smallest rank $a_n$ is assigned to the maximally balanced tree, and the largest rank $b_n$ is assigned to the caterpillar. For $n$ equal to a power of 2, the value of $a_n$ is seen to increase exponentially with $2\alpha^n$ for a constant $\alpha \approx 1.24602$; more generally, we show it is bounded $a_n < 1.5^n$. The value of $b_n$ is seen to increase with $2\beta^{(2^n)}$ for a constant $\beta \approx 1.05653$. The great difference in the rates of increase for $a_n$ and $b_n$ indicates that as the index $v$ is incremented, the number of leaves for the tree associated with rank $v$ quickly traverses a wide range of values. We interpret the results in relation to applications in evolutionary biology.
Affiliation(s)
- Noah A Rosenberg
- Department of Biology, Stanford University, Stanford, CA 94305 USA
| |
|
20
|
Sallam M, Ababneh NA, Dababseh D, Bakri FG, Mahafzah A. Temporal increase in D614G mutation of SARS-CoV-2 in the Middle East and North Africa. Heliyon 2021; 7:e06035. [PMID: 33495741 PMCID: PMC7817394 DOI: 10.1016/j.heliyon.2021.e06035] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 11/02/2020] [Revised: 12/06/2020] [Accepted: 01/15/2021] [Indexed: 02/09/2023] Open
Abstract
BACKGROUND Phylogeny construction can help to reveal evolutionary relatedness among molecular sequences. The spike (S) gene of SARS-CoV-2 is the subject of an immune selective pressure which increases the variability in such region. This study aimed to identify mutations in the S gene among SARS-CoV-2 sequences collected in the Middle East and North Africa (MENA), focusing on the D614G mutation, which has a presumed fitness advantage. Another aim was to analyze the S gene sequences phylogenetically. METHODS The SARS-CoV-2 S gene sequences collected in the MENA were retrieved from the GISAID public database, together with its metadata. Mutation analysis was conducted in Molecular Evolutionary Genetics Analysis software. Phylogenetic analysis was done using maximum likelihood (ML) and Bayesian methods. RESULT A total of 553 MENA sequences were analyzed and the most frequent S gene mutations included: D614G = 435, Q677H = 8, and V6F = 5. A significant increase in the proportion of D614G was noticed from (63.0%) in February 2020, to (98.5%) in June 2020 (p < 0.001). Two large phylogenetic clusters were identified via ML analysis, which showed evidence of inter-country mixing of sequences, which dated back to February 8, 2020 and March 15, 2020 (median estimates). The mean evolutionary rate for SARS-CoV-2 was about $6.5 \times 10^{-3}$ substitutions/site/year based on large clusters' Bayesian analyses. CONCLUSIONS The D614G mutation appeared to be taking over the COVID-19 infections in the MENA. Bayesian analysis suggested that SARS-CoV-2 might have been circulating in MENA earlier than previously reported.
Affiliation(s)
- Malik Sallam
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
- Department of Translational Medicine, Faculty of Medicine, Lund University, Malmö, Sweden
| | - Nidaa A. Ababneh
- Cell Therapy Center (CTC), The University of Jordan, Amman, Jordan
| | - Deema Dababseh
- School of Dentistry, The University of Jordan, Amman, Jordan
| | - Faris G. Bakri
- Department of Internal Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Internal Medicine, Jordan University Hospital, Amman, Jordan
- Infectious Diseases and Vaccine Center, University of Jordan, Amman, Jordan
| | - Azmi Mahafzah
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
| |
|
21
|
Briand S, Dessimoz C, El-Mabrouk N, Lafond M, Lobinska G. A generalized Robinson-Foulds distance for labeled trees. BMC Genomics 2020; 21:779. [PMID: 33208096 PMCID: PMC7677779 DOI: 10.1186/s12864-020-07011-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The Robinson-Foulds (RF) distance is a well-established measure between phylogenetic trees. Despite a lack of biological justification, it has the advantages of being a proper metric and being computable in linear time. For phylogenetic applications involving genes, however, a crucial aspect of the trees ignored by the RF metric is the type of the branching event (e.g. speciation, duplication, transfer, etc). RESULTS We extend RF to trees with labeled internal nodes by including a node flip operation, alongside edge contractions and extensions. We explore properties of this extended RF distance in the case of a binary labeling. In particular, we show that contrary to the unlabeled case, an optimal edit path may require contracting "good" edges, i.e. edges shared between the two trees. CONCLUSIONS We provide a 2-approximation algorithm which is shown to perform well empirically. Looking ahead, computing distances between labeled trees opens up a variety of new algorithmic directions. Implementation and simulations available at https://github.com/DessimozLab/pylabeledrf.
Affiliation(s)
- Samuel Briand
- Computer Science Department, Université de Montréal, Montreal, Canada
| | - Christophe Dessimoz
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Department of Genetics Evolution and Environment, University College London, London, UK
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Department of Computer Science, University College London, London, UK
| | - Nadia El-Mabrouk
- Computer Science Department, Université de Montréal, Montreal, Canada
| | - Manuel Lafond
- Computer Science Department, Université de Sherbrooke, Sherbrooke, Canada
| | - Gabriela Lobinska
- Department of Genetics Evolution and Environment, University College London, London, UK
| |
|
22
|
Abstract
Genealogical tree modeling is essential for estimating evolutionary parameters in population genetics and phylogenetics. Recent mathematical results concerning ranked genealogies without leaf labels unlock opportunities in the analysis of evolutionary trees. In particular, comparisons between ranked genealogies facilitate the study of evolutionary processes of different organisms sampled at multiple time periods. We propose metrics on ranked tree shapes and ranked genealogies for lineages isochronously and heterochronously sampled. Our proposed tree metrics make it possible to conduct statistical analyses of ranked tree shapes and timed ranked tree shapes or ranked genealogies. Such analyses allow us to assess differences in tree distributions, quantify estimation uncertainty, and summarize tree distributions. We show the utility of our metrics via simulations and an application in infectious diseases.
Affiliation(s)
- Jaehee Kim
- Department of Biology, Stanford University, Stanford, CA 94305
| | - Julia A Palacios
- Department of Statistics, Stanford University, Stanford, CA 94305
- Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA 94305
| |
|
23
|
Dearlove B, Lewitus E, Bai H, Li Y, Reeves DB, Joyce MG, Scott PT, Amare MF, Vasan S, Michael NL, Modjarrad K, Rolland M. A SARS-CoV-2 vaccine candidate would likely match all currently circulating variants. Proc Natl Acad Sci U S A 2020; 117:23652-23662. [PMID: 32868447 PMCID: PMC7519301 DOI: 10.1073/pnas.2008281117] [Citation(s) in RCA: 148] [Impact Index Per Article: 29.6] [Indexed: 12/26/2022] Open
Abstract
The magnitude of the COVID-19 pandemic underscores the urgency for a safe and effective vaccine. Many vaccine candidates focus on the Spike protein, as it is targeted by neutralizing antibodies and plays a key role in viral entry. Here we investigate the diversity seen in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences and compare it to the sequence on which most vaccine candidates are based. Using 18,514 sequences, we perform phylogenetic, population genetics, and structural bioinformatics analyses. We find limited diversity across SARS-CoV-2 genomes: Only 11 sites show polymorphisms in >5% of sequences; yet two mutations, including the D614G mutation in Spike, have already become consensus. Because SARS-CoV-2 is being transmitted more rapidly than it evolves, the viral population is becoming more homogeneous, with a median of seven nucleotide substitutions between genomes. There is evidence of purifying selection but little evidence of diversifying selection, with substitution rates comparable across structural versus nonstructural genes. Finally, the Wuhan-Hu-1 reference sequence for the Spike protein, which is the basis for different vaccine candidates, matches optimized vaccine inserts, being identical to an ancestral sequence and one mutation away from the consensus. While the rapid spread of the D614G mutation warrants further study, our results indicate that drift and bottleneck events can explain the minimal diversity found among SARS-CoV-2 sequences. These findings suggest that a single vaccine candidate should be efficacious against currently circulating lineages.
Affiliation(s)
- Bethany Dearlove
- Emerging Infectious Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- US Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- Henry M. Jackson Foundation for the Advancement of Military Medicine, Bethesda, MD 20817
- Center for Infectious Diseases Research, Walter Reed Army Institute of Research, Silver Spring, MD 20910
| | - Eric Lewitus
- Emerging Infectious Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- US Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- Henry M. Jackson Foundation for the Advancement of Military Medicine, Bethesda, MD 20817
- Center for Infectious Diseases Research, Walter Reed Army Institute of Research, Silver Spring, MD 20910
| | - Hongjun Bai
- Emerging Infectious Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- US Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- Henry M. Jackson Foundation for the Advancement of Military Medicine, Bethesda, MD 20817
- Center for Infectious Diseases Research, Walter Reed Army Institute of Research, Silver Spring, MD 20910
| | - Yifan Li
- Emerging Infectious Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- US Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- Henry M. Jackson Foundation for the Advancement of Military Medicine, Bethesda, MD 20817
- Center for Infectious Diseases Research, Walter Reed Army Institute of Research, Silver Spring, MD 20910
| | - Daniel B Reeves
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109
| | - M Gordon Joyce
- Emerging Infectious Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- Henry M. Jackson Foundation for the Advancement of Military Medicine, Bethesda, MD 20817
| | - Paul T Scott
- Emerging Infectious Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910
| | - Mihret F Amare
- Emerging Infectious Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- Henry M. Jackson Foundation for the Advancement of Military Medicine, Bethesda, MD 20817
| | - Sandhya Vasan
- US Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- Henry M. Jackson Foundation for the Advancement of Military Medicine, Bethesda, MD 20817
- Center for Infectious Diseases Research, Walter Reed Army Institute of Research, Silver Spring, MD 20910
| | - Nelson L Michael
- Center for Infectious Diseases Research, Walter Reed Army Institute of Research, Silver Spring, MD 20910
| | - Kayvon Modjarrad
- Emerging Infectious Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- Center for Infectious Diseases Research, Walter Reed Army Institute of Research, Silver Spring, MD 20910
| | - Morgane Rolland
- Emerging Infectious Diseases Branch, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- US Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD 20910
- Henry M. Jackson Foundation for the Advancement of Military Medicine, Bethesda, MD 20817
- Center for Infectious Diseases Research, Walter Reed Army Institute of Research, Silver Spring, MD 20910
| |
|
24
|
Inter- and intraspecies comparison of phylogenetic fingerprints and sequence diversity of immunoglobulin variable genes. Immunogenetics 2020; 72:279-294. [PMID: 32367185 DOI: 10.1007/s00251-020-01164-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/24/2020] [Accepted: 04/13/2020] [Indexed: 10/24/2022]
Abstract
Protection and neutralization of a vast array of pathogens is accomplished by the tremendous diversity of the B cell receptor (BCR) repertoire. For jawed vertebrates, this diversity is initiated via the somatic recombination of immunoglobulin (Ig) germline elements. While it is clear that the number of these germline segments differs from species to species, the extent of cross-species sequence diversity remains largely uncharacterized. Here we use extensive computational and statistical methods to investigate the sequence diversity and evolutionary relationship between Ig variable (V), diversity (D), and joining (J) germline segments across nine commonly studied species ranging from zebrafish to human. Metrics such as guanine-cytosine (GC) content showed low redundancy across Ig germline genes within a given species. Other comparisons, including amino acid motifs, evolutionary selection, and sequence diversity, revealed species-specific properties. Additionally, we showed that the germline-encoded diversity differs across antibody (recombined V-D-J) repertoires of various B cell subsets. To facilitate future comparative immunogenomics analysis, we created VDJgermlines, an R package that contains the germline sequences from multiple species. Our study informs strategies for the humanization and engineering of therapeutic antibodies.
|
25
|
Scale-invariant topology and bursty branching of evolutionary trees emerge from niche construction. Proc Natl Acad Sci U S A 2020; 117:7879-7887. [PMID: 32209672 DOI: 10.1073/pnas.1915088117] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 01/03/2023] Open
Abstract
Phylogenetic trees describe both the evolutionary process and community diversity. Recent work has established that they exhibit scale-invariant topology, which quantifies the fact that their branching lies in between the two extreme cases of balanced binary trees and maximally unbalanced ones. In addition, the backbones of phylogenetic trees exhibit bursts of diversification on all timescales. Here, we present a simple, coarse-grained statistical model of niche construction coupled to speciation. Finite-size scaling analysis of the dynamics shows that the resultant phylogenetic tree topology is scale-invariant due to a singularity arising from large niche construction fluctuations that follow extinction events. The same model recapitulates the bursty pattern of diversification in time. These results show how dynamical scaling laws of phylogenetic trees on long timescales can reflect the indelible imprint of the interplay between ecological and evolutionary processes.
|
26
|
Hong SL, Dellicour S, Vrancken B, Suchard MA, Pyne MT, Hillyard DR, Lemey P, Baele G. In Search of Covariates of HIV-1 Subtype B Spread in the United States-A Cautionary Tale of Large-Scale Bayesian Phylogeography. Viruses 2020; 12:v12020182. [PMID: 32033422 PMCID: PMC7077180 DOI: 10.3390/v12020182] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/20/2019] [Revised: 01/24/2020] [Accepted: 01/28/2020] [Indexed: 12/21/2022] Open
Abstract
Infections with HIV-1 group M subtype B viruses account for the majority of the HIV epidemic in the Western world. Phylogeographic studies have placed the introduction of subtype B in the United States in New York around 1970, where it grew into a major source of spread. Currently, it is estimated that over one million people are living with HIV in the US and that most are infected with subtype B variants. Here, we aim to identify the drivers of HIV-1 subtype B dispersal in the United States by analyzing a collection of 23,588 pol sequences, collected for drug resistance testing from 45 states during 2004-2011. To this end, we introduce a workflow to reduce this large collection of data to more computationally-manageable sample sizes and apply the BEAST framework to test which covariates associate with the spread of HIV-1 across state borders. Our results show that we are able to consistently identify certain predictors of spread under reasonable run times across datasets of up to 10,000 sequences. However, the general lack of phylogenetic structure and the high uncertainty associated with HIV trees make it difficult to interpret the epidemiological relevance of the drivers of spread we are able to identify. While the workflow we present here could be applied to other virus datasets of a similar scale, the characteristic star-like shape of HIV-1 phylogenies poses a serious obstacle to reconstructing a detailed evolutionary and spatial history for HIV-1 subtype B in the US.
Affiliation(s)
- Samuel L. Hong
  - Department of Microbiology, Immunology and Transplantation, Rega Institute for Medical Research, KU Leuven, 3000 Leuven, Belgium
- Simon Dellicour
  - Department of Microbiology, Immunology and Transplantation, Rega Institute for Medical Research, KU Leuven, 3000 Leuven, Belgium
  - Spatial Epidemiology Lab (SpELL), Université Libre de Bruxelles, 1050 Brussels, Belgium
- Bram Vrancken
  - Department of Microbiology, Immunology and Transplantation, Rega Institute for Medical Research, KU Leuven, 3000 Leuven, Belgium
- Marc A. Suchard
  - Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, CA 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, CA 90095, USA
  - Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA 90095, USA
- Michael T. Pyne
  - ARUP Institute for Clinical and Experimental Pathology, Salt Lake City, UT 84108, USA
- David R. Hillyard
  - Department of Pathology, University of Utah, Salt Lake City, UT 84112, USA
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute for Medical Research, KU Leuven, 3000 Leuven, Belgium
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute for Medical Research, KU Leuven, 3000 Leuven, Belgium
|
27
|
Avino M, Ng GT, He Y, Renaud MS, Jones BR, Poon AFY. Tree shape-based approaches for the comparative study of cophylogeny. Ecol Evol 2019; 9:6756-6771. [PMID: 31312429 PMCID: PMC6618157 DOI: 10.1002/ece3.5185] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 01/23/2019] [Revised: 02/21/2019] [Accepted: 03/29/2019] [Indexed: 12/17/2022]
Abstract
Cophylogeny is the congruence of phylogenetic relationships between two different groups of organisms due to their long-term interaction. We investigated the use of tree shape distance measures to quantify the degree of cophylogeny. We implemented a reverse-time simulation model of pathogen phylogenies within a fixed host tree, given cospeciation probability, host switching, and pathogen speciation rates. We used this model to evaluate 18 distance measures between host and pathogen trees including two kernel distances that we developed for labeled and unlabeled trees, which use branch lengths and accommodate different size trees. Finally, we used these measures to revisit published cophylogenetic studies, where authors described the observed associations as representing a high or low degree of cophylogeny. Our simulations demonstrated that some measures are more informative than others with respect to specific coevolution parameters especially when these did not assume extreme values. For real datasets, trees' associations projection revealed clustering of high concordance studies suggesting that investigators are describing it in a consistent way. Our results support the hypothesis that measures can be useful for quantifying cophylogeny. This motivates their usage in the field of coevolution and supports the development of simulation-based methods, i.e., approximate Bayesian computation, to estimate the underlying coevolutionary parameters.
Affiliation(s)
- Mariano Avino
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
- Garway T Ng
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
- Yiying He
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
- Mathias S Renaud
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
- Bradley R Jones
  - BC Centre for Excellence in HIV/AIDS, Vancouver, British Columbia, Canada
- Art F Y Poon
  - Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada
  - Department of Applied Mathematics, Western University, London, Ontario, Canada
|