1
Kramer AM, Thornlow B, Ye C, De Maio N, McBroome J, Hinrichs AS, Lanfear R, Turakhia Y, Corbett-Detig R. Online Phylogenetics with matOptimize Produces Equivalent Trees and is Dramatically More Efficient for Large SARS-CoV-2 Phylogenies than de novo and Maximum-Likelihood Implementations. Syst Biol 2023; 72:1039-1051. [PMID: 37232476 PMCID: PMC10627557 DOI: 10.1093/sysbio/syad031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2021] [Revised: 05/14/2023] [Accepted: 06/22/2023] [Indexed: 05/27/2023]
Abstract
Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 data sets do not fit this mold. There are currently over 14 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) and pseudo-ML methods may be more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger data sets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML, pseudo-ML, and MP frameworks for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimization with UShER and matOptimize produces equivalent SARS-CoV-2 phylogenies to some of the most popular ML and pseudo-ML inference tools. 
MP optimization with UShER and matOptimize is thousands of times faster than presently available implementations of ML, and online phylogenetics is faster than de novo inference. Our results therefore suggest that parsimony-based methods like UShER and matOptimize represent an accurate and more practical alternative to established ML implementations for large SARS-CoV-2 phylogenies and could be successfully applied to other similar data sets with particularly dense sampling and short branch lengths.
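As a rough illustration of the parsimony criterion this abstract refers to, the sketch below counts the minimum number of changes that a single alignment site requires on a fixed rooted binary tree, using Fitch's small-parsimony algorithm. It is a textbook example, not the UShER or matOptimize implementation; the nested-tuple tree encoding, the function name `fitch_score`, and the four-sample site are assumptions made up for this example.

```python
def fitch_score(tree, leaf_states):
    """Fitch small parsimony for one site on a rooted binary tree.

    tree: a leaf name (str) or a 2-tuple (left_subtree, right_subtree).
    leaf_states: dict mapping each leaf name to its observed nucleotide.
    Returns (candidate_state_set, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        # A leaf contributes its observed state and no changes.
        return {leaf_states[tree]}, 0

    left_set, left_cost = fitch_score(tree[0], leaf_states)
    right_set, right_cost = fitch_score(tree[1], leaf_states)

    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed here.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical site observed in four samples.
site = {"s1": "A", "s2": "A", "s3": "G", "s4": "G"}

# Two candidate topologies for the same samples.
grouped = (("s1", "s2"), ("s3", "s4"))
split = (("s1", "s3"), ("s2", "s4"))

score_grouped = fitch_score(grouped, site)[1]
score_split = fitch_score(split, site)[1]
print(score_grouped, score_split)
```

On this toy site the topology that groups identical samples needs one change and the alternative needs two, so a parsimony search would prefer the former.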
Affiliation(s)
- Alexander M Kramer
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Bryan Thornlow
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Cheng Ye
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK
- Jakob McBroome
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Angie S Hinrichs
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Robert Lanfear
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- Yatish Turakhia
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA
- Russell Corbett-Detig
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
2
Thornlow B, Kramer A, Ye C, De Maio N, McBroome J, Hinrichs AS, Lanfear R, Turakhia Y, Corbett-Detig R. Online Phylogenetics using Parsimony Produces Slightly Better Trees and is Dramatically More Efficient for Large SARS-CoV-2 Phylogenies than de novo and Maximum-Likelihood Approaches. bioRxiv 2022:2021.12.02.471004. [PMID: 35611334 PMCID: PMC9128781 DOI: 10.1101/2021.12.02.471004] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/24/2022]
Abstract
Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 datasets do not fit this mould. There are currently over 10 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) methods are more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger datasets. Here, we evaluate the performance of de novo and online phylogenetic approaches, and ML and MP frameworks, for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimizations produce more accurate SARS-CoV-2 phylogenies than do ML optimizations. 
Since MP is thousands of times faster than presently available implementations of ML, and online phylogenetics is faster than de novo, we therefore propose that, in the context of comprehensive genomic epidemiology of SARS-CoV-2, MP online phylogenetics approaches should be favored.
Affiliation(s)
- Bryan Thornlow
  - Department of Biomolecular Engineering, University of California, Santa Cruz; Santa Cruz, CA 95064, USA
  - Genomics Institute, University of California, Santa Cruz; Santa Cruz, CA 95064, USA
- Alexander Kramer
  - Department of Biomolecular Engineering, University of California, Santa Cruz; Santa Cruz, CA 95064, USA
  - Genomics Institute, University of California, Santa Cruz; Santa Cruz, CA 95064, USA
- Cheng Ye
  - Department of Electrical and Computer Engineering, University of California, San Diego; San Diego, CA 92093, USA
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus; Cambridge CB10 1SD, UK
- Jakob McBroome
  - Department of Biomolecular Engineering, University of California, Santa Cruz; Santa Cruz, CA 95064, USA
  - Genomics Institute, University of California, Santa Cruz; Santa Cruz, CA 95064, USA
- Angie S. Hinrichs
  - Genomics Institute, University of California, Santa Cruz; Santa Cruz, CA 95064, USA
- Robert Lanfear
  - Department of Ecology and Evolution, Research School of Biology, Australian National University; Canberra, ACT 2601, Australia
- Yatish Turakhia
  - Department of Electrical and Computer Engineering, University of California, San Diego; San Diego, CA 92093, USA
- Russell Corbett-Detig
  - Department of Biomolecular Engineering, University of California, Santa Cruz; Santa Cruz, CA 95064, USA
  - Genomics Institute, University of California, Santa Cruz; Santa Cruz, CA 95064, USA
3
Sánchez-Reyes LL, Kandziora M, McTavish EJ. Physcraper: a Python package for continually updated phylogenetic trees using the Open Tree of Life. BMC Bioinformatics 2021; 22:355. [PMID: 34187366 PMCID: PMC8244228 DOI: 10.1186/s12859-021-04274-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/08/2021] [Accepted: 06/16/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Phylogenies are a key part of research in many areas of biology. Tools that automate some parts of the process of phylogenetic reconstruction, mainly molecular character matrix assembly, have been developed for the benefit of both specialists in the field of phylogenetics and non-specialists. However, interpreting results, comparing them with previously available phylogenetic hypotheses, and selecting one phylogeny for downstream analyses and discussion remain difficult for anyone who is not a specialist in either phylogenetic methods or a particular study group. RESULTS: Physcraper is a command-line Python program that automates the update of published phylogenies by adding public DNA sequences to underlying alignments of previously published phylogenies. It also provides a framework for straightforward comparison of published phylogenies with their updated versions, by leveraging tools from the Open Tree of Life project to link taxonomic information across databases. The program can be used by the nonspecialist, as a tool to generate phylogenetic hypotheses based on publicly available expert phylogenetic knowledge. Phylogeneticists and taxonomic group specialists will find it useful as a tool to facilitate molecular dataset gathering and comparison of alternative phylogenetic hypotheses (topologies). CONCLUSION: The Physcraper workflow showcases the benefits of doing open science for phylogenetics, encouraging researchers to strive for better scientific sharing practices. Physcraper can be used with any OS and is released under an open-source license. Detailed instructions for installation and usage are available at https://physcraper.readthedocs.io.
Affiliation(s)
- Martha Kandziora
  - School of Natural Sciences, University of California, Merced, USA
  - Department of Botany, Faculty of Science, Charles University, Prague, Czech Republic
4
Hu D, Liu B, Wang L, Reeves PR. Living Trees: High-Quality Reproducible and Reusable Construction of Bacterial Phylogenetic Trees. Mol Biol Evol 2020; 37:563-575. [PMID: 31633785 DOI: 10.1093/molbev/msz241] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 01/08/2023]
Abstract
An ideal bacterial phylogenetic tree accurately retraces evolutionary history and accurately incorporates mutational, recombination and other events on the appropriate branches. Current strain-level bacterial phylogenetic analysis based on large numbers of genomes lacks reliability and resolution, and is hard to replicate, confirm and reuse, because of the highly divergent nature of microbial genomes. We present SNPs and Recombination Events Tree (SaRTree), a pipeline using six "living trees" modules that addresses problems arising from the high numbers and variable quality of bacterial genome sequences. It provides for reuse of the tree and offers a major step toward global standardization of phylogenetic analysis by generating deposit files including all steps involved in phylogenetic inference. The tree itself is a "living tree" that can be extended by addition of more sequences, or the deposit can be used to vary the programs or parameters used, to assess the effect of such changes. This approach will allow phylogeny papers to meet the traditional responsibility of providing data and analysis that can be repeated and critically evaluated by others. We used the Acinetobacter baumannii global clone I to illustrate use of SaRTree to optimize tree resolution. An Escherichia coli tree was built from 351 sequences selected from 11,162 genome sequences, with the others added back onto well-defined branches, to show how this facility can greatly improve the outcomes from genome sequencing. SaRTree is designed for prokaryote strain-level analysis but could be adapted for other usage.
Affiliation(s)
- Dalong Hu
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
- Bin Liu
  - TEDA Institute of Biological Sciences and Biotechnology, Nankai University, Tianjin Economic-Technological Development Area, Tianjin, People's Republic of China
  - Tianjin Research Center for Functional Genomics and Biochip, Tianjin, People's Republic of China
- Lei Wang
  - TEDA Institute of Biological Sciences and Biotechnology, Nankai University, Tianjin Economic-Technological Development Area, Tianjin, People's Republic of China
  - Ministry of Education, The Key Laboratory of Molecular Microbiology and Technology, Tianjin, People's Republic of China
  - Tianjin Key Laboratory of Microbial Functional Genomics, Tianjin, People's Republic of China
  - State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, People's Republic of China
- Peter R Reeves
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
5
Gill MS, Lemey P, Suchard MA, Rambaut A, Baele G. Online Bayesian Phylodynamic Inference in BEAST with Application to Epidemic Reconstruction. Mol Biol Evol 2020; 37:1832-1842. [PMID: 32101295 PMCID: PMC7253210 DOI: 10.1093/molbev/msaa047] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 12/15/2022]
Abstract
Reconstructing pathogen dynamics from genetic data as they become available during an outbreak or epidemic represents an important statistical scenario in which observations arrive sequentially in time and one is interested in performing inference in an "online" fashion. Widely used Bayesian phylogenetic inference packages are not set up for this purpose, generally requiring one to recompute trees and evolutionary model parameters de novo when new data arrive. To accommodate increasing data flow in a Bayesian phylogenetic framework, we introduce a methodology to efficiently update the posterior distribution with newly available genetic data. Our procedure is implemented in the BEAST 1.10 software package, and relies on a distance-based measure to insert new taxa into the current estimate of the phylogeny and imputes plausible values for new model parameters to accommodate growing dimensionality. This augmentation creates informed starting values and re-uses optimally tuned transition kernels for posterior exploration of growing data sets, reducing the time necessary to converge to target posterior distributions. We apply our framework to data from the recent West African Ebola virus epidemic and demonstrate a considerable reduction in time required to obtain posterior estimates at different time points of the outbreak. Beyond epidemic monitoring, this framework easily finds other applications within the phylogenetics community, where changes in the data (in terms of alignment changes, sequence addition or removal) present common scenarios that can benefit from online inference.
Affiliation(s)
- Mandev S Gill
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Marc A Suchard
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA
  - Department of Biostatistics, School of Public Health, University of California, Los Angeles, CA
  - Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, CA
- Andrew Rambaut
  - Institute of Evolutionary Biology, University of Edinburgh, United Kingdom
  - Fogarty International Center, National Institutes of Health, Bethesda, MD
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
6
Fang Y, Liu C, Lin J, Li X, Alavian KN, Yang Y, Niu Y. PhySpeTree: an automated pipeline for reconstructing phylogenetic species trees. BMC Evol Biol 2019; 19:219. [PMID: 31791235 PMCID: PMC6889546 DOI: 10.1186/s12862-019-1541-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/19/2019] [Accepted: 11/13/2019] [Indexed: 02/05/2023]
Abstract
Background: Phylogenetic species trees are widely used in inferring evolutionary relationships. Existing software and algorithms mainly focus on phylogenetic inference. However, less attention has been paid to intermediate steps, such as processing extremely large sequences and preparing configuration files to connect multiple software packages. When the species number is large, the intermediate steps become a bottleneck that may seriously affect the efficiency of tree building. Results: Here, we present an easy-to-use pipeline named PhySpeTree to facilitate the reconstruction of species trees across bacterial, archaeal, and eukaryotic organisms. Users need only to input the abbreviations of species names; PhySpeTree prepares complex configuration files for different software, then automatically downloads genomic data, cleans sequences, and builds trees. PhySpeTree allows users to perform critical steps such as sequence alignment and tree construction by adjusting advanced options. PhySpeTree provides two parallel pipelines based on concatenated highly conserved proteins and small subunit ribosomal RNA sequences, respectively. Accessory modules, such as those for inserting new species, generating visualization configurations, and combining trees, are distributed along with PhySpeTree. Conclusions: Together with accessory modules, PhySpeTree significantly simplifies tree reconstruction. PhySpeTree is implemented in Python running on modern operating systems (Linux, macOS, and Windows). The source code is freely available with detailed documentation (https://github.com/yangfangs/physpetools).
Affiliation(s)
- Yang Fang
  - Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, People's Republic of China
- Chengcheng Liu
  - State Key Laboratory of Oral Diseases & National Clinical Research Center for Oral Diseases & Department of Periodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Jiangyi Lin
  - Wu YuZhang Honors College of Sichuan University, Chengdu, People's Republic of China
- Xufeng Li
  - Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, People's Republic of China
- Kambiz N Alavian
  - Department of Medicine, Division of Brain Sciences, Imperial College London, London, UK
  - Department of Internal Medicine, Endocrinology, Yale University, New Haven, USA
- Yi Yang
  - Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, People's Republic of China
- Yulong Niu
  - Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, People's Republic of China
7
Fourment M, Claywell BC, Dinh V, McCoy C, Matsen IV FA, Darling AE. Effective Online Bayesian Phylogenetics via Sequential Monte Carlo with Guided Proposals. Syst Biol 2018; 67:490-502. [PMID: 29186587 PMCID: PMC5920299 DOI: 10.1093/sysbio/syx090] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 06/02/2017] [Accepted: 11/20/2017] [Indexed: 11/14/2022]
Abstract
Modern infectious disease outbreak surveillance produces continuous streams of sequence data which require phylogenetic analysis as data arrives. Current software packages for Bayesian phylogenetic inference are unable to quickly incorporate new sequences as they become available, making them less useful for dynamically unfolding evolutionary stories. This limitation can be addressed by applying a class of Bayesian statistical inference algorithms called sequential Monte Carlo (SMC) to conduct online inference, wherein new data can be continuously incorporated to update the estimate of the posterior probability distribution. In this article, we describe and evaluate several different online phylogenetic sequential Monte Carlo (OPSMC) algorithms. We show that proposing new phylogenies with a density similar to the Bayesian prior suffers from poor performance, and we develop “guided” proposals that better match the proposal density to the posterior. Furthermore, we show that the simplest guided proposals can exhibit pathological behavior in some situations, leading to poor results, and that the situation can be resolved by heating the proposal density. The results demonstrate that relative to the widely used MCMC-based algorithm implemented in MrBayes, the total time required to compute a series of phylogenetic posteriors as sequences arrive can be significantly reduced by the use of OPSMC, without incurring a significant loss in accuracy.
Affiliation(s)
- Mathieu Fourment
  - ithree institute, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Vu Dinh
  - Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Connor McCoy
  - Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Aaron E Darling
  - ithree institute, University of Technology Sydney, Ultimo, NSW 2007, Australia
8
del Campo J, Kolisko M, Boscaro V, Santoferrara LF, Nenarokov S, Massana R, Guillou L, Simpson A, Berney C, de Vargas C, Brown MW, Keeling PJ, Wegener Parfrey L. EukRef: Phylogenetic curation of ribosomal RNA to enhance understanding of eukaryotic diversity and distribution. PLoS Biol 2018; 16:e2005849. [PMID: 30222734 PMCID: PMC6160240 DOI: 10.1371/journal.pbio.2005849] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Revised: 09/27/2018] [Indexed: 01/03/2023]
Abstract
Environmental sequencing has greatly expanded our knowledge of micro-eukaryotic diversity and ecology by revealing previously unknown lineages and their distribution. However, the value of these data is critically dependent on the quality of the reference databases used to assign an identity to environmental sequences. Existing databases contain errors and struggle to keep pace with rapidly changing eukaryotic taxonomy, the influx of novel diversity, and computational challenges related to assembling the high-quality alignments and trees needed for accurate characterization of lineage diversity. EukRef (eukref.org) is an ongoing community-driven initiative that addresses these challenges by bringing together taxonomists with expertise spanning the eukaryotic tree of life and microbial ecologists, who use environmental sequence data to develop reliable reference databases across the diversity of microbial eukaryotes. EukRef organizes and facilitates rigorous mining and annotation of sequence data by providing protocols, guidelines, and tools. The EukRef pipeline and tools allow users interested in a particular group of microbial eukaryotes to retrieve all sequences belonging to that group from International Nucleotide Sequence Database Collaboration (INSDC) (GenBank, the European Nucleotide Archive [ENA], or the DNA DataBank of Japan [DDBJ]), to place those sequences in a phylogenetic tree, and to curate taxonomic and environmental information for the group. We provide guidelines to facilitate the process and to standardize taxonomic annotations. The final outputs of this process are (1) a reference tree and alignment, (2) a reference sequence database, including taxonomic and environmental information, and (3) a list of putative chimeras and other artifactual sequences. These products will be useful for the broad community as they become publicly available (at eukref.org) and are shared with existing reference databases.
Affiliation(s)
- Javier del Campo
  - Department of Marine Biology and Oceanography, Institut de Ciències del Mar—CSIC, Barcelona, Catalonia, Spain
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia, Canada
- Martin Kolisko
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia, Canada
  - Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
- Vittorio Boscaro
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia, Canada
- Luciana F. Santoferrara
  - Departments of Marine Sciences & Ecology and Evolutionary Biology, University of Connecticut, Storrs, United States of America
- Serafim Nenarokov
  - Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
- Ramon Massana
  - Department of Marine Biology and Oceanography, Institut de Ciències del Mar—CSIC, Barcelona, Catalonia, Spain
- Laure Guillou
  - Sorbonne Université, CNRS, Station Biologique de Roscoff, UMR7144, Roscoff, France
- Alastair Simpson
  - Department of Biology, and Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia, Canada
- Cedric Berney
  - Sorbonne Université, CNRS, Station Biologique de Roscoff, UMR7144, Roscoff, France
- Colomban de Vargas
  - Sorbonne Université, CNRS, Station Biologique de Roscoff, UMR7144, Roscoff, France
- Matthew W. Brown
  - Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, United States of America
- Patrick J. Keeling
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia, Canada
- Laura Wegener Parfrey
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia, Canada
  - Department of Zoology, University of British Columbia, Vancouver, British Columbia, Canada
9
Borstein SR, O’Meara BC. AnnotationBustR: an R package to extract subsequences from GenBank annotations. PeerJ 2018; 6:e5179. [PMID: 30002984 PMCID: PMC6034590 DOI: 10.7717/peerj.5179] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 04/12/2017] [Accepted: 06/18/2018] [Indexed: 02/01/2023]
Abstract
BACKGROUND: DNA sequences are pivotal for a wide array of research in biology. Large sequence databases, like GenBank, provide an amazing resource to utilize DNA sequences for large scale analyses. However, many sequence records on GenBank contain more than one gene or are portions of genomes. Inconsistencies in the way genes are annotated and the numerous synonyms a single gene may be listed under provide major challenges for extracting large numbers of subsequences for comparative analysis across taxa. At present, there is no easy way to extract portions from many GenBank accessions based on annotations where gene names may vary extensively. RESULTS: The R package AnnotationBustR allows users to extract sequences based on GenBank annotations through the ACNUC retrieval system given search terms of gene synonyms and accession numbers. AnnotationBustR extracts subsequences of interest and then writes them to a FASTA file for users to employ in their research endeavors. CONCLUSION: FASTA files of extracted subsequences and accession tables generated by AnnotationBustR allow users to quickly find and extract subsequences from GenBank accessions. These sequences can then be incorporated in various analyses, like the construction of phylogenies to test a wide range of ecological and evolutionary hypotheses.
Affiliation(s)
- Samuel R. Borstein
  - Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN, USA
- Brian C. O’Meara
  - Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN, USA
10
Modha S, Thanki AS, Cotmore SF, Davison AJ, Hughes J. ViCTree: an automated framework for taxonomic classification from protein sequences. Bioinformatics 2018; 34:2195-2200. [PMID: 29474519 PMCID: PMC6022645 DOI: 10.1093/bioinformatics/bty099] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 07/27/2017] [Revised: 01/08/2018] [Accepted: 02/20/2018] [Indexed: 11/14/2022]
Abstract
Motivation: The increasing rate of submission of genetic sequences into public databases is providing a growing resource for classifying the organisms that these sequences represent. To aid viral classification, we have developed ViCTree, which automatically integrates the relevant sets of sequences in NCBI GenBank and transforms them into an interactive maximum likelihood phylogenetic tree that can be updated automatically. ViCTree incorporates ViCTreeView, which is a JavaScript-based visualization tool that enables the tree to be explored interactively in the context of pairwise distance data. Results: To demonstrate utility, ViCTree was applied to subfamily Densovirinae of family Parvoviridae. This led to the identification of six new species of insect virus. Availability and implementation: ViCTree is open-source and can be run on any Linux- or Unix-based computer or cluster. A tutorial, the documentation and the source code are available under a GPL3 license, and can be accessed at http://bioinformatics.cvr.ac.uk/victree_web/. Supplementary information: Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sejal Modha
- MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
- Anil S Thanki
- Earlham Institute, Norwich Research Park, Norwich, UK
- Andrew J Davison
- MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
- Joseph Hughes
- MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
|
11
|
Dinh V, Darling AE, Matsen IV FA. Online Bayesian Phylogenetic Inference: Theoretical Foundations via Sequential Monte Carlo. Syst Biol 2018; 67:503-517. [PMID: 29244177 PMCID: PMC5920340 DOI: 10.1093/sysbio/syx087] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 02/02/2017] [Revised: 11/08/2017] [Accepted: 11/09/2017] [Indexed: 11/29/2022] Open
Abstract
Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, is an enterprise that yields valuable evolutionary understanding of many biological systems. Bayesian phylogenetic algorithms, which approximate a posterior distribution on trees, have become a popular if computationally expensive means of doing phylogenetics. Modern data collection technologies are quickly adding new sequences to already substantial databases. With all current techniques for Bayesian phylogenetics, computation must start anew each time a sequence becomes available, making it costly to maintain an up-to-date estimate of a phylogenetic posterior. These considerations highlight the need for an online Bayesian phylogenetic method which can update an existing posterior with new sequences. Here, we provide theoretical results on the consistency and stability of methods for online Bayesian phylogenetic inference based on Sequential Monte Carlo (SMC) and Markov chain Monte Carlo. We first show a consistency result, demonstrating that the method samples from the correct distribution in the limit of a large number of particles. Next, we derive the first reported set of bounds on how phylogenetic likelihood surfaces change when new sequences are added. These bounds enable us to characterize the theoretical performance of sampling algorithms by bounding the effective sample size (ESS) with a given number of particles from below. We show that the ESS is guaranteed to grow linearly as the number of particles in an SMC sampler grows. Surprisingly, this result holds even though the dimensions of the phylogenetic model grow with each new added sequence.
Affiliation(s)
- Vu Dinh
- Department of Mathematical Sciences, University of Delaware, 312 Ewing Hall, Newark, DE 19716, USA
- Aaron E Darling
- The ithree institute, University of Technology Sydney, 15 Broadway, Ultimo NSW 2007, Australia
- Frederick A Matsen IV
- Program in Computational Biology, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109, USA
|
12
|
Smith SA, Brown JW. Constructing a broadly inclusive seed plant phylogeny. Am J Bot 2018; 105:302-314. [PMID: 29746720 DOI: 10.1002/ajb2.1019] [Citation(s) in RCA: 363] [Impact Index Per Article: 60.5] [Received: 08/08/2017] [Accepted: 10/19/2017] [Indexed: 05/03/2023]
Abstract
PREMISE OF THE STUDY Large phylogenies can help shed light on macroevolutionary patterns that inform our understanding of fundamental processes that shape the tree of life. These phylogenies also serve as tools that facilitate other systematic, evolutionary, and ecological analyses. Here we combine genetic data from public repositories (GenBank) with phylogenetic data (Open Tree of Life project) to construct a dated phylogeny for seed plants. METHODS We conducted a hierarchical clustering analysis of publicly available molecular data for major clades within the Spermatophyta. We constructed phylogenies of major clades, estimated divergence times, and incorporated data from the Open Tree of Life project, resulting in a seed plant phylogeny. We estimated diversification rates, excluding those taxa without molecular data. We also summarized topological uncertainty and data overlap for each major clade. KEY RESULTS The trees constructed for Spermatophyta consisted of 79,881 and 353,185 terminal taxa; the latter included the Open Tree of Life taxa for which we could not include molecular data from GenBank. The diversification analyses demonstrated nested patterns of rate shifts throughout the phylogeny. Data overlap and inference uncertainty show significant variation throughout and demonstrate the continued need for data collection across seed plants. CONCLUSIONS This study demonstrates a means for combining available resources to construct a dated phylogeny for plants. However, this approach is an early step and more developments are needed to add data, better incorporate underlying uncertainty, and improve resolution. The methods discussed here can also be applied to other major clades in the tree of life.
Affiliation(s)
- Stephen A Smith
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, 48109, USA
- Joseph W Brown
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, 48109, USA
|
13
|
Antonelli A, Hettling H, Condamine FL, Vos K, Nilsson RH, Sanderson MJ, Sauquet H, Scharn R, Silvestro D, Töpel M, Bacon CD, Oxelman B, Vos RA. Toward a Self-Updating Platform for Estimating Rates of Speciation and Migration, Ages, and Relationships of Taxa. Syst Biol 2017; 66:152-166. [PMID: 27616324 PMCID: PMC5410925 DOI: 10.1093/sysbio/syw066] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 11/30/2015] [Accepted: 07/19/2016] [Indexed: 01/06/2023] Open
Abstract
Rapidly growing biological data—including molecular sequences and fossils—hold an unprecedented potential to reveal how evolutionary processes generate and maintain biodiversity. However, researchers often have to develop their own idiosyncratic workflows to integrate and analyze these data for reconstructing time-calibrated phylogenies. In addition, divergence times estimated under different methods and assumptions, and based on data of various quality and reliability, should not be combined without proper correction. Here we introduce a modular framework termed SUPERSMART (Self-Updating Platform for Estimating Rates of Speciation and Migration, Ages, and Relationships of Taxa), and provide a proof of concept for dealing with the moving targets of evolutionary and biogeographical research. This framework assembles comprehensive data sets of molecular and fossil data for any taxa and infers dated phylogenies using robust species tree methods, also allowing for the inclusion of genomic data produced through next-generation sequencing techniques. We exemplify the application of our method by presenting phylogenetic and dating analyses for the mammal order Primates and for the plant family Arecaceae (palms). We believe that this framework will provide a valuable tool for a wide range of hypothesis-driven research questions in systematics, biogeography, and evolution. SUPERSMART will also accelerate the inference of a “Dated Tree of Life” where all node ages are directly comparable.
Affiliation(s)
- Alexandre Antonelli
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, SE-405 30 Göteborg, Sweden; Gothenburg Botanical Garden, Carl Skottsbergs Gata 22A, SE-41319 Göteborg, Sweden
- Hannes Hettling
- Naturalis Biodiversity Center, Darwinweg 4, 2333 CR Leiden, The Netherlands
- Fabien L Condamine
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, SE-405 30 Göteborg, Sweden; CNRS, UMR 5554 Institut des Sciences de l'Evolution (Université de Montpellier), Place Eugéne Bataillon, 34095 Montpellier, France
- Karin Vos
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, SE-405 30 Göteborg, Sweden
- R Henrik Nilsson
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, SE-405 30 Göteborg, Sweden
- Michael J Sanderson
- Department of Ecology and Evolutionary Biology, University of Arizona, 1041 E. Lowell, Tucson, AZ 85721, USA
- Hervé Sauquet
- Université Paris-Sud, Laboratoire Écologie, Systématique, Évolution, CNRS UMR 8079, 91405 Orsay, France
- Ruud Scharn
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, SE-405 30 Göteborg, Sweden
- Daniele Silvestro
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, SE-405 30 Göteborg, Sweden; Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland
- Mats Töpel
- Swedish Bioinformatics Infrastructure for Life Sciences, Department of Biological and Environmental Sciences, University of Gothenburg, Box 463, SE-405 30, Göteborg, Sweden; Department of Marine Sciences, University of Gothenburg, Box 460, SE-405 30 Göteborg, Sweden
- Christine D Bacon
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, SE-405 30 Göteborg, Sweden
- Bengt Oxelman
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, SE-405 30 Göteborg, Sweden
- Rutger A Vos
- Naturalis Biodiversity Center, Darwinweg 4, 2333 CR Leiden, The Netherlands
|
14
|
Abstract
The human microbiome is the ensemble of genes in the microbes that live inside and on the surface of humans. Because microbial sequencing information is now much easier to come by than phenotypic information, there has been an explosion of sequencing and genetic analysis of microbiome samples. Much of the analytical work for these sequences involves phylogenetics, at least indirectly, but methodology has developed in a somewhat different direction than for other applications of phylogenetics. In this article, I review the field and its methods from the perspective of a phylogeneticist, as well as describing current challenges for phylogenetics coming from this type of work.
Affiliation(s)
- Frederick A Matsen
- Program in Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, WA 91802, USA
|