1
|
Wertheim JO, Steel M, Sanderson MJ. Accuracy in near-perfect virus phylogenies. Syst Biol 2021; 71:426-438. [PMID: 34398231 PMCID: PMC8385947 DOI: 10.1093/sysbio/syab069] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2021] [Revised: 08/05/2021] [Accepted: 08/11/2021] [Indexed: 11/26/2022] Open
Abstract
Phylogenetic trees from real-world data often include short edges with very few substitutions per site, which can lead to partially resolved trees and poor accuracy. Theory indicates that the number of sites needed to accurately reconstruct a fully resolved tree grows at a rate proportional to the inverse square of the length of the shortest edge. However, when inferred trees are partially resolved due to short edges, “accuracy” should be defined as the rate of discovering false splits (clades on a rooted tree) relative to the actual number found. Thus, accuracy can be high even if short edges are common. Specifically, in a “near-perfect” parameter space in which trees are large, the tree length ξ (the sum of all edge lengths) is small, and rate variation is minimal, the expected false positive rate is less than ξ∕3; the exact value depends on tree shape and sequence length. This expected false positive rate is far below the false negative rate for small ξ and often well below 5% even when some assumptions are relaxed. We show this result analytically for maximum parsimony and explore its extension to maximum likelihood using theory and simulations. For hypothesis testing, we show that measures of split “support” that rely on bootstrap resampling consistently imply weaker support than that implied by the false positive rates in near-perfect trees. The near-perfect parameter space closely fits several empirical studies of human virus diversification during outbreaks and epidemics, including Ebolavirus, Zika virus, and SARS-CoV-2, reflecting low substitution rates relative to high transmission/sampling rates in these viruses.
Collapse
Affiliation(s)
- Joel O Wertheim
- Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
| | - Mike Steel
- Biomathematics Research Center, School of Mathematics and Statistics, University of Canterbury, Christchurch, 8041, New Zealand
| | - Michael J Sanderson
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721 USA
| |
Collapse
|
2
|
Maldonado E, Antunes A. LMAP_S: Lightweight Multigene Alignment and Phylogeny eStimation. BMC Bioinformatics 2019; 20:739. [PMID: 31888452 PMCID: PMC6937843 DOI: 10.1186/s12859-019-3292-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2019] [Accepted: 11/26/2019] [Indexed: 01/22/2023] Open
Abstract
Background Recent advances in genome sequencing technologies and the cost drop in high-throughput sequencing continue to give rise to a deluge of data available for downstream analyses. Among others, evolutionary biologists often make use of genomic data to uncover phenotypic diversity and adaptive evolution in protein-coding genes. Therefore, multiple sequence alignments (MSA) and phylogenetic trees (PT) need to be estimated with optimal results. However, the preparation of an initial dataset of multiple sequence file(s) (MSF) and the steps involved can be challenging when considering extensive amount of data. Thus, it becomes necessary the development of a tool that removes the potential source of error and automates the time-consuming steps of a typical workflow with high-throughput and optimal MSA and PT estimations. Results We introduce LMAP_S (Lightweight Multigene Alignment and Phylogeny eStimation), a user-friendly command-line and interactive package, designed to handle an improved alignment and phylogeny estimation workflow: MSF preparation, MSA estimation, outlier detection, refinement, consensus, phylogeny estimation, comparison and editing, among which file and directory organization, execution, manipulation of information are automated, with minimal manual user intervention. LMAP_S was developed for the workstation multi-core environment and provides a unique advantage for processing multiple datasets. Our software, proved to be efficient throughout the workflow, including, the (unlimited) handling of more than 20 datasets. Conclusions We have developed a simple and versatile LMAP_S package enabling researchers to effectively estimate multiple datasets MSAs and PTs in a high-throughput fashion. LMAP_S integrates more than 25 software providing overall more than 65 algorithm choices distributed in five stages. At minimum, one FASTA file is required within a single input directory. To our knowledge, no other software combines MSA and phylogeny estimation with as many alternatives and provides means to find optimal MSAs and phylogenies. Moreover, we used a case study comparing methodologies that highlighted the usefulness of our software. LMAP_S has been developed as an open-source package, allowing its integration into more complex open-source bioinformatics pipelines. LMAP_S package is released under GPLv3 license and is freely available at https://lmap-s.sourceforge.io/.
Collapse
Affiliation(s)
- Emanuel Maldonado
- CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal
| | - Agostinho Antunes
- CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal. .,Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal.
| |
Collapse
|
3
|
Matsui M, Iwasaki W. Graph Splitting: A Graph-Based Approach for Superfamily-Scale Phylogenetic Tree Reconstruction. Syst Biol 2019; 69:265-279. [DOI: 10.1093/sysbio/syz049] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2018] [Revised: 07/09/2019] [Accepted: 07/20/2019] [Indexed: 11/12/2022] Open
Abstract
Abstract
A protein superfamily contains distantly related proteins that have acquired diverse biological functions through a long evolutionary history. Phylogenetic analysis of the early evolution of protein superfamilies is a key challenge because existing phylogenetic methods show poor performance when protein sequences are too diverged to construct an informative multiple sequence alignment (MSA). Here, we propose the Graph Splitting (GS) method, which rapidly reconstructs a protein superfamily-scale phylogenetic tree using a graph-based approach. Evolutionary simulation showed that the GS method can accurately reconstruct phylogenetic trees and be robust to major problems in phylogenetic estimation, such as biased taxon sampling, heterogeneous evolutionary rates, and long-branch attraction when sequences are substantially diverge. Its application to an empirical data set of the triosephosphate isomerase (TIM)-barrel superfamily suggests rapid evolution of protein-mediated pyrimidine biosynthesis, likely taking place after the RNA world. Furthermore, the GS method can also substantially improve performance of widely used MSA methods by providing accurate guide trees.
Collapse
Affiliation(s)
- Motomu Matsui
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, 2-11-16 Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan
| | - Wataru Iwasaki
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, 2-11-16 Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8568, Japan
- Atmosphere and Ocean Research Institute, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8564, Japan
| |
Collapse
|
4
|
Wright ES. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinformatics 2015; 16:322. [PMID: 26445311 PMCID: PMC4595117 DOI: 10.1186/s12859-015-0749-z] [Citation(s) in RCA: 213] [Impact Index Per Article: 23.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2015] [Accepted: 09/23/2015] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. RESULTS Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets. CONCLUSIONS Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets. The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the Bioconductor repository.
Collapse
Affiliation(s)
- Erik S Wright
- Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53715, USA. .,Wisconsin Institute for Discovery, University of Wisconsin-Madison, 330 N. Orchard St., Madison, WI, 53715, USA.
| |
Collapse
|
5
|
Abstract
In the tree reconciliation approach for species tree inference, a tree that has the minimum reconciliation score for given gene trees is taken as an estimate of the species tree. The scoring models used in existing tree reconciliation methods include the duplication, mutation, and deep coalescence costs. Since existing inference methods all are heuristic, their performances are often evaluated by using the Robinson-Foulds (RF) distance between the true species trees and the estimates output on simulated multi-locus datasets. To better understand these methods, we study the relationships between the duplication cost and the RF distance. We prove that the gap between the duplication cost and the RF distance is unbounded, but the symmetric duplication cost is logarithmically equivalent to the RF distance. The relationships between other reconciliation costs and the RF distance are also investigated.
Collapse
Affiliation(s)
- Yu Zheng
- Department of Mathematics, National University of Singapore, Singapore
| | - Louxin Zhang
- Department of Mathematics, National University of Singapore, Singapore
- Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore
| |
Collapse
|
6
|
Abstract
This article reviews the various models that have been used to describe the relationships between gene trees and species trees. Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can coexist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a more reliable basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution.
Collapse
Affiliation(s)
- Gergely J Szöllősi
- ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France
| | - Eric Tannier
- ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France; ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France; ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France
| | - Vincent Daubin
- ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France; ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France
| | - Bastien Boussau
- ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France; ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France;
| |
Collapse
|