1
Lähteenaro M, Benda D, Straka J, Nylander JAA, Bergsten J. Phylogenomic analysis of Stylops reveals the evolutionary history of a Holarctic Strepsiptera radiation parasitizing wild bees. Mol Phylogenet Evol 2024; 195:108068. [PMID: 38554985 DOI: 10.1016/j.ympev.2024.108068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 03/07/2024] [Accepted: 03/24/2024] [Indexed: 04/02/2024]
Abstract
Holarctic Stylops is the largest genus of the enigmatic insect order Strepsiptera, twisted winged parasites. Members of Stylops are obligate endoparasites of Andrena mining bees and exhibit extreme sexual dimorphism typical of Strepsiptera. So far, molecular studies on Stylops have focused on questions on species delimitation. Here, we utilize the power of whole genome sequencing to infer the phylogeny of this morphologically challenging genus from thousands of loci. We use a species tree method, concatenated maximum likelihood analysis and Bayesian analysis with a relaxed clock model to reconstruct the phylogeny of 46 Stylops species, estimate divergence times, evaluate topological consistency across methods and infer the root position. Furthermore, the biogeographical history and coevolutionary patterns with host species are assessed. All methods recovered a well resolved topology with close to all nodes maximally supported and only a handful of minor topological variations. Based on the result, we find that included species can be divided into 12 species groups, seven of them including only Palaearctic species, three Nearctic and two were geographically mixed. We find a strongly supported root position between a clade formed by the spreta, thwaitesi and gwynanae species groups and the remaining species and that the sister group of Stylops is Eurystylops or Eurystylops + Kinzelbachus. Our results indicate that Stylops originated in the Western Palaearctic or Western Palaearctic and Nearctic in the early Neogene or late Paleogene, with four independent dispersal events to the Nearctic. Cophylogenetic analyses indicate that the diversification of Stylops has been shaped by both significant coevolution with the mining bee hosts and host-shifting. The well resolved and strongly supported phylogeny will provide a valuable phylogenetic basis for further studies into the fascinating world of Strepsipterans.
Affiliation(s)
- Meri Lähteenaro
  - Department of Zoology, Swedish Museum of Natural History, P. O. Box 50007, SE-104 05 Stockholm, Sweden; Department of Zoology, Faculty of Science, Stockholm University, SE-106 91 Stockholm, Sweden.
- Daniel Benda
  - Department of Zoology, Faculty of Science, Charles University, Vinicna 7, CZ-128 44, Prague 2, Czech Republic; Department of Entomology, National Museum of the Czech Republic, Cirkusová 1740, CZ-19300 Prague 9, Czech Republic.
- Jakub Straka
  - Department of Zoology, Faculty of Science, Charles University, Vinicna 7, CZ-128 44, Prague 2, Czech Republic.
- Johan A A Nylander
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, P.O. Box 50007, SE-106 91 Stockholm, Sweden.
- Johannes Bergsten
  - Department of Zoology, Swedish Museum of Natural History, P. O. Box 50007, SE-104 05 Stockholm, Sweden; Department of Zoology, Faculty of Science, Stockholm University, SE-106 91 Stockholm, Sweden.
2
Xie N, Lin Y, Li P, Zhao J, Li J, Wang K, Yang L, Jia L, Wang Q, Li P, Song H. Simultaneous identification of DNA and RNA pathogens using metagenomic sequencing in cases of severe acute respiratory infection. J Med Virol 2024; 96:e29406. [PMID: 38373115 DOI: 10.1002/jmv.29406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Revised: 01/03/2024] [Accepted: 01/04/2024] [Indexed: 02/21/2024]
Abstract
Metagenomic next-generation sequencing (mNGS) is a valuable technique for identifying pathogens. However, conventional mNGS requires the separate processing of DNA and RNA genomes, which can be resource- and time-intensive. To mitigate these impediments, we propose a novel method called DNA/RNA cosequencing that aims to enhance the efficiency of pathogen detection. DNA/RNA cosequencing uses reverse transcription of total nucleic acids extracted from samples by using random primers, without removing DNA, and then employs mNGS. We applied this method to 85 cases of severe acute respiratory infections (SARI). Influenza virus was identified in 13 cases (H1N1: seven cases, H3N2: three cases, unclassified influenza type: three cases) and was not detected in the remaining 72 samples. Bacteria were present in all samples. Pseudomonas aeruginosa, Klebsiella pneumoniae, and Acinetobacter baumannii were detected in four influenza-positive samples, suggesting coinfections. The sensitivity and specificity for detecting influenza A virus were 73.33% and 95.92%, respectively. A κ value of 0.726 indicated a high level of concordance between the results of DNA/RNA cosequencing and SARI influenza virus monitoring. DNA/RNA cosequencing enhanced the efficiency of pathogen detection, providing a novel capability to strengthen surveillance and thereby prevent and control infectious disease outbreaks.
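The three agreement statistics quoted in this abstract (sensitivity, specificity and Cohen's κ) can all be derived from a 2×2 table comparing the test against a reference method. The following is a minimal sketch of that arithmetic. The abstract does not give the underlying table, so the counts below are hypothetical values chosen only because they produce figures close to the reported ones; they are not data from the study, and the names `Confusion` and `HYPOTHETICAL` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 table of test results against a reference method."""
    tp: int  # test positive, reference positive
    fn: int  # test negative, reference positive
    fp: int  # test positive, reference negative
    tn: int  # test negative, reference negative


def sensitivity(c: Confusion) -> float:
    # Fraction of reference-positive samples that the test also calls positive.
    return c.tp / (c.tp + c.fn)


def specificity(c: Confusion) -> float:
    # Fraction of reference-negative samples that the test also calls negative.
    return c.tn / (c.tn + c.fp)


def cohen_kappa(c: Confusion) -> float:
    # Observed agreement corrected for the agreement expected by chance.
    n = c.tp + c.fn + c.fp + c.tn
    observed = (c.tp + c.tn) / n
    expected = (
        (c.tp + c.fn) * (c.tp + c.fp)    # both methods positive by chance
        + (c.fp + c.tn) * (c.fn + c.tn)  # both methods negative by chance
    ) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical counts (an assumption, not taken from the paper).
HYPOTHETICAL = Confusion(tp=11, fn=4, fp=2, tn=47)

if __name__ == "__main__":
    print(f"sensitivity = {sensitivity(HYPOTHETICAL):.2%}")
    print(f"specificity = {specificity(HYPOTHETICAL):.2%}")
    print(f"kappa       = {cohen_kappa(HYPOTHETICAL):.3f}")
```

A κ in this range is conventionally read as substantial agreement, which is consistent with the abstract's description of a high level of concordance.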
Affiliation(s)
- Nana Xie
  - AnHui Medical University, Hefei, China
  - Chinese PLA Center for Disease Control and Prevention of PLA, Beijing, China
- Yanfeng Lin
  - Huadong Research Institute for Medicine and Biotechniques, Nanjing, China
- Peihan Li
  - Chinese PLA Center for Disease Control and Prevention of PLA, Beijing, China
- Jiachen Zhao
  - Beijing Center for Disease Prevention and Control, Beijing, China
- Jinhui Li
  - Chinese PLA Center for Disease Control and Prevention of PLA, Beijing, China
- Kaiying Wang
  - Chinese PLA Center for Disease Control and Prevention of PLA, Beijing, China
- Lang Yang
  - Chinese PLA Center for Disease Control and Prevention of PLA, Beijing, China
- Leili Jia
  - Chinese PLA Center for Disease Control and Prevention of PLA, Beijing, China
- Quanyi Wang
  - Beijing Center for Disease Prevention and Control, Beijing, China
  - Beijing Research Center for Respiratory Infectious Diseases, Beijing, China
- Peng Li
  - AnHui Medical University, Hefei, China
  - Chinese PLA Center for Disease Control and Prevention of PLA, Beijing, China
- Hongbin Song
  - AnHui Medical University, Hefei, China
  - Chinese PLA Center for Disease Control and Prevention of PLA, Beijing, China
3
Henríquez-Piskulich P, Hugall AF, Stuart-Fox D. A supermatrix phylogeny of the world's bees (Hymenoptera: Anthophila). Mol Phylogenet Evol 2024; 190:107963. [PMID: 37967640 DOI: 10.1016/j.ympev.2023.107963] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2023] [Revised: 10/28/2023] [Accepted: 11/04/2023] [Indexed: 11/17/2023]
Abstract
The increasing availability of large molecular phylogenies has provided new opportunities to study the evolution of species traits, their origins and diversification, and biogeography; yet there are limited attempts to synthesise existing phylogenetic information for major insect groups. Bees (Hymenoptera: Anthophila) are a large group of insect pollinators that have a worldwide distribution, and a wide variation in ecology, morphology, and life-history traits, including sociality. For these reasons, as well as their major economic importance as pollinators, numerous molecular phylogenetic studies of family and genus-level relationships have been published, providing an opportunity to assemble a bee 'tree-of-life'. We used publicly available genetic sequence data, including phylogenomic data, reconciled to a taxonomic database, to produce a concatenated supermatrix phylogeny for the Anthophila comprising 4,586 bee species, representing 23% of species and 82% of genera. At family, subfamily, and tribe levels, support for expected relationships was robust, but between and within some genera relationships remain uncertain. Within families, sampling of genera ranged from 67 to 100% but species coverage was lower (17-41%). Our phylogeny mostly reproduces the relationships found in recent phylogenomic studies with a few exceptions. We provide a summary of these differences and the current state of molecular data available and its gaps. We discuss the advantages and limitations of this bee supermatrix phylogeny (available online at beetreeoflife.org), which may enable new insights into long standing questions about evolutionary drivers in bees, and potentially insects more generally.
Affiliation(s)
- Andrew F Hugall
  - School of BioSciences, The University of Melbourne, Parkville, Victoria, Australia; Department of Sciences, Museums Victoria, Melbourne, Victoria, Australia.
- Devi Stuart-Fox
  - School of BioSciences, The University of Melbourne, Parkville, Victoria, Australia
4
Wedell E, Cai Y, Warnow T. SCAMPP: Scaling Alignment-Based Phylogenetic Placement to Large Trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1417-1430. [PMID: 35471888 DOI: 10.1109/tcbb.2022.3170386] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 05/04/2023]
Abstract
Phylogenetic placement, the problem of placing a "query" sequence into a precomputed phylogenetic "backbone" tree, is useful for constructing large trees, performing taxon identification of newly obtained sequences, and other applications. The most accurate current methods, such as pplacer and EPA-ng, are based on maximum likelihood and require that the query sequence be provided within a multiple sequence alignment that includes the leaf sequences in the backbone tree. This approach enables high accuracy but also makes these likelihood-based methods computationally intensive on large backbone trees, and can even lead to them failing when the backbone trees are very large (e.g., having 50,000 or more leaves). We present SCAMPP (SCaling AlignMent-based Phylogenetic Placement), a technique to extend the scalability of these likelihood-based placement methods to ultra-large backbone trees. We show that pplacer-SCAMPP and EPA-ng-SCAMPP both scale well to ultra-large backbone trees (even up to 200,000 leaves), with accuracy that improves on APPLES and APPLES-2, two recently developed fast phylogenetic placement methods that scale to ultra-large datasets. EPA-ng-SCAMPP and pplacer-SCAMPP are available at https://github.com/chry04/PLUSplacer.
5
Park M, Ivanovic S, Chu G, Shen C, Warnow T. UPP2: fast and accurate alignment of datasets with fragmentary sequences. Bioinformatics 2023; 39:6982552. [PMID: 36625535 PMCID: PMC9846425 DOI: 10.1093/bioinformatics/btad007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2022] [Revised: 12/01/2022] [Accepted: 01/09/2023] [Indexed: 01/11/2023] Open
Abstract
MOTIVATION: Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. Ultra-large multiple sequence alignment using Phylogeny-aware Profiles (UPP) is a method for MSA estimation that builds an ensemble of Hidden Markov Models (eHMM) to represent an estimated alignment on the full-length sequences in the input, and then adds the remaining sequences into the alignment using selected HMMs in the ensemble. Although UPP provides good accuracy, it is computationally intensive on large datasets.
RESULTS: We present UPP2, a direct improvement on UPP. The main advance is a fast technique for selecting HMMs in the ensemble that allows us to achieve the same accuracy as UPP but with greatly reduced runtime. We show that UPP2 produces more accurate alignments compared to leading MSA methods on datasets exhibiting substantial sequence length heterogeneity and is among the most accurate otherwise.
AVAILABILITY AND IMPLEMENTATION: https://github.com/gillichu/sepp.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Minhyuk Park
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA
- Stefan Ivanovic
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA
- Gillian Chu
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA
- Chengze Shen
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA
6
Park M, Warnow T. HMMerge: an ensemble method for multiple sequence alignment. Bioinformatics Advances 2023; 3:vbad052. [PMID: 37128578 PMCID: PMC10148686 DOI: 10.1093/bioadv/vbad052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Revised: 04/06/2023] [Accepted: 04/13/2023] [Indexed: 05/03/2023]
Abstract
Motivation: Despite advances in method development for multiple sequence alignment over the last several decades, the alignment of datasets exhibiting substantial sequence length heterogeneity, especially when the input sequences include very short sequences (either as a result of sequencing technologies or of large deletions during evolution), remains an inadequately solved problem.
Results: We present HMMerge, a method to compute an alignment of datasets exhibiting high sequence length heterogeneity, or to add short sequences into a given 'backbone' alignment. HMMerge builds on the technique from its predecessor alignment methods, UPP and WITCH, which build an ensemble of profile HMMs to represent the backbone alignment and add the remaining sequences into the backbone alignment using the ensemble. HMMerge differs from UPP and WITCH by building a new 'merged' HMM from the ensemble, and then using that merged HMM to align the query sequences. We show that HMMerge is competitive with WITCH, with an advantage over WITCH when adding very short sequences into backbone alignments.
Availability and implementation: HMMerge is freely available at https://github.com/MinhyukPark/HMMerge.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Minhyuk Park
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
7
Zaharias P, Warnow T. Recent progress on methods for estimating and updating large phylogenies. Philos Trans R Soc Lond B Biol Sci 2022; 377:20210244. [PMID: 35989607 PMCID: PMC9393559 DOI: 10.1098/rstb.2021.0244] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/12/2021] [Accepted: 01/07/2022] [Indexed: 12/20/2022] Open
Abstract
With the increased availability of sequence data and even of fully sequenced and assembled genomes, phylogeny estimation of very large trees (even of hundreds of thousands of sequences) is now a goal for some biologists. Yet, the construction of these phylogenies is a complex pipeline presenting analytical and computational challenges, especially when the number of sequences is very large. In the past few years, new methods have been developed that aim to enable highly accurate phylogeny estimations on these large datasets, including divide-and-conquer techniques for multiple sequence alignment and/or tree estimation, methods that can estimate species trees from multi-locus datasets while addressing heterogeneity due to biological processes (e.g. incomplete lineage sorting and gene duplication and loss), and methods to add sequences into large gene trees or species trees. Here we present some of these recent advances and discuss opportunities for future improvements. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
Affiliation(s)
- Paul Zaharias
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
8
Shen C, Park M, Warnow T. WITCH: Improved Multiple Sequence Alignment Through Weighted Consensus Hidden Markov Model Alignment. J Comput Biol 2022; 29:782-801. [PMID: 35575747 DOI: 10.1089/cmb.2021.0585] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 01/19/2023] Open
Abstract
Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve under high rates of evolution, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers "full-length," represents this "backbone alignment" using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses k>1 HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments.
Affiliation(s)
- Chengze Shen
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Minhyuk Park
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
9
Zhu Q, Mirarab S. Assembling a Reference Phylogenomic Tree of Bacteria and Archaea by Summarizing Many Gene Phylogenies. Methods Mol Biol 2022; 2569:137-165. [PMID: 36083447 DOI: 10.1007/978-1-0716-2691-7_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/24/2023]
Abstract
Phylogenomics is the inference of phylogenetic trees based on multiple marker genes sampled in the genomes of interest. An important challenge in phylogenomics is the potential incongruence among the evolutionary histories of individual genes, which can be widespread in microorganisms due to the prevalence of horizontal gene transfer. This protocol introduces the procedures for building a phylogenetic tree of a large number of microbial genomes using a broad sampling of marker genes that are representative of whole-genome evolution. The protocol highlights the use of a gene tree summary method, which can effectively reconstruct the species tree while accounting for the topological conflicts among individual gene trees. The pipeline described in this protocol is scalable to tens of thousands of genomes while retaining high accuracy. We discussed multiple software tools, libraries, and scripts to enable convenient adoption of the protocol. The protocol is suitable for microbiology and microbiome studies based on public genomes and metagenomic data.
Affiliation(s)
- Qiyun Zhu
  - Biodesign Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, AZ, USA.
  - School of Life Sciences, Arizona State University, Tempe, AZ, USA.
- Siavash Mirarab
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA, USA
10
Diversity of Land Snail Tribe Helicini (Gastropoda: Stylommatophora: Helicidae): Where Do We Stand after 20 Years of Sequencing Mitochondrial Markers? Diversity 2021. [DOI: 10.3390/d14010024] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023]
Abstract
Sequences of mitochondrial genes revolutionized the understanding of animal diversity and continue to be an important tool in biodiversity research. In the tribe Helicini, a prominent group of the western Palaearctic land snail fauna, mitochondrial data accumulating since the 2000s helped to newly delimit genera, inform species-level taxonomy and reconstruct past range dynamics. We combined the published data with own unpublished sequences and provide a detailed overview of what they revealed about the diversity of the group. The delimitation of Helix is revised by placing Helix godetiana back in the genus and new synonymies are suggested within the genera Codringtonia and Helix. The spatial distribution of intraspecific mitochondrial lineages of several species is shown for the first time. Comparisons between species reveal considerable variation in distribution patterns of intraspecific lineages, from broad postglacial distributions to regions with a fine-scale pattern of allopatric lineage replacement. To provide a baseline for further research and information for anyone re-using the data, we thoroughly discuss the gaps in the current dataset, focusing on both taxonomic and geographic coverage. Thanks to the wealth of data already amassed and the relative ease with which they can be obtained, mitochondrial sequences remain an important source of information on intraspecific diversity over large areas and taxa.
11
Shen C, Zaharias P, Warnow T. MAGUS+eHMMs: improved multiple sequence alignment accuracy for fragmentary sequences. Bioinformatics 2021; 38:918-924. [PMID: 34791036 PMCID: PMC8796358 DOI: 10.1093/bioinformatics/btab788] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 06/28/2021] [Revised: 10/14/2021] [Accepted: 11/12/2021] [Indexed: 02/03/2023] Open
Abstract
SUMMARY: Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, and especially when the datasets have fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that using a two-stage approach where MAGUS is used to align selected 'backbone sequences' and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models further improves alignment accuracy. The combination of MAGUS with the ensemble of eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation.
AVAILABILITY AND IMPLEMENTATION: UPP is available on https://github.com/smirarab/sepp, and MAGUS is available on https://github.com/vlasmirnov/MAGUS. MAGUS+eHMMs can be performed by running MAGUS to obtain the backbone alignment, and then using the backbone alignment as an input to UPP.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chengze Shen
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Paul Zaharias
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
12
Wertheim JO, Steel M, Sanderson MJ. Accuracy in near-perfect virus phylogenies. Syst Biol 2021; 71:426-438. [PMID: 34398231 PMCID: PMC8385947 DOI: 10.1093/sysbio/syab069] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/05/2021] [Revised: 08/05/2021] [Accepted: 08/11/2021] [Indexed: 11/26/2022] Open
Abstract
Phylogenetic trees from real-world data often include short edges with very few substitutions per site, which can lead to partially resolved trees and poor accuracy. Theory indicates that the number of sites needed to accurately reconstruct a fully resolved tree grows at a rate proportional to the inverse square of the length of the shortest edge. However, when inferred trees are partially resolved due to short edges, “accuracy” should be defined as the rate of discovering false splits (clades on a rooted tree) relative to the actual number found. Thus, accuracy can be high even if short edges are common. Specifically, in a “near-perfect” parameter space in which trees are large, the tree length ξ (the sum of all edge lengths) is small, and rate variation is minimal, the expected false positive rate is less than ξ∕3; the exact value depends on tree shape and sequence length. This expected false positive rate is far below the false negative rate for small ξ and often well below 5% even when some assumptions are relaxed. We show this result analytically for maximum parsimony and explore its extension to maximum likelihood using theory and simulations. For hypothesis testing, we show that measures of split “support” that rely on bootstrap resampling consistently imply weaker support than that implied by the false positive rates in near-perfect trees. The near-perfect parameter space closely fits several empirical studies of human virus diversification during outbreaks and epidemics, including Ebolavirus, Zika virus, and SARS-CoV-2, reflecting low substitution rates relative to high transmission/sampling rates in these viruses.
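The two quantitative statements in this abstract can be written out as follows. This is only a restatement with one line of arithmetic added, not a derivation from the paper; the symbols k (number of sites needed) and ℓ_min (length of the shortest edge) are notation introduced here for illustration, while ξ is the tree length as defined in the abstract.

```latex
% Sites needed to fully resolve the tree scale with the inverse square
% of the shortest edge, so halving that edge quadruples the requirement:
k \;\propto\; \frac{1}{\ell_{\min}^{2}}
\qquad\Longrightarrow\qquad
\frac{k(\ell_{\min}/2)}{k(\ell_{\min})} = 4

% Expected false positive rate in the near-perfect parameter space:
\mathbb{E}[\mathrm{FPR}] \;<\; \frac{\xi}{3}

% The bound stays under 5% whenever the tree length is below 0.15:
\frac{\xi}{3} < 0.05 \;\iff\; \xi < 0.15
```

For example, a tree length of ξ = 0.09 would bound the expected false positive rate below 0.03 under the stated assumptions.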
Affiliation(s)
- Joel O Wertheim
  - Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
- Mike Steel
  - Biomathematics Research Center, School of Mathematics and Statistics, University of Canterbury, Christchurch, 8041, New Zealand
- Michael J Sanderson
  - Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA
13
Zhang C, Zhao Y, Braun EL, Mirarab S. TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13696] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/22/2022]
Affiliation(s)
- Chao Zhang
  - Bioinformatics and Systems Biology Program, University of California, San Diego, CA, USA
- Yiming Zhao
  - Electrical and Computer Engineering Department, University of California, San Diego, CA, USA
- Edward L. Braun
  - Department of Biology and Genetics Institute, University of Florida, Gainesville, FL, USA
- Siavash Mirarab
  - Electrical and Computer Engineering Department, University of California, San Diego, CA, USA
14
Gupta M, Zaharias P, Warnow T. Accurate Large-scale Phylogeny-Aware Alignment using BAli-Phy. Bioinformatics 2021; 37:4677-4683. [PMID: 34320635 DOI: 10.1093/bioinformatics/btab555] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/21/2021] [Revised: 06/25/2021] [Accepted: 07/27/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: BAli-Phy, a popular Bayesian method that co-estimates multiple sequence alignments and phylogenetic trees, is a rigorous statistical method, but due to its computational requirements, it has generally been limited to relatively small datasets (at most about 100 sequences). Here we repurpose BAli-Phy as a "phylogeny-aware" alignment method: we estimate the phylogeny from the input of unaligned sequences, and then use that as a fixed tree within BAli-Phy.
RESULTS: We show that this approach achieves high accuracy, greatly superior to Prank, the current most popular phylogeny-aware alignment method, and is even more accurate than MAFFT, one of the top performing alignment methods in common use. Furthermore, this approach can be used to align very large datasets (up to 1000 sequences in this study).
AVAILABILITY: See https://doi.org/10.13012/B2IDB-7863273_V1 for datasets used in this study.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Maya Gupta
  - University of Illinois Urbana-Champaign, Urbana IL 61801, USA
- Paul Zaharias
  - University of Illinois Urbana-Champaign, Urbana IL 61801, USA
- Tandy Warnow
  - University of Illinois Urbana-Champaign, Urbana IL 61801, USA
15
Abstract
The estimation of phylogenetic trees for individual genes or multi-locus datasets is a basic part of considerable biological research. In order to enable large trees to be computed, Disjoint Tree Mergers (DTMs) have been developed; these methods operate by dividing the input sequence dataset into disjoint sets, constructing trees on each subset, and then combining the subset trees (using auxiliary information) into a tree on the full dataset. DTMs have been used to advantage for multi-locus species tree estimation, enabling highly accurate species trees at reduced computational effort, compared to leading species tree estimation methods. Here, we evaluate the feasibility of using DTMs to improve the scalability of maximum likelihood (ML) gene tree estimation to large numbers of input sequences. Our study shows distinct differences between the three selected ML codes—RAxML-NG, IQ-TREE 2, and FastTree 2—and shows that good DTM pipeline design can provide advantages over these ML codes on large datasets.