1
|
Watanabe T, Nakata S, Horiike T. Influence of Long-Branch Bias on Phylogenetic Analysis. Genes Genet Syst 2025:24-00151. [PMID: 39956612 DOI: 10.1266/ggs.24-00151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/18/2025] Open
Abstract
In phylogenetic analysis, long-branch attraction (LBA) occurs when two distantly related species with longer branches are mistakenly grouped as the most closely related species. Previous research addressing this issue has focused on phylogenetic trees with four operational taxonomic units and three topologies, using two models: the Felsenstein model tree, which has two long branches that are not closely related, and the Farris tree, which has two long branches that are most closely related. For the Felsenstein model, the maximum parsimony method is more prone to estimating incorrect tree shapes compared to the maximum likelihood (ML) method, whereas in the Farris model, the opposite tendency is observed. However, the underlying reason for these differences remains unclear. Therefore, we inferred phylogenetic trees using sequence data from molecular evolution simulations of model phylogenetic trees with different long-branch lengths and measured the tree shapes and branch lengths of the obtained phylogenetic trees. Our findings revealed that the tree inference bias caused by the presence of long branches (defined as the "long-branch bias") increases with the accumulation of mutations, and influences all model tree or phylogenetic inference method. In other words, in Felsenstein tree models, methods that are highly sensitive to the long-branch bias tend to cause LBA, and in Farris tree models, the methods tend to infer apparently correct phylogenetic trees because of this influence. Thus, methods sensitive to the long-branch bias always infer the same tree shape. Additionally, the long-branch bias causes similar misestimations of branch lengths in both Felsenstein and Farris trees inferred by neighbor-joining or ML. This insight into long-branch bias will lead to a more reliable interpretation of phylogenetic trees, such as the shift of branching points, improving the accuracy of future research in molecular evolution.
Collapse
Affiliation(s)
- Tomoaki Watanabe
- Institute of Fruit Tree and Tea Science, National Agriculture and Food Research Organization
- United Graduate School of Agricultural Science, Gifu University
| | - Shohei Nakata
- Graduate School of Integrated Science and Technology, Shizuoka University
| | | |
Collapse
|
2
|
Zhu Y, Li Y, Li C, Shen XX, Zhou X. A critical evaluation of deep-learning based phylogenetic inference programs using simulated datasets. J Genet Genomics 2025:S1673-8527(25)00021-9. [PMID: 39824436 DOI: 10.1016/j.jgg.2025.01.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2024] [Revised: 01/08/2025] [Accepted: 01/09/2025] [Indexed: 01/20/2025]
Affiliation(s)
- Yixiao Zhu
- College of Agriculture and Biotechnology and Centre for Evolutionary & Organismal Biology, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Yonglin Li
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China
| | - Chuhao Li
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China
| | - Xing-Xing Shen
- College of Agriculture and Biotechnology and Centre for Evolutionary & Organismal Biology, Zhejiang University, Hangzhou, Zhejiang 310058, China.
| | - Xiaofan Zhou
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China.
| |
Collapse
|
3
|
Kulikov N, Derakhshandeh F, Mayer C. Machine learning can be as good as maximum likelihood when reconstructing phylogenetic trees and determining the best evolutionary model on four taxon alignments. Mol Phylogenet Evol 2024; 200:108181. [PMID: 39209046 DOI: 10.1016/j.ympev.2024.108181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 07/15/2024] [Accepted: 08/26/2024] [Indexed: 09/04/2024]
Abstract
Phylogenetic tree reconstruction with molecular data is important in many fields of life science research. The gold standard in this discipline is the phylogenetic tree reconstruction based on the Maximum Likelihood method. In this study, we present neural networks to predict the best model of sequence evolution and the correct topology for four sequence alignments of nucleotide or amino acid sequence data. We trained neural networks with different architectures using simulated alignments for a wide range of evolutionary models, model parameters and branch lengths. By comparing the accuracy of model and topology prediction of the trained neural networks with Maximum Likelihood and Neighbour Joining methods, we show that for quartet trees, the neural network classifier outperforms the Neighbour Joining method and is in most cases as good as the Maximum Likelihood method to infer the best model of sequence evolution and the best tree topology. These results are consistent for nucleotide and amino acid sequence data. We also show that our method is superior for model selection than previously published methods based on convolutionary networks. Furthermore, we found that neural network classifiers are much faster than the IQ-TREE implementation of the Maximum Likelihood method. Our results show that neural networks could become a true competitor for the Maximum Likelihood method in phylogenetic reconstructions.
Collapse
Affiliation(s)
- Nikita Kulikov
- Molecular Evolutionary Biology, Department of Biology, Hamburg University, Germany; Leibniz Institute for the Analysis of Biodiversity Change (LIB), Germany.
| | - Fatemeh Derakhshandeh
- Leibniz Institute for the Analysis of Biodiversity Change (LIB), Germany; Medical Faculty, Heidelberg University, Germany
| | - Christoph Mayer
- Leibniz Institute for the Analysis of Biodiversity Change (LIB), Germany
| |
Collapse
|
4
|
Wiegert J, Höhler D, Haag J, Stamatakis A. Predicting Phylogenetic Bootstrap Values via Machine Learning. Mol Biol Evol 2024; 41:msae215. [PMID: 39418337 PMCID: PMC11523138 DOI: 10.1093/molbev/msae215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2024] [Revised: 08/28/2024] [Accepted: 09/15/2024] [Indexed: 10/19/2024] Open
Abstract
Estimating the statistical robustness of the inferred tree(s) constitutes an integral part of most phylogenetic analyses. Commonly, one computes and assigns a branch support value to each inner branch of the inferred phylogeny. The still most widely used method for calculating branch support on trees inferred under maximum likelihood (ML) is the Standard, nonparametric Felsenstein bootstrap support (SBS). Due to the high computational cost of the SBS, a plethora of methods has been developed to approximate it, for instance, via the rapid bootstrap (RB) algorithm. There have also been attempts to devise faster, alternative support measures, such as the SH-aLRT (Shimodaira-Hasegawa-like approximate likelihood ratio test) or the UltraFast bootstrap 2 (UFBoot2) method. Those faster alternatives exhibit some limitations, such as the need to assess model violations (UFBoot2) or unstable behavior in the low support interval range (SH-aLRT). Here, we present the educated bootstrap guesser (EBG), a machine learning-based tool that predicts SBS branch support values for a given input phylogeny. EBG is on average 9.4 (σ=5.5) times faster than UFBoot2. EBG-based SBS estimates exhibit a median absolute error of 5 when predicting SBS values between 0 and 100. Furthermore, EBG also provides uncertainty measures for all per-branch SBS predictions and thereby allows for a more rigorous and careful interpretation. EBG can, for instance, predict SBS support values on a phylogeny comprising 1,654 SARS-CoV2 genome sequences within 3 h on a mid-class laptop. EBG is available under GNU GPL3.
Collapse
Affiliation(s)
- Julius Wiegert
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Dimitri Höhler
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Julia Haag
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology - Hellas, Heraklion, Crete, Greece
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| |
Collapse
|
5
|
Yang B, Zhou X, Liu S. Tracing the genealogy origin of geographic populations based on genomic variation and deep learning. Mol Phylogenet Evol 2024; 198:108142. [PMID: 38964594 DOI: 10.1016/j.ympev.2024.108142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2023] [Revised: 05/30/2024] [Accepted: 07/01/2024] [Indexed: 07/06/2024]
Abstract
Assigning a query individual animal or plant to its derived population is a prime task in diverse applications related to organismal genealogy. Such endeavors have conventionally relied on short DNA sequences under a phylogenetic framework. These methods naturally show constraints when the inferred population sources are ambiguously phylogenetically structured, a scenario demanding substantially more informative genetic signals. Recent advances in cost-effective production of whole-genome sequences and artificial intelligence have created an unprecedented opportunity to trace the population origin for essentially any given individual, as long as the genome reference data are comprehensive and standardized. Here, we developed a convolutional neural network method to identify population origins using genomic SNPs. Three empirical datasets (an Asian honeybee, a red fire ant, and a chicken datasets) and two simulated populations are used for the proof of concepts. The performance tests indicate that our method can accurately identify the genealogy origin of query individuals, with success rates ranging from 93 % to 100 %. We further showed that the accuracy of the model can be significantly increased by refining the informative sites through FST filtering. Our method is robust to configurations related to batch sizes and epochs, whereas model learning benefits from the setting of a proper preset learning rate. Moreover, we explained the importance score of key sites for algorithm interpretability and credibility, which has been largely ignored. We anticipate that by coupling genomics and deep learning, our method will see broad potential in conservation and management applications that involve natural resources, invasive pests and weeds, and illegal trades of wildlife products.
Collapse
Affiliation(s)
- Bing Yang
- Department of Entomology, China Agricultural University, Beijing 100193, China
| | - Xin Zhou
- Department of Entomology, China Agricultural University, Beijing 100193, China.
| | - Shanlin Liu
- Department of Entomology, China Agricultural University, Beijing 100193, China; Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.
| |
Collapse
|
6
|
Suvorov A, Schrider DR. Reliable estimation of tree branch lengths using deep neural networks. PLoS Comput Biol 2024; 20:e1012337. [PMID: 39102450 PMCID: PMC11326709 DOI: 10.1371/journal.pcbi.1012337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2023] [Revised: 08/15/2024] [Accepted: 07/18/2024] [Indexed: 08/07/2024] Open
Abstract
A phylogenetic tree represents hypothesized evolutionary history for a set of taxa. Besides the branching patterns (i.e., tree topology), phylogenies contain information about the evolutionary distances (i.e. branch lengths) between all taxa in the tree, which include extant taxa (external nodes) and their last common ancestors (internal nodes). During phylogenetic tree inference, the branch lengths are typically co-estimated along with other phylogenetic parameters during tree topology space exploration. There are well-known regions of the branch length parameter space where accurate estimation of phylogenetic trees is especially difficult. Several novel studies have recently demonstrated that machine learning approaches have the potential to help solve phylogenetic problems with greater accuracy and computational efficiency. In this study, as a proof of concept, we sought to explore the possibility of machine learning models to predict branch lengths. To that end, we designed several deep learning frameworks to estimate branch lengths on fixed tree topologies from multiple sequence alignments or its representations. Our results show that deep learning methods can exhibit superior performance in some difficult regions of branch length parameter space. For example, in contrast to maximum likelihood inference, which is typically used for estimating branch lengths, deep learning methods are more efficient and accurate. In general, we find that our neural networks achieve similar accuracy to a Bayesian approach and are the best-performing methods when inferring long branches that are associated with distantly related taxa. Together, our findings represent a next step toward accurate, fast, and reliable phylogenetic inference with machine learning approaches.
Collapse
Affiliation(s)
- Anton Suvorov
- Department of Biological Sciences, Virginia Tech, Blacksburg, Virginia, United States of America
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
| | - Daniel R Schrider
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
| |
Collapse
|
7
|
Mo YK, Hahn MW, Smith ML. Applications of machine learning in phylogenetics. Mol Phylogenet Evol 2024; 196:108066. [PMID: 38565358 DOI: 10.1016/j.ympev.2024.108066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2023] [Revised: 02/16/2024] [Accepted: 03/21/2024] [Indexed: 04/04/2024]
Abstract
Machine learning has increasingly been applied to a wide range of questions in phylogenetic inference. Supervised machine learning approaches that rely on simulated training data have been used to infer tree topologies and branch lengths, to select substitution models, and to perform downstream inferences of introgression and diversification. Here, we review how researchers have used several promising machine learning approaches to make phylogenetic inferences. Despite the promise of these methods, several barriers prevent supervised machine learning from reaching its full potential in phylogenetics. We discuss these barriers and potential paths forward. In the future, we expect that the application of careful network designs and data encodings will allow supervised machine learning to accommodate the complex processes that continue to confound traditional phylogenetic methods.
Collapse
Affiliation(s)
- Yu K Mo
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
| | - Matthew W Hahn
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA; Department of Biology, Indiana University, Bloomington, IN 47405, USA
| | - Megan L Smith
- Department of Biological Sciences, Mississippi State University, Starkville, MS 39762, USA.
| |
Collapse
|
8
|
Leuchtenberger AF, von Haeseler A. Learning From an Artificial Neural Network in Phylogenetics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:278-288. [PMID: 38198267 DOI: 10.1109/tcbb.2024.3352268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2024]
Abstract
We show that an iterative ansatz of deep learning and human intelligence guided simplification may lead to surprisingly simple solutions for a difficult problem in phylogenetics. Distinguishing Farris and Felsenstein trees is a longstanding problem in phylogenetic tree reconstruction. The Artificial Neural Network F-zoneNN solves this problem for 4-taxon alignments evolved under the Jukes-Cantor model. It distinguishes between Farris and Felsenstein trees, but owing to its complexity, lacks transparency in its mechanism of discernment. Based on the simplification of F-zoneNN and alignment properties we constructed the function FarFelDiscerner. In contrast to F-zoneNN, FarFelDiscerner's decision process is understandable. Moreover, FarFelDiscerner is significantly simpler than F-zoneNN. Despite its simplicity this function infers the tree-type almost perfectly on noise-free data, and also performs well on simulated noisy alignments of finite length. We applied FarFelDiscerner to the historical Holometabola alignments where it places Strepsiptera with beetles, concordant with the current scientific view.
Collapse
|
9
|
Tang X, Zepeda-Nuñez L, Yang S, Zhao Z, Solís-Lemus C. Novel symmetry-preserving neural network model for phylogenetic inference. BIOINFORMATICS ADVANCES 2024; 4:vbae022. [PMID: 38638281 PMCID: PMC11026143 DOI: 10.1093/bioadv/vbae022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Revised: 01/29/2024] [Accepted: 02/17/2024] [Indexed: 04/20/2024]
Abstract
Motivation Scientists world-wide are putting together massive efforts to understand how the biodiversity that we see on Earth evolved from single-cell organisms at the origin of life and this diversification process is represented through the Tree of Life. Low sampling rates and high heterogeneity in the rate of evolution across sites and lineages produce a phenomenon denoted "long branch attraction" (LBA) in which long nonsister lineages are estimated to be sisters regardless of their true evolutionary relationship. LBA has been a pervasive problem in phylogenetic inference affecting different types of methodologies from distance-based to likelihood-based. Results Here, we present a novel neural network model that outperforms standard phylogenetic methods and other neural network implementations under LBA settings. Furthermore, unlike existing neural network models in phylogenetics, our model naturally accounts for the tree isomorphisms via permutation invariant functions which ultimately result in lower memory and allows the seamless extension to larger trees. Availability and implementation We implement our novel theory on an open-source publicly available GitHub repository: https://github.com/crsl4/nn-phylogenetics.
Collapse
Affiliation(s)
- Xudong Tang
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI 53706, United States
- Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, United States
| | - Leonardo Zepeda-Nuñez
- Department of Mathematics, University of Wisconsin-Madison, Madison, WI 53706, United States
| | - Shengwen Yang
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI 53706, United States
- Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, United States
| | - Zelin Zhao
- Department of Mathematics, University of Wisconsin-Madison, Madison, WI 53706, United States
| | - Claudia Solís-Lemus
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI 53706, United States
- Department of Plant Pathology, University of Wisconsin-Madison, Madison, WI 53706, United States
| |
Collapse
|
10
|
Burgstaller-Muehlbacher S, Crotty SM, Schmidt HA, Reden F, Drucks T, von Haeseler A. ModelRevelator: Fast phylogenetic model estimation via deep learning. Mol Phylogenet Evol 2023; 188:107905. [PMID: 37595933 DOI: 10.1016/j.ympev.2023.107905] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2023] [Accepted: 08/12/2023] [Indexed: 08/20/2023]
Abstract
Selecting the best model of sequence evolution for a multiple-sequence-alignment (MSA) constitutes the first step of phylogenetic tree reconstruction. Common approaches for inferring nucleotide models typically apply maximum likelihood (ML) methods, with discrimination between models determined by one of several information criteria. This requires tree reconstruction and optimisation which can be computationally expensive. We demonstrate that neural networks can be used to perform model selection, without the need to reconstruct trees, optimise parameters, or calculate likelihoods. We introduce ModelRevelator, a model selection tool underpinned by two deep neural networks. The first neural network, NNmodelfind, recommends one of six commonly used models of sequence evolution, ranging in complexity from Jukes and Cantor to General Time Reversible. The second, NNalphafind, recommends whether or not a Γ-distributed rate heterogeneous model should be incorporated, and if so, provides an estimate of the shape parameter, ɑ. Users can simply input an MSA into ModelRevelator, and swiftly receive output recommending the evolutionary model, inclusive of the presence or absence of rate heterogeneity, and an estimate of ɑ. We show that ModelRevelator performs comparably with likelihood-based methods and the recently published machine learning method ModelTeller over a wide range of parameter settings, with significant potential savings in computational effort. Further, we show that this performance is not restricted to the alignments on which the networks were trained, but is maintained even on unseen empirical data. We expect that ModelRevelator will provide a valuable alternative for phylogeneticists, especially where traditional methods of model selection are computationally prohibitive.
Collapse
Affiliation(s)
- Sebastian Burgstaller-Muehlbacher
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria.
| | - Stephen M Crotty
- School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia; ARC Centre of Excellence for Mathematical and Statistical Frontiers, University of Adelaide, Adelaide, SA 5005, Australia
| | - Heiko A Schmidt
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria
| | - Franziska Reden
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria
| | - Tamara Drucks
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria; Research Unit Machine Learning, TU Wien, 1040 Vienna, Austria
| | - Arndt von Haeseler
- Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria; Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Währinger Straße 29, 1090 Vienna, Austria
| |
Collapse
|
11
|
Köksal Z, Børsting C, Gusmão L, Pereira V. SNPtotree-Resolving the Phylogeny of SNPs on Non-Recombining DNA. Genes (Basel) 2023; 14:1837. [PMID: 37895186 PMCID: PMC10606150 DOI: 10.3390/genes14101837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2023] [Revised: 09/18/2023] [Accepted: 09/21/2023] [Indexed: 10/29/2023] Open
Abstract
Genetic variants on non-recombining DNA and the hierarchical order in which they accumulate are commonly of interest. This variant hierarchy can be established and combined with information on the population and geographic origin of the individuals carrying the variants to find population structures and infer migration patterns. Further, individuals can be assigned to the characterized populations, which is relevant in forensic genetics, genetic genealogy, and epidemiologic studies. However, there is currently no straightforward method to obtain such a variant hierarchy. Here, we introduce the software SNPtotree v1.0, which uniquely determines the hierarchical order of variants on non-recombining DNA without error-prone manual sorting. The algorithm uses pairwise variant comparisons to infer their relationships and integrates the combined information into a phylogenetic tree. Variants that have contradictory pairwise relationships or ambiguous positions in the tree are removed by the software. When benchmarked using two human Y-chromosomal massively parallel sequencing datasets, SNPtotree outperforms traditional methods in the accuracy of phylogenetic trees for sequencing data with high amounts of missing information. The phylogenetic trees of variants created using SNPtotree can be used to establish and maintain publicly available phylogeny databases to further explore genetic epidemiology and genealogy, as well as population and forensic genetics.
Collapse
Affiliation(s)
- Zehra Köksal
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, 2100 Copenhagen, Denmark; (Z.K.); (C.B.)
| | - Claus Børsting
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, 2100 Copenhagen, Denmark; (Z.K.); (C.B.)
| | - Leonor Gusmão
- DNA Diagnostic Laboratory (LDD), State University of Rio de Janeiro (UERJ), Rio de Janeiro 20550-013, Brazil;
| | - Vania Pereira
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, 2100 Copenhagen, Denmark; (Z.K.); (C.B.)
| |
Collapse
|
12
|
Ly-Trong N, Barca GMJ, Minh BQ. AliSim-HPC: parallel sequence simulator for phylogenetics. Bioinformatics 2023; 39:btad540. [PMID: 37656933 PMCID: PMC10534053 DOI: 10.1093/bioinformatics/btad540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2022] [Revised: 08/16/2023] [Accepted: 08/31/2023] [Indexed: 09/03/2023] Open
Abstract
MOTIVATION Sequence simulation plays a vital role in phylogenetics with many applications, such as evaluating phylogenetic methods, testing hypotheses, and generating training data for machine-learning applications. We recently introduced a new simulator for multiple sequence alignments called AliSim, which outperformed existing tools. However, with the increasing demands of simulating large data sets, AliSim is still slow due to its sequential implementation; for example, to simulate millions of sequence alignments, AliSim took several days or weeks. Parallelization has been used for many phylogenetic inference methods but not yet for sequence simulation. RESULTS This paper introduces AliSim-HPC, which, for the first time, employs high-performance computing for phylogenetic simulations. AliSim-HPC parallelizes the simulation process at both multi-core and multi-CPU levels using the OpenMP and message passing interface (MPI) libraries, respectively. AliSim-HPC is highly efficient and scalable, which reduces the runtime to simulate 100 large gap-free alignments (30 000 sequences of one million sites) from over one day to 11 min using 256 CPU cores from a cluster with six computing nodes, a 153-fold speedup. While the OpenMP version can only simulate gap-free alignments, the MPI version supports insertion-deletion models like the sequential AliSim. AVAILABILITY AND IMPLEMENTATION AliSim-HPC is open-source and available as part of the new IQ-TREE version v2.2.3 at https://github.com/iqtree/iqtree2/releases with a user manual at http://www.iqtree.org/doc/AliSim.
Collapse
Affiliation(s)
- Nhan Ly-Trong
- School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
| | - Giuseppe M J Barca
- School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
| | - Bui Quang Minh
- School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
| |
Collapse
|
13
|
Szánthó LL, Lartillot N, Szöllősi GJ, Schrempf D. Compositionally Constrained Sites Drive Long-Branch Attraction. Syst Biol 2023; 72:767-780. [PMID: 36946562 PMCID: PMC10405358 DOI: 10.1093/sysbio/syad013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2022] [Revised: 03/01/2023] [Accepted: 03/16/2023] [Indexed: 03/23/2023] Open
Abstract
Accurate phylogenies are fundamental to our understanding of the pattern and process of evolution. Yet, phylogenies at deep evolutionary timescales, with correspondingly long branches, have been fraught with controversy resulting from conflicting estimates from models with varying complexity and goodness of fit. Analyses of historical as well as current empirical datasets, such as alignments including Microsporidia, Nematoda, or Platyhelminthes, have demonstrated that inadequate modeling of across-site compositional heterogeneity, which is the result of biochemical constraints that lead to varying patterns of accepted amino acids along sequences, can lead to erroneous topologies that are strongly supported. Unfortunately, models that adequately account for across-site compositional heterogeneity remain computationally challenging or intractable for an increasing fraction of contemporary datasets. Here, we introduce "compositional constraint analysis," a method to investigate the effect of site-specific constraints on amino acid composition on phylogenetic inference. We show that more constrained sites with lower diversity and less constrained sites with higher diversity exhibit ostensibly conflicting signals under models ignoring across-site compositional heterogeneity that lead to long-branch attraction artifacts and demonstrate that more complex models accounting for across-site compositional heterogeneity can ameliorate this bias. We present CAT-posterior mean site frequencies (PMSF), a pipeline for diagnosing and resolving phylogenetic bias resulting from inadequate modeling of across-site compositional heterogeneity based on the CAT model. CAT-PMSF is robust against long-branch attraction in all alignments we have examined. We suggest using CAT-PMSF when convergence of the CAT model cannot be assured. We find evidence that compositionally constrained sites are driving long-branch attraction in two metazoan datasets and recover evidence for Porifera as the sister group to all other animals. [Animal phylogeny; cross-site heterogeneity; long-branch attraction; phylogenomics.].
Collapse
Affiliation(s)
- Lénárd L Szánthó
- Department of Biological Physics, Eötvös University, Budapest, Hungary
- ELTE-MTA “Lendület” Evolutionary Genomics Research Group, Budapest, Hungary
- Institute of Evolution, Centre for Ecological Research, Budapest, Hungary
| | - Nicolas Lartillot
- Laboratoire de Biométrie et Biologie Evolutive UMR 5558, CNRS, Université de Lyon, Villeurbanne, France
| | - Gergely J Szöllősi
- Department of Biological Physics, Eötvös University, Budapest, Hungary
- ELTE-MTA “Lendület” Evolutionary Genomics Research Group, Budapest, Hungary
- Institute of Evolution, Centre for Ecological Research, Budapest, Hungary
| | - Dominik Schrempf
- Department of Biological Physics, Eötvös University, Budapest, Hungary
| |
Collapse
|
14
|
Smith BT, Merwin J, Provost KL, Thom G, Brumfield RT, Ferreira M, Mauck Iii WM, Moyle RG, Wright T, Joseph L. Phylogenomic analysis of the parrots of the world distinguishes artifactual from biological sources of gene tree discordance. Syst Biol 2022; 72:228-241. [PMID: 35916751 DOI: 10.1093/sysbio/syac055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2021] [Revised: 02/22/2022] [Accepted: 07/22/2022] [Indexed: 11/14/2022] Open
Abstract
Gene tree discordance is expected in phylogenomic trees and biological processes are often invoked to explain it. However, heterogeneous levels of phylogenetic signal among individuals within datasets may cause artifactual sources of topological discordance. We examined how the information content in tips and subclades impacts topological discordance in the parrots (Order: Psittaciformes), a diverse and highly threatened clade of nearly 400 species. Using ultraconserved elements from 96% of the clade's species-level diversity, we estimated concatenated and species trees for 382 ingroup taxa. We found that discordance among tree topologies was most common at nodes dating between the late Miocene and Pliocene, and often at the taxonomic level of genus. Accordingly, we used two metrics to characterize information content in tips and assess the degree to which conflict between trees was being driven by lower quality samples. Most instances of topological conflict and non-monophyletic genera in the species tree could be objectively identified using these metrics. For subclades still discordant after tip-based filtering, we used a machine learning approach to determine whether phylogenetic signal or noise was the more important predictor of metrics supporting the alternative topologies. We found that when signal favored one of the topologies, noise was the most important variable in poorly performing models that favored the alternative topology. In sum, we show that artifactual sources of gene tree discordance, which are likely a common phenomenon in many datasets, can be distinguished from biological sources by quantifying the information content in each tip and modeling which factors support each topology.
Collapse
Affiliation(s)
- Brian Tilston Smith
- Department of Ornithology, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024, USA
| | - Jon Merwin
- Department of Ornithology, Academy of Natural Sciences of Drexel University, 1900 Benjamin Franklin Parkway, Philadelphia, PA 19103, USA.,Department of Biodiversity, Earth, and Environmental Science, Drexel University, Philadelphia, PA 19103, USA
| | - Kaiya L Provost
- Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, 318 W. 12th Avenue, Columbus, OH 43210, USA
| | - Gregory Thom
- Museum of Natural Science and Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
| | - Robb T Brumfield
- Museum of Natural Science and Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
| | - Mateus Ferreira
- Centro de Estudos da Biodiversidade, Universidade Federal de Roraima, Av. Cap. Ene Garcez, 2413, Boa Vista, RR, Brazil
| | - William M Mauck Iii
- Department of Ornithology, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024, USA
| | - Robert G Moyle
- Department of Ecology and Evolutionary Biology and Biodiversity Institute, University of Kansas, 1345 Jayhawk Blvd., Lawrence, KS 66045, USA
| | - Timothy Wright
- Department of Biology, New Mexico State University, Las Cruces, NM, 88003, USA
| | - Leo Joseph
- Australian National Wildlife Collection, National Research Collections Australia, CSIRO, GPO Box 1700, Canberra, ACT, 2601, Australia
| |
Collapse
|
15
|
Ly-Trong N, Naser-Khdour S, Lanfear R, Minh BQ. AliSim: a fast and versatile phylogenetic sequence simulator for the genomic era. Mol Biol Evol 2022; 39:6577219. [PMID: 35511713 PMCID: PMC9113491 DOI: 10.1093/molbev/msac092] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Abstract
Sequence simulators play an important role in phylogenetics. Simulated data has many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programs exist, but the most feature-rich programs tend to be rather slow, and the fastest programs tend to be feature-poor. Here, we introduce AliSim, a new tool that can efficiently simulate biologically realistic alignments under a large range of complex evolutionary models. To achieve high performance across a wide range of simulation conditions, AliSim implements an adaptive approach that combines the commonly-used rate matrix and probability matrix approaches. AliSim takes 1.4 hours and 1.3 GB RAM to simulate alignments with one million sequences or sites, while popular software Seq-Gen, Dawg, and INDELible require two to five hours and 50 to 500 GB of RAM. We provide AliSim as an extension of the IQ-TREE software version 2.2, freely available at www.iqtree.org, and a comprehensive user tutorial at http://www.iqtree.org/doc/AliSim.
Collapse
Affiliation(s)
- Nhan Ly-Trong
- School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, ACT 2600, Australia
| | - Suha Naser-Khdour
- Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
| | - Robert Lanfear
- Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
| | - Bui Quang Minh
- School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, ACT 2600, Australia
| |
Collapse
|
16
|
De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets. PLoS Comput Biol 2022; 18:e1010056. [PMID: 35486906 PMCID: PMC9094560 DOI: 10.1371/journal.pcbi.1010056] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2021] [Revised: 05/11/2022] [Accepted: 03/25/2022] [Indexed: 11/26/2022] Open
Abstract
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100, 000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
Collapse
Affiliation(s)
- Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
| | - William Boulton
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
| | - Lukas Weilguny
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
| | - Conor R. Walker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- Department of Genetics, University of Cambridge, Cambridge, United Kingdom
| | - Yatish Turakhia
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, California, United States of America
| | - Russell Corbett-Detig
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Genomics Institute, University of California Santa Cruz, Santa Cruz, California, United States of America
| | - Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
| |
Collapse
|
17
|
Zaharias P, Grosshauser M, Warnow T. Re-evaluating Deep Neural Networks for Phylogeny Estimation: The Issue of Taxon Sampling. J Comput Biol 2022; 29:74-89. [PMID: 34986031 DOI: 10.1089/cmb.2021.0383] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
Deep neural networks (DNNs) have been recently proposed for quartet tree phylogeny estimation. Here, we present a study evaluating recently trained DNNs in comparison to a collection of standard phylogeny estimation methods on a heterogeneous collection of datasets simulated under the same models that were used to train the DNNs, and also under similar conditions but with higher rates of evolution. Our study shows that using DNNs with quartet amalgamation is less accurate than several standard phylogeny estimation methods we explore (e.g., maximum likelihood and maximum parsimony). We further find that simple standard phylogeny estimation methods match or improve on DNNs for quartet accuracy, especially, but not exclusively, when used in a global manner (i.e., the tree on the full dataset is computed and then the induced quartet trees are extracted from the full tree). Thus, our study provides evidence that a major challenge impacting the utility of current DNNs for phylogeny estimation is their restriction to estimating quartet trees that must subsequently be combined into a tree on the full dataset. In contrast, global methods (i.e., those that estimate trees from the full set of sequences) are able to benefit from taxon sampling, and hence have higher accuracy on large datasets.
Collapse
Affiliation(s)
- Paul Zaharias
- Department of Computer Science, University of Illinois, Urbana, Illinois, USA
| | | | - Tandy Warnow
- Department of Computer Science, University of Illinois, Urbana, Illinois, USA
| |
Collapse
|
18
|
De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: efficient simulation of sequence evolution for pandemic-scale datasets. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2021:2021.03.15.435416. [PMID: 33758852 PMCID: PMC7987011 DOI: 10.1101/2021.03.15.435416] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
Abstract
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutatability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
Collapse
Affiliation(s)
- Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - William Boulton
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Lukas Weilguny
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Conor R. Walker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, UK
| | - Yatish Turakhia
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA
| | - Russell Corbett-Detig
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
| | - Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| |
Collapse
|
19
|
Azouri D, Abadi S, Mansour Y, Mayrose I, Pupko T. Harnessing machine learning to guide phylogenetic-tree search algorithms. Nat Commun 2021; 12:1983. [PMID: 33790270 PMCID: PMC8012635 DOI: 10.1038/s41467-021-22073-8] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2020] [Accepted: 02/26/2021] [Indexed: 02/01/2023] Open
Abstract
Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.
Collapse
Affiliation(s)
- Dana Azouri
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel
| | - Shiran Abadi
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel
| | - Yishay Mansour
- Balvatnik School of Computer Science, Tel-Aviv University, Ramat Aviv, Tel-Aviv, Israel
| | - Itay Mayrose
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel.
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel.
| |
Collapse
|