1
|
Brazão JM, Foster PG, Cox CJ. Data-specific substitution models improve protein-based phylogenetics. PeerJ 2023; 11:e15716. [PMID: 37576497 PMCID: PMC10416777 DOI: 10.7717/peerj.15716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2023] [Accepted: 06/16/2023] [Indexed: 08/15/2023] Open
Abstract
Calculating amino-acid substitution models that are specific for individual protein data sets is often difficult due to the computational burden of estimating large numbers of rate parameters. In this study, we tested the computational efficiency and accuracy of five methods used to estimate substitution models, namely Codeml, FastMG, IQ-TREE, P4 (maximum likelihood), and P4 (Bayesian inference). Data-specific substitution models were estimated from simulated alignments (with different lengths) that were generated from a known simulation model and simulation tree. Each of the resulting data-specific substitution models was used to calculate the maximum likelihood score of the simulation tree and simulated data that was used to calculate the model, and compared with the maximum likelihood scores of the known simulation model and simulation tree on the same simulated data. Additionally, the commonly-used empirical models, cpREV and WAG, were assessed similarly. Data-specific models performed better than the empirical models, which under-fitted the simulated alignments, had the highest difference to the simulation model maximum-likelihood score, clustered further from the simulation model in principal component analysis ordination, and inferred less accurate trees. Data-specific models and the simulation model shared statistically indistinguishable maximum-likelihood scores, indicating that the five methods were reasonably accurate at estimating substitution models by this measure. Nevertheless, tree statistics showed differences between optimal maximum likelihood trees. Unlike other model estimating methods, trees inferred using data-specific models generated with IQ-TREE and P4 (maximum likelihood) were not significantly different from the trees derived from the simulation model in each analysis, indicating that these two methods alone were the most accurate at estimating data-specific models. To show the benefits of using data-specific protein models several published data sets were reanalysed using IQ-TREE-estimated models. These newly estimated models were a better fit to the data than the empirical models that were used by the original authors, often inferred longer trees, and resulted in different tree topologies in more than half of the re-analysed data sets. The results of this study show that software availability and high computation burden are not limitations to generating better-fitting data-specific amino-acid substitution models for phylogenetic analyses.
Collapse
Affiliation(s)
- João M. Brazão
- Centro de Ciências do Mar, Universidade do Algarve, Faro, Algarve, Portugal
| | - Peter G. Foster
- Department of Life Sciences, Natural History Museum, London, United Kingdom
| | - Cymon J. Cox
- Centro de Ciências do Mar, Universidade do Algarve, Faro, Algarve, Portugal
| |
Collapse
|
2
|
Rangel LT, Soucy SM, Setubal JC, Gogarten JP, Fournier GP. An efficient, non-phylogenetic method for detecting genes sharing evolutionary signals in phylogenomic datasets. Genome Biol Evol 2021; 13:6352501. [PMID: 34390574 PMCID: PMC8483891 DOI: 10.1093/gbe/evab187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/11/2021] [Indexed: 11/25/2022] Open
Abstract
Assessing the compatibility between gene family phylogenies is a crucial and often computationally demanding step in many phylogenomic analyses. Here, we describe the Evolutionary Similarity Index (IES), a means to assess shared evolution between gene families using a weighted orthogonal distance regression model applied to sequence distances. The utilization of pairwise distance matrices circumvents comparisons between gene tree topologies, which are inherently uncertain and sensitive to evolutionary model choice, phylogenetic reconstruction artifacts, and other sources of error. Furthermore, IES enables the many-to-many pairing of multiple copies between similarly evolving gene families. This is done by selecting non-overlapping pairs of copies, one from each assessed family, and yielding the least sum of squared residuals. Analyses of simulated gene family data sets show that IES’s accuracy is on par with popular tree-based methods while also less susceptible to noise introduced by sequence alignment and evolutionary model fitting. Applying IES to an empirical data set of 1,322 genes from 42 archaeal genomes identified eight major clusters of gene families with compatible evolutionary trends. The most cohesive cluster consisted of 62 genes with compatible evolutionary signal, which occur as both single-copy and multiple homologs per genome; phylogenetic analysis of concatenated alignments from this cluster produced a tree closely matching previously published species trees for Archaea. Four other clusters are mainly composed of accessory genes with limited distribution among Archaea and enriched toward specific metabolic functions. Pairwise evolutionary distances obtained from these accessory gene clusters suggest patterns of interphyla horizontal gene transfer. An IES implementation is available at https://github.com/lthiberiol/evolSimIndex.
Collapse
Affiliation(s)
- Luiz Thibério Rangel
- Department of Earth, Atmospheric and Planetary Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Corresponding author: E-mail:
| | - Shannon M Soucy
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire, USA
| | - João C Setubal
- Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Brasil
| | - Johann Peter Gogarten
- Department of Molecular and Cell Biology, University of Connecticut, USA
- Institute for Systems Genomics, University of Connecticut, USA
| | - Gregory P Fournier
- Department of Earth, Atmospheric and Planetary Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| |
Collapse
|
3
|
Irisarri I, Strassert JFH, Burki F. Phylogenomic Insights into the Origin of Primary Plastids. Syst Biol 2021; 71:105-120. [PMID: 33988690 DOI: 10.1093/sysbio/syab036] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2020] [Revised: 05/07/2021] [Accepted: 05/10/2021] [Indexed: 11/13/2022] Open
Abstract
The origin of plastids was a major evolutionary event that paved the way for an astonishing diversification of photosynthetic eukaryotes. Plastids originated by endosymbiosis between a heterotrophic eukaryotic host and cyanobacteria, presumably in a common ancestor of the primary photosynthetic eukaryotes (Archaeplastida). A single origin of primary plastids is well supported by plastid evidence but not by nuclear phylogenomic analyses, which have consistently failed to recover the monophyly of Archaeplastida hosts. Importantly, plastid monophyly and non-monophyletic hosts could be explained under scenarios of independent or serial eukaryote-to-eukaryote endosymbioses. Here, we assessed the strength of the signal for the monophyly of Archaeplastida hosts in four available phylogenomic datasets. The effect of phylogenetic methodology, data quality, alignment trimming strategy, gene and taxon sampling, and the presence of outlier genes were investigated. Our analyses revealed a lack of support for host monophyly in the shorter individual datasets. However, when analyzed together under rigorous data curation and complex mixture models, the combined nuclear datasets supported the monophyly of primary photosynthetic eukaryotes (Archaeplastida) and revealed a putative association with plastid-lacking Picozoa. This study represents an important step towards better understanding deep eukaryotic evolution and the origin of plastids.
Collapse
Affiliation(s)
- Iker Irisarri
- Department of Organismal Biology (Systematic Biology), Uppsala University, Norbyv. 18D, 75236 Uppsala, Sweden.,Department of Biodiversity and Evolutionary Biology, Museo Nacional de Ciencias Naturales, José Gutiérrez Abascal 2, 28006 Madrid, Spain
| | - Jürgen F H Strassert
- Department of Organismal Biology (Systematic Biology), Uppsala University, Norbyv. 18D, 75236 Uppsala, Sweden.,Department of Ecosystem Research, Leibniz Institute of Freshwater Ecology and Inland Fisheries, Müggelseedamm 301, 12587 Berlin, Germany
| | - Fabien Burki
- Department of Organismal Biology (Systematic Biology), Uppsala University, Norbyv. 18D, 75236 Uppsala, Sweden.,Science For Life Laboratory, Uppsala University, 75236 Sweden
| |
Collapse
|
4
|
Bansal MS. Linear-time algorithms for phylogenetic tree completion under Robinson-Foulds distance. Algorithms Mol Biol 2020; 15:6. [PMID: 32313549 PMCID: PMC7155338 DOI: 10.1186/s13015-020-00166-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2020] [Accepted: 04/04/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We consider two fundamental computational problems that arise when comparing phylogenetic trees, rooted or unrooted, with non-identical leaf sets. The first problem arises when comparing two trees where the leaf set of one tree is a proper subset of the other. The second problem arises when the two trees to be compared have only partially overlapping leaf sets. The traditional approach to handling these problems is to first restrict the two trees to their common leaf set. An alternative approach that has shown promise is to first complete the trees by adding missing leaves, so that the resulting trees have identical leaf sets. This requires the computation of an optimal completion that minimizes the distance between the two resulting trees over all possible completions. RESULTS We provide optimal linear-time algorithms for both completion problems under the widely-used Robinson-Foulds (RF) distance measure. Our algorithm for the first problem improves the time complexity of the current fastest algorithm from quadratic (in the size of the two trees) to linear. No algorithms have yet been proposed for the more general second problem where both trees have missing leaves. We advance the study of this general problem by proposing a useful restricted version of the general problem and providing optimal linear-time algorithms for the restricted version. Our experimental results on biological data sets suggest that completion-based RF distances can be very different compared to traditional RF distances.
Collapse
|
5
|
Avino M, Ng GT, He Y, Renaud MS, Jones BR, Poon AFY. Tree shape-based approaches for the comparative study of cophylogeny. Ecol Evol 2019; 9:6756-6771. [PMID: 31312429 PMCID: PMC6618157 DOI: 10.1002/ece3.5185] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2019] [Revised: 02/21/2019] [Accepted: 03/29/2019] [Indexed: 12/17/2022] Open
Abstract
Cophylogeny is the congruence of phylogenetic relationships between two different groups of organisms due to their long-term interaction. We investigated the use of tree shape distance measures to quantify the degree of cophylogeny. We implemented a reverse-time simulation model of pathogen phylogenies within a fixed host tree, given cospeciation probability, host switching, and pathogen speciation rates. We used this model to evaluate 18 distance measures between host and pathogen trees including two kernel distances that we developed for labeled and unlabeled trees, which use branch lengths and accommodate different size trees. Finally, we used these measures to revisit published cophylogenetic studies, where authors described the observed associations as representing a high or low degree of cophylogeny. Our simulations demonstrated that some measures are more informative than others with respect to specific coevolution parameters especially when these did not assume extreme values. For real datasets, trees' associations projection revealed clustering of high concordance studies suggesting that investigators are describing it in a consistent way. Our results support the hypothesis that measures can be useful for quantifying cophylogeny. This motivates their usage in the field of coevolution and supports the development of simulation-based methods, i.e., approximate Bayesian computation, to estimate the underlying coevolutionary parameters.
Collapse
Affiliation(s)
- Mariano Avino
- Department of Pathology and Laboratory Medicine Western University London Ontario Canada
| | - Garway T Ng
- Department of Pathology and Laboratory Medicine Western University London Ontario Canada
| | - Yiying He
- Department of Pathology and Laboratory Medicine Western University London Ontario Canada
| | - Mathias S Renaud
- Department of Pathology and Laboratory Medicine Western University London Ontario Canada
| | - Bradley R Jones
- BC Centre for Excellence in HIV/AIDS Vancouver British Columbia Canada
| | - Art F Y Poon
- Department of Pathology and Laboratory Medicine Western University London Ontario Canada.,Department of Applied Mathematics Western University London Ontario Canada
| |
Collapse
|
6
|
Blackburne BP, Whelan S. Class of multiple sequence alignment algorithm affects genomic analysis. Mol Biol Evol 2012; 30:642-53. [PMID: 23144040 DOI: 10.1093/molbev/mss256] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
Multiple sequence alignment (MSA) is the heart of comparative sequence analysis. Recent studies demonstrate that MSA algorithms can produce different outcomes when analyzing genomes, including phylogenetic tree inference and the detection of adaptive evolution. These studies also suggest that the difference between MSA algorithms is of a similar order to the uncertainty within an algorithm and suggest integrating across this uncertainty. In this study, we examine further the problem of disagreements between MSA algorithms and how they affect downstream analyses. We also investigate whether integrating across alignment uncertainty affects downstream analyses. We address these questions by analyzing 200 chordate gene families, with properties reflecting those used in large-scale genomic analyses. We find that newly developed distance metrics reveal two significantly different classes of MSA methods (MSAMs). The similarity-based class includes progressive aligners and consistency aligners, representing many methodological innovations for sequence alignment, whereas the evolution-based class includes phylogenetically aware alignment and statistical alignment. We proceed to show that the class of an MSAM has a substantial impact on downstream analyses. For phylogenetic inference, tree estimates and their branch lengths appear highly dependent on the class of aligner used. The number of families, and the sites within those families, inferred to have undergone adaptive evolution depend on the class of aligner used. Similarity-based aligners tend to identify more adaptive evolution. We also develop and test methods for incorporating MSA uncertainty when detecting adaptive evolution but find that although accounting for MSA uncertainty does affect downstream analyses, it appears less important than the class of aligner chosen. Our results demonstrate the critical role that MSA methodology has on downstream analysis, highlighting that the class of aligner chosen in an analysis has a demonstrable effect on its outcome.
Collapse
|
7
|
|
8
|
Battagliero S, Puglia G, Vicario S, Rubino F, Scioscia G, Leo P. An efficient algorithm for approximating geodesic distances in tree space. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1196-1207. [PMID: 21116041 DOI: 10.1109/tcbb.2010.121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
The increasing use of phylogeny in biological studies is limited by the need to make available more efficient tools for computing distances between trees. The geodesic tree distance-introduced by Billera, Holmes, and Vogtmann-combines both the tree topology and edge lengths into a single metric. Despite the conceptual simplicity of the geodesic tree distance, algorithms to compute it don't scale well to large, real-world phylogenetic trees composed of hundred or even thousand leaves. In this paper, we propose the geodesic distance as an effective tool for exploring the likelihood profile in the space of phylogenetic trees, and we give a cubic time algorithm, GeoHeuristic, in order to compute an approximation of the distance. We compare it with the GTP algorithm, which calculates the exact distance, and the cone path length, which is another approximation, showing that GeoHeuristic achieves a quite good trade-off between accuracy (relative error always lower than 0.0001) and efficiency. We also prove the equivalence among GeoHeuristic, cone path, and Robinson-Foulds distances when assuming branch lengths equal to unity and we show empirically that, under this restriction, these distances are almost always equal to the actual geodesic.
Collapse
|
9
|
Huber KT, Spillner A, Suchecki R, Moulton V. Metrics on multilabeled trees: interrelationships and diameter bounds. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1029-1040. [PMID: 21116046 DOI: 10.1109/tcbb.2010.122] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
Multilabeled trees or MUL-trees, for short, are trees whose leaves are labeled by elements of some nonempty finite set X such that more than one leaf may be labeled by the same element of X. This class of trees includes phylogenetic trees and tree shapes. MUL-trees arise naturally in, for example, biogeography and gene evolution studies and also in the area of phylogenetic network reconstruction. In this paper, we introduce novel metrics which may be used to compare MUL-trees, most of which generalize well-known metrics on phylogenetic trees and tree shapes. These metrics can be used, for example, to better understand the space of MUL-trees or to help visualize collections of MUL-trees. In addition, we describe some relationships between the MUL-tree metrics that we present and also give some novel diameter bounds for these metrics. We conclude by briefly discussing some open problems as well as pointing out how MUL-tree metrics may be used to define metrics on the space of phylogenetic networks.
Collapse
Affiliation(s)
- Katharina T Huber
- School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK.
| | | | | | | |
Collapse
|
10
|
Huggins PM, Li W, Haws D, Friedrich T, Liu J, Yoshida R. Bayes estimators for phylogenetic reconstruction. Syst Biol 2011; 60:528-40. [PMID: 21471560 DOI: 10.1093/sysbio/syr021] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Tree reconstruction methods are often judged by their accuracy, measured by how close they get to the true tree. Yet, most reconstruction methods like maximum likelihood (ML) do not explicitly maximize this accuracy. To address this problem, we propose a Bayesian solution. Given tree samples, we propose finding the tree estimate that is closest on average to the samples. This "median" tree is known as the Bayes estimator (BE). The BE literally maximizes posterior expected accuracy, measured in terms of closeness (distance) to the true tree. We discuss a unified framework of BE trees, focusing especially on tree distances that are expressible as squared euclidean distances. Notable examples include Robinson-Foulds (RF) distance, quartet distance, and squared path difference. Using both simulated and real data, we show that BEs can be estimated in practice by hill-climbing. In our simulation, we find that BEs tend to be closer to the true tree, compared with ML and neighbor joining. In particular, the BE under squared path difference tends to perform well in terms of both path difference and RF distances.
Collapse
Affiliation(s)
- P M Huggins
- Lane Center for Computational Biology, Carnegie Mellon University, Mellon Institute Building, 4400 Fifth Avenue, Pittsburgh, PA 15213, USA
| | | | | | | | | | | |
Collapse
|
11
|
Owen M, Provan JS. A fast algorithm for computing geodesic distances in tree space. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:2-13. [PMID: 21071792 DOI: 10.1109/tcbb.2010.3] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
Comparing and computing distances between phylogenetic trees are important biological problems, especially for models where edge lengths play an important role. The geodesic distance measure between two phylogenetic trees with edge lengths is the length of the shortest path between them in the continuous tree space introduced by Billera, Holmes, and Vogtmann. This tree space provides a powerful tool for studying and comparing phylogenetic trees, both in exhibiting a natural distance measure and in providing a euclidean-like structure for solving optimization problems on trees. An important open problem is to find a polynomial time algorithm for finding geodesics in tree space. This paper gives such an algorithm, which starts with a simple initial path and moves through a series of successively shorter paths until the geodesic is attained.
Collapse
Affiliation(s)
- Megan Owen
- Department of Mathematics, University of California, Berkeley, MC 3840, Berkeley, CA 94720-0432, USA.
| | | |
Collapse
|
12
|
Sundberg K, Clement M, Snell Q. On the use of cartographic projections in visualizing phylo-genetic tree space. Algorithms Mol Biol 2010; 5:26. [PMID: 20529355 PMCID: PMC2891783 DOI: 10.1186/1748-7188-5-26] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2009] [Accepted: 06/08/2010] [Indexed: 11/10/2022] Open
Abstract
Phylogenetic analysis is becoming an increasingly important tool for biological research. Applications include epidemiological studies, drug development, and evolutionary analysis. Phylogenetic search is a known NP-Hard problem. The size of the data sets which can be analyzed is limited by the exponential growth in the number of trees that must be considered as the problem size increases. A better understanding of the problem space could lead to better methods, which in turn could lead to the feasible analysis of more data sets. We present a definition of phylogenetic tree space and a visualization of this space that shows significant exploitable structure. This structure can be used to develop search methods capable of handling much larger data sets.
Collapse
|