1
Wagle S, Markin A, Górecki P, Anderson TK, Eulenstein O. Asymmetric Cluster-Based Measures for Comparative Phylogenetics. J Comput Biol 2024; 31:312-327. [PMID: 38634854 PMCID: PMC11057527 DOI: 10.1089/cmb.2023.0338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/19/2024]
Abstract
Phylogenetic inference and reconstruction methods generate hypotheses on evolutionary history. Competing inference methods are frequently used, and the evaluation of the generated hypotheses is achieved using tree comparison costs. The Robinson-Foulds (RF) distance is a widely used cost to compare the topology of two trees, but this cost is sensitive to tree error and can overestimate tree differences. To overcome this limitation, a refined version of the RF distance called the Cluster Affinity (CA) distance was introduced. However, CA distances are symmetric and cannot compare different types of trees. These asymmetric comparisons occur when gene trees are compared with species trees, when disparate datasets are integrated into a supertree, or when tree comparison measures are used to infer a phylogenetic network. In this study, we introduce the asymmetric CA cost, a relaxation of the original Affinity distance for comparing heterogeneous trees. We also develop a biologically interpretable cost, the Cluster Support cost, which normalizes by cluster size across gene trees. The characteristics of these costs are similar to those of the symmetric CA cost. We describe efficient algorithms, derive the exact diameters, and use these to standardize the costs so that they are applicable in practice. These costs provide objective, fine-scale, and biologically interpretable values that can assess differences and similarities between phylogenetic trees.
Affiliation(s)
- Sanket Wagle
- Department of Computer Science, Iowa State University, Ames, Iowa, USA
- Alexey Markin
- National Animal Disease Center, USDA-ARS, Ames, Iowa, USA
- Paweł Górecki
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland
- Oliver Eulenstein
- Department of Computer Science, Iowa State University, Ames, Iowa, USA
2
Burk RD, Mirabello L, DeSalle R. Distinguishing Genetic Drift from Selection in Papillomavirus Evolution. Viruses 2023; 15:1631. [PMID: 37631973 PMCID: PMC10458755 DOI: 10.3390/v15081631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Revised: 07/20/2023] [Accepted: 07/21/2023] [Indexed: 08/27/2023]
Abstract
Pervasive purifying selection on non-synonymous substitutions is a hallmark of papillomavirus genome history, but the role of selection on, and drift of, non-coding DNA motifs in HPV diversification is poorly understood. In this study, more than a thousand complete genomes representing Alphapapillomavirus types, lineages, and SNP variants were examined phylogenetically and interrogated for the number and position of non-coding DNA sequence motifs using Principal Components Analyses, Ancestral State Reconstructions, and Phylogenetic Independent Contrasts. For anciently diverged Alphapapillomavirus types, composition of the four nucleotides (A, C, G, T), codon usage, trimer usage, and 13 established non-coding DNA sequence motifs revealed phylogenetic clusters consistent with genetic drift. Ancestral state reconstruction and Phylogenetic Independent Contrasts revealed ancient genome alterations, particularly for the CpG and APOBEC3 motifs. Each evolutionary analytical method we performed supports the unanticipated conclusion that genetic drift and different evolutionary drivers have structured Alphapapillomavirus genomes in distinct ways during successive epochs, even extending to differences in more recently formed variant lineages.
Affiliation(s)
- Robert D. Burk
- Departments of Pediatrics, Microbiology & Immunology, Epidemiology & Population Health, Obstetrics, Gynecology and Woman’s Health, and Albert Einstein Cancer Center, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- Lisa Mirabello
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD 20850, USA
- Robert DeSalle
- Sackler Institute of Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA
3
Bogdanowicz D, Giaro K. Generalization of Phylogenetic Matching Metrics with Experimental Tests of Practical Advantages. J Comput Biol 2023; 30:261-276. [PMID: 36576792 DOI: 10.1089/cmb.2022.0090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/29/2022]
Abstract
The ability to quantify the dissimilarity of different phylogenetic trees is required in various types of phylogenetic studies; for example, such metrics are used to assess the quality of phylogeny construction methods and to define optimization criteria in supertree building algorithms. In this article, starting from the already described concept of matching metrics, we define three new metrics for rooted phylogenetic trees. One of them, the Matching Pair Jaccard (MPJ) distance, is still purely topological, but we now utilize the Jaccard index set dissimilarity measure in its construction. This modification substantially changes the structural features of the metric space. In particular, we investigate the properties of the previously known Matching Cluster Jaccard (MCJ) and the new MPJ metrics, such as the asymptotic behavior of their expected distance between two random trees, the space diameter, and the change of a distance after a single leaf relocation. The other two metrics, the Matching Cluster Weight-aware (MCW) and Matching Cluster Jaccard Weight-aware (MCJW) distances, are the first proposed generalizations of matching metrics designed for rooted phylogenies with branch lengths. The experimental tests of the practical utility of the phylogenetic metrics show the superiority of MCJ and MPJ over the previous best tree comparison method. To define the MCW and MCJW metrics, we introduce a general method for constructing matching metrics for weighted rooted phylogenetic trees.
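The abstract above builds its metrics on the Jaccard index as a set dissimilarity measure. As a minimal sketch of that one ingredient only (not the authors' MCJ or MPJ metrics, which additionally require a matching between the clusters of two trees), the Jaccard dissimilarity of two clusters can be computed as below. The function name and the representation of a cluster as a set of leaf labels are assumptions made for illustration.

```python
def jaccard_dissimilarity(cluster_a, cluster_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two clusters given as sets of leaf labels.

    Hypothetical helper for illustration; not taken from the cited article.
    """
    a, b = frozenset(cluster_a), frozenset(cluster_b)
    union = a | b
    if not union:
        # Convention assumed here: two empty clusters count as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two clusters sharing two of four distinct leaves.
d_near = jaccard_dissimilarity({"a", "b", "c"}, {"a", "b", "d"})
# Two clusters with no leaves in common.
d_disjoint = jaccard_dissimilarity({"a", "b"}, {"c", "d"})
print(d_near, d_disjoint)
```

Unlike a plain equal-or-not comparison, this value varies gradually with cluster overlap, which is the property the abstract attributes to the Jaccard-based construction.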
Affiliation(s)
- Damian Bogdanowicz
- Department of Algorithms and System Modeling, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gdansk, Poland
- Krzysztof Giaro
- Department of Algorithms and System Modeling, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gdansk, Poland
4
Goremykin V. Assessment of Absolute Substitution Model Fit Accommodating Time-Reversible and Non-Time-Reversible Evolutionary Processes. Syst Biol 2022:6632685. [PMID: 35792853 DOI: 10.1093/sysbio/syac046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2021] [Revised: 06/16/2022] [Accepted: 06/24/2022] [Indexed: 11/13/2022]
Abstract
The loss of information accompanying assessment of absolute fit of substitution models to phylogenetic data negatively affects the discriminatory power of previous methods and can make them insensitive to lineage-specific changes in the substitution process. As an alternative, I propose evaluating absolute fit of substitution models based on a novel statistic which describes the observed data without information loss and which is unlikely to become zero-inflated with increasing numbers of taxa. This method can accommodate gaps and is sensitive to lineage-specific shifts in the substitution process. In simulation experiments, it exhibits greater discriminatory power than previous methods. The method can be implemented in both Bayesian and Maximum Likelihood phylogenetic analyses, and used to screen any set of models. Recently, it has been suggested that model selection may be an unnecessary step in phylogenetic inference. However, results presented here emphasize the importance of model fit assessment for reliable phylogenetic inference.
Affiliation(s)
- Vadim Goremykin
- Research and Innovation Centre, Fondazione Edmund Mach, 38010 San Michele all'Adige (TN), Italy
5
Bhattacharya T, Rice DW, Crawford JM, Hardy RW, Newton ILG. Evidence of Adaptive Evolution in Wolbachia-Regulated Gene DNMT2 and Its Role in the Dipteran Immune Response and Pathogen Blocking. Viruses 2021; 13:1464. [PMID: 34452330 PMCID: PMC8402854 DOI: 10.3390/v13081464] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/27/2021] [Revised: 06/09/2021] [Accepted: 07/09/2021] [Indexed: 12/23/2022]
Abstract
Eukaryotic nucleic acid methyltransferase (MTase) proteins are essential mediators of epigenetic and epitranscriptomic regulation. DNMT2 belongs to a large, conserved family of DNA MTases found in many organisms, including holometabolous insects such as fruit flies and mosquitoes, where it is the lone MTase. Interestingly, despite its nomenclature, DNMT2 is not a DNA MTase, but instead targets and methylates RNA species. A growing body of literature suggests that DNMT2 mediates the host immune response against a wide range of pathogens, including RNA viruses. Curiously, although DNMT2 is antiviral in Drosophila, its expression promotes virus replication in mosquito species. We, therefore, sought to understand the divergent regulation, function, and evolution of these orthologs. We describe the role of the Drosophila-specific host protein IPOD in regulating the expression and function of fruit fly DNMT2. Heterologous expression of these orthologs suggests that DNMT2's role as an antiviral is host-dependent, indicating a requirement for additional host-specific factors. Finally, we identify and describe potential evidence of positive selection at different times throughout DNMT2 evolution within dipteran insects. We identify specific codons within each ortholog that are under positive selection and find that they are restricted to four distinct protein domains, which likely influence substrate binding, target recognition, and adaptation of unique intermolecular interactions. Collectively, our findings highlight the evolution of DNMT2 in Dipteran insects and point to structural, regulatory, and functional differences between mosquito and fruit fly homologs.
Affiliation(s)
- Tamanash Bhattacharya
- Department of Biology, Indiana University Bloomington, Bloomington, IN 47405, USA
- Basic Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Danny W. Rice
- Department of Biology, Indiana University Bloomington, Bloomington, IN 47405, USA
- John M. Crawford
- Department of Biology, Indiana University Bloomington, Bloomington, IN 47405, USA
- Richard W. Hardy
- Department of Biology, Indiana University Bloomington, Bloomington, IN 47405, USA
- Irene L. G. Newton
- Department of Biology, Indiana University Bloomington, Bloomington, IN 47405, USA
6
Smith MR. Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees. Bioinformatics 2020; 36:5007-5013. [DOI: 10.1093/bioinformatics/btaa614] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 08/29/2019] [Revised: 06/03/2020] [Accepted: 06/26/2020] [Indexed: 11/15/2022]
Abstract
Motivation
The Robinson–Foulds (RF) metric is widely used by biologists, linguists and chemists to quantify similarity between pairs of phylogenetic trees. The measure tallies the number of bipartition splits that occur in both trees—but this conservative approach ignores potential similarities between almost-identical splits, with undesirable consequences. ‘Generalized’ RF metrics address this shortcoming by pairing splits in one tree with similar splits in the other. Each pair is assigned a similarity score, the sum of which enumerates the similarity between two trees. The challenge lies in quantifying split similarity: existing definitions lack a principled statistical underpinning, resulting in misleading tree distances that are difficult to interpret. Here, I propose probabilistic measures of split similarity, which allow tree similarity to be measured in natural units (bits).
Results
My new information-theoretic metrics outperform alternative measures of tree similarity when evaluated against a broad suite of criteria, even though they do not account for the non-independence of splits within a single tree. Mutual clustering information exhibits none of the undesirable properties that characterize other tree comparison metrics, and should be preferred to the RF metric.
Availability and implementation
The methods discussed in this article are implemented in the R package ‘TreeDist’, archived at https://dx.doi.org/10.5281/zenodo.3528123.
Supplementary information
Supplementary data are available at Bioinformatics online.
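The Motivation paragraph above describes the RF measure as a tally over bipartition splits. A minimal sketch of the common distance form follows: it counts the splits found in exactly one of the two trees. It is not the TreeDist implementation and does not cover the generalized or information-theoretic metrics proposed in the article. The function names and the representation of a tree as a set of non-trivial splits are assumptions made for illustration.

```python
def canonical_split(side, leaves):
    """Represent the bipartition side | (leaves - side) by one canonical side.

    The side that does NOT contain the smallest leaf label is chosen, so that
    both ways of writing the same bipartition map to the same frozenset.
    Hypothetical helper for illustration.
    """
    leaves = frozenset(leaves)
    side = frozenset(side)
    reference = min(leaves)
    return leaves - side if reference in side else side


def rf_distance(splits_a, splits_b):
    """Count the splits present in exactly one of the two trees."""
    return len(set(splits_a) ^ set(splits_b))


leaves = {"a", "b", "c", "d", "e"}
# Unrooted tree ((a,b),c,(d,e)): non-trivial splits ab|cde and de|abc.
tree_1 = {canonical_split(s, leaves) for s in ({"a", "b"}, {"d", "e"})}
# Unrooted tree ((a,c),b,(d,e)): non-trivial splits ac|bde and de|abc.
tree_2 = {canonical_split(s, leaves) for s in ({"a", "c"}, {"d", "e"})}

distance = rf_distance(tree_1, tree_2)
print(distance)
```

In this example the splits ab|cde and ac|bde differ by a single leaf yet each contributes a full unit to the distance, which illustrates the insensitivity to almost-identical splits that the abstract criticizes.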
Affiliation(s)
- Martin R Smith
- Department of Earth Sciences, Lower Mountjoy, Durham University, Durham DH1 3LE, UK
7
Goluch T, Bogdanowicz D, Giaro K. Visual TreeCmp: Comprehensive Comparison of Phylogenetic Trees on the Web. Methods Ecol Evol 2020. [DOI: 10.1111/2041-210x.13358] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/25/2022]
Affiliation(s)
- Tomasz Goluch
- Department of Algorithms and System Modeling, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Gdańsk, Poland
- Damian Bogdanowicz
- Department of Algorithms and System Modeling, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Gdańsk, Poland
- Krzysztof Giaro
- Department of Algorithms and System Modeling, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Gdańsk, Poland
8
Maldonado E, Antunes A. LMAP_S: Lightweight Multigene Alignment and Phylogeny eStimation. BMC Bioinformatics 2019; 20:739. [PMID: 31888452 PMCID: PMC6937843 DOI: 10.1186/s12859-019-3292-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/04/2019] [Accepted: 11/26/2019] [Indexed: 01/22/2023]
Abstract
Background
Recent advances in genome sequencing technologies and the falling cost of high-throughput sequencing continue to give rise to a deluge of data available for downstream analyses. Among others, evolutionary biologists often make use of genomic data to uncover phenotypic diversity and adaptive evolution in protein-coding genes. Therefore, multiple sequence alignments (MSA) and phylogenetic trees (PT) need to be estimated with optimal results. However, the preparation of an initial dataset of multiple sequence file(s) (MSF) and the steps involved can be challenging when considering extensive amounts of data. Thus, a tool is needed that removes the potential sources of error and automates the time-consuming steps of a typical workflow with high-throughput and optimal MSA and PT estimations.
Results
We introduce LMAP_S (Lightweight Multigene Alignment and Phylogeny eStimation), a user-friendly command-line and interactive package designed to handle an improved alignment and phylogeny estimation workflow: MSF preparation, MSA estimation, outlier detection, refinement, consensus, phylogeny estimation, comparison and editing. File and directory organization, execution, and manipulation of information are automated, with minimal manual user intervention. LMAP_S was developed for the workstation multi-core environment and provides a unique advantage for processing multiple datasets. Our software proved to be efficient throughout the workflow, including the (unlimited) handling of more than 20 datasets.
Conclusions
We have developed a simple and versatile LMAP_S package enabling researchers to effectively estimate MSAs and PTs for multiple datasets in a high-throughput fashion. LMAP_S integrates more than 25 software tools, providing more than 65 algorithm choices overall, distributed in five stages. At minimum, one FASTA file is required within a single input directory. To our knowledge, no other software combines MSA and phylogeny estimation with as many alternatives and provides means to find optimal MSAs and phylogenies. Moreover, we used a case study comparing methodologies that highlighted the usefulness of our software. LMAP_S has been developed as an open-source package, allowing its integration into more complex open-source bioinformatics pipelines. The LMAP_S package is released under the GPLv3 license and is freely available at https://lmap-s.sourceforge.io/.
Affiliation(s)
- Emanuel Maldonado
- CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal
- Agostinho Antunes
- CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal
9
Puigbò P, Wolf YI, Koonin EV. Genome-Wide Comparative Analysis of Phylogenetic Trees: The Prokaryotic Forest of Life. Methods Mol Biol 2019; 1910:241-269. [PMID: 31278667 DOI: 10.1007/978-1-4939-9074-0_8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 12/12/2022]
Abstract
Genome-wide comparison of phylogenetic trees is becoming an increasingly common approach in evolutionary genomics, and a variety of approaches for such comparison have been developed. In this article we present several methods for comparative analysis of large numbers of phylogenetic trees. To compare phylogenetic trees taking into account the bootstrap support for each internal branch, the boot-split distance (BSD) method is introduced as an extension of the previously developed split distance (SD) method for tree comparison. The BSD method implements the straightforward idea that comparison of phylogenetic trees can be made more robust by treating tree splits differentially depending on the bootstrap support. Approaches are also introduced for detecting treelike and netlike evolutionary trends in the phylogenetic Forest of Life (FOL), i.e., the entirety of the phylogenetic trees for conserved genes of prokaryotes. The principal method employed for this purpose includes mapping quartets of species onto trees to calculate the support of each quartet topology and so to quantify the tree and net contributions to the distances between species. We describe the methods used to analyze the FOL and the results obtained with these methods. These results support the concept of the Tree of Life (TOL) as a central evolutionary trend in the FOL as opposed to the traditional view of the TOL as a "species tree."
Affiliation(s)
- Pere Puigbò
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Division of Genetics and Physiology, Department of Biology, University of Turku, Turku, Finland
- Yuri I Wolf
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA