1
Banos H, Wong TKF, Daneau J, Susko E, Minh BQ, Lanfear R, Brown MW, Eme L, Roger AJ. GTRpmix: A Linked General Time-Reversible Model for Profile Mixture Models. Mol Biol Evol 2024; 41:msae174. [PMID: 39158305 PMCID: PMC11371462 DOI: 10.1093/molbev/msae174] [Received: 04/05/2024] [Revised: 06/25/2024] [Accepted: 08/12/2024] [Indexed: 08/20/2024]
Abstract
Profile mixture models capture distinct biochemical constraints on the amino acid substitution process at different sites in proteins. These models feature a mixture of time-reversible models with a common matrix of exchangeabilities and distinct sets of equilibrium amino acid frequencies known as profiles. Combining the exchangeability matrix with each profile generates the matrix of instantaneous rates of amino acid exchange for that profile. Currently, empirically estimated exchangeability matrices (e.g. the LG matrix) are widely used for phylogenetic inference under profile mixture models. However, these were estimated using a single profile and are unlikely to be optimal for profile mixture models. Here, we describe the GTRpmix model that allows maximum likelihood estimation of a common exchangeability matrix under any profile mixture model. We show that exchangeability matrices estimated under profile mixture models differ from the LG matrix, dramatically improving model fit and topological estimation accuracy for empirical test cases. Because the GTRpmix model is computationally expensive, we provide two exchangeability matrices estimated from large concatenated phylogenomic supermatrices to be used for phylogenetic analyses. One, called Eukaryotic Linked Mixture (ELM), is designed for phylogenetic analysis of proteins encoded by nuclear genomes of eukaryotes, and the other, Eukaryotic and Archaeal Linked mixture (EAL), for reconstructing relationships between eukaryotes and Archaea. These matrices, combined with profile mixture models, fit data better and have improved topology estimation relative to the LG matrix combined with the same mixture models. Starting with version 2.3.1, IQ-TREE2 allows users to estimate linked exchangeabilities (i.e. amino acid exchange rates) under profile mixture models.
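The way a shared exchangeability matrix combines with each profile can be illustrated with a minimal sketch. It assumes the standard time-reversible construction, in which the rate from state i to state j is the exchangeability of the pair times the equilibrium frequency of j. The function name `rate_matrix`, the toy 3-state alphabet, the numeric values, and the per-profile normalization are illustrative assumptions, not taken from the paper or from IQ-TREE.

```python
def rate_matrix(exch, profile):
    """Combine symmetric exchangeabilities with one frequency profile.

    Off-diagonal entries are exch[i][j] * profile[j]; each diagonal entry
    makes its row sum to zero. Hypothetical helper, for illustration only.
    """
    n = len(profile)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * profile[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the row is summed
    # Assumed convention: rescale so the expected rate at equilibrium is 1.
    scale = -sum(profile[i] * q[i][i] for i in range(n))
    return [[x / scale for x in row] for row in q]


# Toy 3-state alphabet with made-up values (amino acid models use 20 states).
exch = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 3.0],
        [2.0, 3.0, 0.0]]
profiles = [[0.7, 0.2, 0.1],
            [0.1, 0.1, 0.8]]

# One rate matrix per profile, all sharing the same exchangeabilities.
rate_matrices = [rate_matrix(exch, p) for p in profiles]
```

Because the exchangeabilities are symmetric, each resulting matrix satisfies detailed balance with respect to its own profile, which is what makes every mixture component time-reversible.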
Affiliation(s)
- Hector Banos
  - Department of Mathematics, California State University San Bernardino, San Bernardino, CA, USA
  - Department of Biochemistry and Molecular Biology, Faculty of Medicine, Dalhousie University, Halifax, NS, Canada
- Thomas K F Wong
  - School of Computing, College of Engineering and Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
  - Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Justin Daneau
  - Department of Biochemistry and Molecular Biology, Faculty of Medicine, Dalhousie University, Halifax, NS, Canada
- Edward Susko
  - Department of Mathematics and Statistics, Faculty of Science, Dalhousie University, Halifax, NS, Canada
- Bui Quang Minh
  - School of Computing, College of Engineering and Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
- Robert Lanfear
  - Ecology and Evolution, Research School of Biology, College of Science, Australian National University, Canberra, ACT 2600, Australia
- Matthew W Brown
  - Department of Biological Sciences, Mississippi State University, Mississippi State, MS, USA
- Laura Eme
  - Laboratoire d’Ecologie, systématique et Evolution, Université Paris-Saclay, Gif-sur-Yvette, France
- Andrew J Roger
  - Department of Biochemistry and Molecular Biology, Faculty of Medicine, Dalhousie University, Halifax, NS, Canada
2
Redelings BD, Holmes I, Lunter G, Pupko T, Anisimova M. Insertions and Deletions: Computational Methods, Evolutionary Dynamics, and Biological Applications. Mol Biol Evol 2024; 41:msae177. [PMID: 39172750 PMCID: PMC11385596 DOI: 10.1093/molbev/msae177] [Received: 04/10/2024] [Revised: 07/02/2024] [Accepted: 07/09/2024] [Indexed: 08/24/2024]
Abstract
Insertions and deletions constitute the second most important source of natural genomic variation. Insertions and deletions account for up to 25% of genomic variants in humans and are involved in complex evolutionary processes including genomic rearrangements, adaptation, and speciation. Recent advances in long-read sequencing technologies allow detailed inference of insertion and deletion variation in species and populations. Yet, despite their importance, evolutionary studies have traditionally ignored or mishandled insertions and deletions due to a lack of comprehensive methodologies and statistical models of insertion and deletion dynamics. Here, we discuss methods for describing insertion and deletion variation and modeling insertions and deletions over evolutionary time. We provide practical advice for tackling insertions and deletions in genomic sequences and illustrate our discussion with examples of insertion- and deletion-induced effects in human and other natural populations and their contribution to evolutionary processes. We outline promising directions for future developments in statistical methodologies that would allow researchers to analyze insertion and deletion variation and its effects in large genomic data sets and to incorporate insertions and deletions in evolutionary inference.
Affiliation(s)
- Ian Holmes
  - Department of Bioengineering, University of California, Berkeley, CA 94720, USA
  - Calico Life Sciences LLC, South San Francisco, CA 94080, USA
- Gerton Lunter
  - Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen 9713 GZ, The Netherlands
- Tal Pupko
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Maria Anisimova
  - Institute of Computational Life Sciences, Zurich University of Applied Sciences, Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
3
Redmond AK. Acoelomorph flatworm monophyly is a long-branch attraction artefact obscuring a clade of Acoela and Xenoturbellida. Proc Biol Sci 2024; 291:20240329. [PMID: 39288803 PMCID: PMC11407873 DOI: 10.1098/rspb.2024.0329] [Received: 02/09/2024] [Revised: 06/27/2024] [Accepted: 07/30/2024] [Indexed: 09/19/2024]
Abstract
Acoelomorpha is a broadly accepted clade of bilaterian animals made up of the fast-evolving, morphologically simple, mainly marine flatworm lineages Acoela and Nemertodermatida. Phylogenomic studies support Acoelomorpha's close relationship with the slowly evolving and similarly simplistic Xenoturbella, together forming the phylum Xenacoelomorpha. The phylogenetic placement of Xenacoelomorpha amongst bilaterians is controversial, with some studies supporting Xenacoelomorpha as the sister group to all other bilaterians, implying that their simplicity may be representative of early bilaterians. Others propose that this placement is an error resulting from the fast-evolving Acoelomorpha, and instead suggest that they are the degenerate sister group to Ambulacraria. Perhaps as a result of this debate, internal xenacoelomorph relationships have been somewhat overlooked at a phylogenomic scale. Here, I employ a highly targeted approach to detect and overcome possible phylogenomic error in the relationship between Xenoturbella and the fast-evolving acoelomorph flatworms. The results indicate that the subphylum Acoelomorpha is a long-branch attraction artefact obscuring a previously undiscovered clade comprising Xenoturbella and Acoela, which I name Xenacoela. The findings also suggest that Xenacoelomorpha is not the sister group to all other bilaterians. This study provides a template for future efforts aimed at discovering and correcting unrecognized long-branch attraction artefacts throughout the tree of life.
4
Kar C, Raghavan R, Ummath A, Puthiyaalikom N, Idreesbabu KK, Sureshkumar S. Resolving fusilier puzzles: The identity of Squamosicaesio marri and Pterocaesio flavifasciata, and a new record of Flavicaesio suevica from the Western Indian Ocean. J Fish Biol 2024; 105:993-997. [PMID: 38811354 DOI: 10.1111/jfb.15808] [Received: 01/26/2024] [Revised: 04/24/2024] [Accepted: 05/09/2024] [Indexed: 05/31/2024]
Abstract
A phylogenetic analysis incorporating mitochondrial cox1 gene sequences of members of the family Caesionidae revealed the conspecificity of Pterocaesio flavifasciata and Squamosicaesio marri, which was also supported by the absence of any clear morphological diagnostic characters and meristic counts to separate the two species. Additionally, we provide the first record of the Suez fusilier, Flavicaesio suevica, from outside the Red Sea, based on specimens collected from the Laccadive archipelago, Western Indian Ocean. Together, these results show that the taxonomy, diversity, and distribution of members of the family Caesionidae continue to be poorly known, necessitating a comprehensive range-wide study.
Affiliation(s)
- Chinmay Kar
  - Biodiversity Laboratory, Faculty of Ocean Science and Technology, Kerala University of Fisheries and Ocean Studies (KUFOS), Kochi, India
- Rajeev Raghavan
  - Department of Fisheries Resource Management, Kerala University of Fisheries and Ocean Studies (KUFOS), Kochi, India
- Ameen Ummath
  - Department of Ocean Studies and Marine Biology, Pondicherry University, Port Blair, India
- Naseeba Puthiyaalikom
  - Department of Science and Technology, Union Territory of Lakshadweep, Kavaratti, India
- Sivanpillai Sureshkumar
  - Biodiversity Laboratory, Faculty of Ocean Science and Technology, Kerala University of Fisheries and Ocean Studies (KUFOS), Kochi, India
5
Berv JS, Singhal S, Field DJ, Walker-Hale N, McHugh SW, Shipley JR, Miller ET, Kimball RT, Braun EL, Dornburg A, Parins-Fukuchi CT, Prum RO, Winger BM, Friedman M, Smith SA. Genome and life-history evolution link bird diversification to the end-Cretaceous mass extinction. Sci Adv 2024; 10:eadp0114. [PMID: 39083615 PMCID: PMC11290531 DOI: 10.1126/sciadv.adp0114] [Received: 03/02/2024] [Accepted: 06/28/2024] [Indexed: 08/02/2024]
Abstract
Complex patterns of genome evolution associated with the end-Cretaceous [Cretaceous-Paleogene (K-Pg)] mass extinction limit our understanding of the early evolutionary history of modern birds. Here, we analyzed patterns of avian molecular evolution and identified distinct macroevolutionary regimes across exons, introns, untranslated regions, and mitochondrial genomes. Bird clades originating near the K-Pg boundary exhibited numerous shifts in the mode of molecular evolution, suggesting a burst of genomic heterogeneity at this point in Earth's history. These inferred shifts in substitution patterns were closely related to evolutionary shifts in developmental mode, adult body mass, and patterns of metabolic scaling. Our results suggest that the end-Cretaceous mass extinction triggered integrated patterns of evolution across avian genomes, physiology, and life history near the dawn of the modern bird radiation.
Affiliation(s)
- Jacob S. Berv
  - Department of Ecology and Evolutionary Biology, University of Michigan, 1105 North University Avenue, Biological Sciences Building, University of Michigan, Ann Arbor, MI 48109, USA
  - Museum of Paleontology, University of Michigan, 1105 North University Avenue, Biological Sciences Building, University of Michigan, Ann Arbor, MI 48109, USA
  - Museum of Zoology, University of Michigan, 1105 North University Avenue, Biological Sciences Building, University of Michigan, Ann Arbor, MI 48109, USA
- Sonal Singhal
  - Department of Biology, California State University, Dominguez Hills, Carson, CA 90747, USA
- Daniel J. Field
  - Department of Earth Sciences, University of Cambridge, Downing Street, Cambridge CB2 3EQ, UK
  - Museum of Zoology, University of Cambridge, Downing Street, Cambridge CB2 3EJ, UK
- Nathanael Walker-Hale
  - Department of Plant Sciences, University of Cambridge, Downing Street, Cambridge CB2 3EA, UK
- Sean W. McHugh
  - Department of Evolution, Ecology, and Population Biology, Washington University in St. Louis, St. Louis, MO 63130, USA
- J. Ryan Shipley
  - Department of Forest Dynamics, Swiss Federal Institute for Forest, Snow, and Landscape Research WSL, Zürcherstrasse 111 8903, Birmensdorf, Switzerland
- Eliot T. Miller
  - Center for Avian Population Studies, Cornell Lab of Ornithology, Cornell University, Ithaca, NY 14850, USA
- Rebecca T. Kimball
  - Department of Biology, University of Florida, Gainesville, FL 32611, USA
- Edward L. Braun
  - Department of Biology, University of Florida, Gainesville, FL 32611, USA
- Alex Dornburg
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- C. Tomomi Parins-Fukuchi
  - Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario M5S 3B2, Canada
- Richard O. Prum
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT 06520, USA
  - Peabody Museum of Natural History, Yale University, New Haven, CT 06520, USA
- Benjamin M. Winger
  - Department of Ecology and Evolutionary Biology, University of Michigan, 1105 North University Avenue, Biological Sciences Building, University of Michigan, Ann Arbor, MI 48109, USA
  - Museum of Zoology, University of Michigan, 1105 North University Avenue, Biological Sciences Building, University of Michigan, Ann Arbor, MI 48109, USA
- Matt Friedman
  - Museum of Paleontology, University of Michigan, 1105 North University Avenue, Biological Sciences Building, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Earth and Environmental Sciences, University of Michigan, 1100 North University Avenue, University of Michigan, Ann Arbor, MI 48109, USA
- Stephen A. Smith
  - Department of Ecology and Evolutionary Biology, University of Michigan, 1105 North University Avenue, Biological Sciences Building, University of Michigan, Ann Arbor, MI 48109, USA
6
Suvorov A, Schrider DR. Reliable estimation of tree branch lengths using deep neural networks. PLoS Comput Biol 2024; 20:e1012337. [PMID: 39102450 PMCID: PMC11326709 DOI: 10.1371/journal.pcbi.1012337] [Received: 06/12/2023] [Revised: 08/15/2024] [Accepted: 07/18/2024] [Indexed: 08/07/2024]
Abstract
A phylogenetic tree represents a hypothesized evolutionary history for a set of taxa. Besides the branching patterns (i.e., tree topology), phylogenies contain information about the evolutionary distances (i.e., branch lengths) between all taxa in the tree, which include extant taxa (external nodes) and their last common ancestors (internal nodes). During phylogenetic tree inference, the branch lengths are typically co-estimated along with other phylogenetic parameters during tree topology space exploration. There are well-known regions of the branch length parameter space where accurate estimation of phylogenetic trees is especially difficult. Several studies have recently demonstrated that machine learning approaches have the potential to help solve phylogenetic problems with greater accuracy and computational efficiency. In this study, as a proof of concept, we sought to explore whether machine learning models can predict branch lengths. To that end, we designed several deep learning frameworks to estimate branch lengths on fixed tree topologies from multiple sequence alignments or their representations. Our results show that deep learning methods can exhibit superior performance in some difficult regions of branch length parameter space. For example, in contrast to maximum likelihood inference, which is typically used for estimating branch lengths, deep learning methods are more efficient and accurate. In general, we find that our neural networks achieve similar accuracy to a Bayesian approach and are the best-performing methods when inferring long branches that are associated with distantly related taxa. Together, our findings represent a next step toward accurate, fast, and reliable phylogenetic inference with machine learning approaches.
Affiliation(s)
- Anton Suvorov
  - Department of Biological Sciences, Virginia Tech, Blacksburg, Virginia, United States of America
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Daniel R Schrider
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
7
Wong TKF, Cherryh C, Rodrigo AG, Hahn MW, Minh BQ, Lanfear R. MAST: Phylogenetic Inference with Mixtures Across Sites and Trees. Syst Biol 2024; 73:375-391. [PMID: 38421146 PMCID: PMC11282360 DOI: 10.1093/sysbio/syae008] [Received: 09/28/2022] [Revised: 12/18/2023] [Accepted: 02/27/2024] [Indexed: 03/02/2024]
Abstract
Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting (ILS), introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call mixtures across sites and trees (MAST). This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of ILS in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of 4 Platyrrhine species for which standard concatenated maximum likelihood (ML) and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e., the largest proportion of sites) to the tree also supported by gene tree approaches. These results suggest that the MAST model is able to analyze a concatenated alignment using ML while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.
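The role of tree weights in a multi-tree mixture can be sketched as follows, assuming the generic finite-mixture form in which the likelihood of each site is the weighted sum of its likelihoods under the individual trees. The function names and the per-site likelihood values below are hypothetical; in practice those values would come from a phylogenetic likelihood calculation, and nothing here reflects the actual IQ-TREE implementation.

```python
import math


def mixture_log_likelihood(site_liks, weights):
    """Log-likelihood of an alignment under a weighted mixture of trees.

    site_liks[k][s] is the likelihood of site s under tree k (assumed given).
    """
    n_sites = len(site_liks[0])
    total = 0.0
    for s in range(n_sites):
        total += math.log(sum(w * liks[s] for w, liks in zip(weights, site_liks)))
    return total


def site_posteriors(site_liks, weights):
    """Posterior probability that each site belongs to each tree."""
    n_sites = len(site_liks[0])
    posteriors = []
    for s in range(n_sites):
        joint = [w * liks[s] for w, liks in zip(weights, site_liks)]
        norm = sum(joint)
        posteriors.append([x / norm for x in joint])
    return posteriors


# Made-up per-site likelihoods for two trees and three sites.
site_liks = [[0.02, 0.001, 0.03],
             [0.001, 0.04, 0.02]]
weights = [0.6, 0.4]

log_lik = mixture_log_likelihood(site_liks, weights)
posteriors = site_posteriors(site_liks, weights)
```

Under this form, a tree's weight can be read as the expected proportion of sites it explains, which matches the abstract's gloss of the highest weight as the largest proportion of sites.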
Affiliation(s)
- Thomas K F Wong
  - School of Computing, Australian National University, Canberra, ACT 2601, Australia
- Caitlin Cherryh
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- Allen G Rodrigo
  - School of Biological Sciences, University of Auckland, Auckland 1142, New Zealand
- Matthew W Hahn
  - Department of Biology and Department of Computer Science, Indiana University, Bloomington, Indiana 47405, USA
- Bui Quang Minh
  - School of Computing, Australian National University, Canberra, ACT 2601, Australia
- Robert Lanfear
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
8
Fleming J, Eriksen PM, Struck TH. Scoutknife: A naïve, whole genome informed phylogenetic robusticity metric. F1000Res 2024; 12:945. [PMID: 38799242 PMCID: PMC11128044 DOI: 10.12688/f1000research.139356.1] [Accepted: 07/05/2024] [Indexed: 05/29/2024]
Abstract
Background: The phylogenetic bootstrap, first proposed by Felsenstein in 1985, is a critically important statistical method in assessing the robusticity of phylogenetic datasets. Core to its concept was the use of pseudo sampling - assessing the data by generating new replicates derived from the initial dataset that was used to generate the phylogeny. In this way, phylogenetic support metrics could overcome the lack of perfect, infinite data. With infinite data, however, it is possible to sample smaller replicates directly from the data to obtain both the phylogeny and its statistical robusticity in the same analysis. Due to the growth of whole genome sequencing, the depth and breadth of our datasets have greatly expanded and are set to only expand further. With genome-scale datasets comprising thousands of genes, we can now obtain a proxy for infinite data. Accordingly, we can potentially abandon the notion of pseudo sampling and instead randomly sample small subsets of genes from the thousands of genes in our analyses. Methods: We introduce Scoutknife, a jackknife-style subsampling implementation that generates 100 datasets by randomly sampling a small number of genes from an initial large-gene dataset to jointly establish both a phylogenetic hypothesis and assess its robusticity. We assess its effectiveness by using 18 previously published datasets and 100 simulation studies. Results: We show that Scoutknife is conservative and informative as to conflicts and incongruence across the whole genome, without the need for subsampling based on traditional model selection criteria. Conclusions: Scoutknife reliably achieves comparable results to selecting the best genes on both real and simulation datasets, while being resistant to the potential biases caused by selecting for model fit. As the amount of genome data grows, it becomes an even more exciting option to assess the robusticity of phylogenetic hypotheses.
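The subsampling step described in the Methods can be sketched roughly as below. The abstract specifies 100 replicate datasets drawn at random from a large gene set; the number of genes per replicate, sampling without replacement within a replicate, the fixed seed, the function names, and the support summary (fraction of replicate trees containing a clade) are all assumptions made for illustration, not details of the Scoutknife implementation.

```python
import random


def subsample_genes(genes, n_replicates=100, genes_per_replicate=10, seed=0):
    """Draw replicate gene sets, each sampled without replacement.

    genes_per_replicate and seed are illustrative defaults, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    return [rng.sample(genes, genes_per_replicate) for _ in range(n_replicates)]


def clade_support(replicate_has_clade):
    """Fraction of replicate trees in which a clade of interest appears."""
    return sum(replicate_has_clade) / len(replicate_has_clade)


# Hypothetical genome-scale dataset of 2000 gene identifiers.
genes = [f"gene_{i:04d}" for i in range(2000)]
replicates = subsample_genes(genes)
```

Each replicate gene set would then be concatenated and passed to a tree-inference program, which is outside the scope of this sketch.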
Affiliation(s)
- James Fleming
  - Natural History Museum, Universitetet i Oslo, Oslo, Oslo, 0562, Norway
9
Efimenko B, Popadin K, Gunbin K. NeMu: a comprehensive pipeline for accurate reconstruction of neutral mutation spectra from evolutionary data. Nucleic Acids Res 2024; 52:W108-W115. [PMID: 38795067 PMCID: PMC11223800 DOI: 10.1093/nar/gkae438] [Received: 03/07/2024] [Revised: 04/23/2024] [Accepted: 05/09/2024] [Indexed: 05/27/2024]
Abstract
The recognized importance of mutational spectra in molecular evolution is yet to be fully exploited beyond human cancer studies and model organisms. The wealth of intraspecific polymorphism data in the GenBank repository, covering a broad spectrum of genes and species, presents an untapped opportunity for detailed mutational spectrum analysis. Existing methods fall short by ignoring intermediate substitutions on the inner branches of phylogenetic trees and lacking the capability for cross-species mutational comparisons. To address these challenges, we present the NeMu pipeline, available at https://nemu-pipeline.com, a tool grounded in phylogenetic principles designed to provide comprehensive and scalable analysis of mutational spectra. Utilizing extensive sequence data from numerous available genome projects, NeMu rapidly and accurately reconstructs the neutral mutational spectrum. This tool, facilitating the reconstruction of gene- and species-specific mutational spectra, contributes to a deeper understanding of evolutionary mechanisms across the broad spectrum of known species.
Affiliation(s)
- Bogdan Efimenko
  - Center for Mitochondrial Functional Genomics, Immanuel Kant Baltic Federal University, Kaliningrad, Russia
  - A.A. Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia
- Konstantin Popadin
  - Center for Mitochondrial Functional Genomics, Immanuel Kant Baltic Federal University, Kaliningrad, Russia
- Konstantin Gunbin
  - Center for Mitochondrial Functional Genomics, Immanuel Kant Baltic Federal University, Kaliningrad, Russia
  - A.A. Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia
  - Institute of Molecular and Cellular Biology SB RAS, Novosibirsk, Russia
10
Ecker N, Huchon D, Mansour Y, Mayrose I, Pupko T. A machine-learning-based alternative to phylogenetic bootstrap. Bioinformatics 2024; 40:i208-i217. [PMID: 38940166 PMCID: PMC11211842 DOI: 10.1093/bioinformatics/btae255] [Indexed: 06/29/2024]
Abstract
Motivation: Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance. Results: Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures. We demonstrate the applicability of our approach on empirical datasets. Availability and Implementation: The data supporting this work are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.25050554.v1, and the underlying code is accessible via GitHub at https://github.com/noaeker/bootstrap_repo.
Affiliation(s)
- Noa Ecker
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Dorothée Huchon
  - School of Zoology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
  - The Steinhardt Museum of Natural History and National Research Center, Tel Aviv University, Tel Aviv 6997801, Israel
- Yishay Mansour
  - The Blavatnik School of Computer Science, Raymond & Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Itay Mayrose
  - School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Tal Pupko
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
11
Baños H, Susko E, Roger AJ. Is Over-parameterization a Problem for Profile Mixture Models? Syst Biol 2024; 73:53-75. [PMID: 37843172 PMCID: PMC11129589 DOI: 10.1093/sysbio/syad063] [Received: 02/18/2022] [Revised: 09/12/2023] [Accepted: 10/13/2023] [Indexed: 10/17/2023]
Abstract
Biochemical constraints on the admissible amino acids at specific sites in proteins lead to heterogeneity of the amino acid substitution process over sites in alignments. It is well known that phylogenetic models of protein sequence evolution that do not account for site heterogeneity are prone to long-branch attraction (LBA) artifacts. Profile mixture models were developed to model heterogeneity of preferred amino acids at sites via a finite distribution of site classes each with a distinct set of equilibrium amino acid frequencies. However, it is unknown whether the large number of parameters in such models associated with the many amino acid frequency vectors can adversely affect tree topology estimates because of over-parameterization. Here, we demonstrate theoretically that for long sequences, over-parameterization does not create problems for estimation with profile mixture models. Under mild conditions, tree, amino acid frequencies, and other model parameters converge to true values as sequence length increases, even when there are large numbers of components in the frequency profile distributions. Because large sample theory does not necessarily imply good behavior for shorter alignments, we explore the performance of these models with short alignments simulated with tree topologies that are prone to LBA artifacts. We find that over-parameterization is not a problem for complex profile mixture models even when there are many amino acid frequency vectors. In fact, simple models with few site classes behave poorly. Interestingly, we also found that misspecification of the amino acid frequency vectors does not lead to increased LBA artifacts as long as the estimated cumulative distribution function of the amino acid frequencies at sites adequately approximates the true one. In contrast, misspecification of the amino acid exchangeability rates can severely negatively affect parameter estimation. Finally, we explore the effects of including in the profile mixture model an additional "F-class" representing the overall frequencies of amino acids in the data set. Surprisingly, the F-class does not help parameter estimation significantly and can decrease the probability of correct tree estimation, depending on the scenario, even though it tends to improve likelihood scores.
Collapse
Affiliation(s)
- Hector Baños
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
  - Institute for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
  - Institute for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Andrew J Roger
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
  - Institute for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
12
|
Bossert S, Pauly A, Danforth BN, Orr MC, Murray EA. Lessons from assembling UCEs: A comparison of common methods and the case of Clavinomia (Halictidae). Mol Ecol Resour 2024; 24:e13925. [PMID: 38183389 DOI: 10.1111/1755-0998.13925] [Received: 02/21/2023] [Revised: 12/08/2023] [Accepted: 12/21/2023] [Indexed: 01/08/2024]
Abstract
Sequence data assembly is a foundational step in high-throughput sequencing, with untold consequences for downstream analyses. Despite this, few studies have interrogated the many methods for assembling phylogenomic UCE data for their comparative efficacy, or for how outputs may be impacted. We study this by comparing the most commonly used assembly methods for UCEs in the under-studied bee lineage Nomiinae and a representative sampling of relatives. Data for 63 UCE-only and 75 mixed taxa were assembled with five methods, including ABySS, HybPiper, SPAdes, Trinity and Velvet, and then benchmarked for their relative performance in terms of locus capture parameters and phylogenetic reconstruction. Unexpectedly, Trinity and Velvet trailed the other methods in terms of locus capture and DNA matrix density, whereas SPAdes performed favourably in most assessed metrics. In comparison with SPAdes, the guided-assembly approach HybPiper generally recovered the highest quality loci but in lower numbers. Based on our results, we formally move Clavinomia to Dieunomiini and render Epinomia once more a subgenus of Dieunomia. We strongly advise that future studies more closely examine the influence of assembly approach on their results, or, minimally, use better-performing assembly methods such as SPAdes or HybPiper. In this way, we can move forward with phylogenomic studies in a more standardized, comparable manner.
Affiliation(s)
- Silas Bossert
  - Department of Entomology, Washington State University, Pullman, Washington, USA
  - Department of Entomology, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
- Alain Pauly
  - Royal Belgian Institute of Natural Sciences, O.D. Taxonomy and Phylogeny, Brussels, Belgium
- Bryan N Danforth
  - Department of Entomology, Cornell University, Ithaca, New York, USA
- Michael C Orr
  - Entomologie, Staatliches Museum für Naturkunde Stuttgart, Stuttgart, Germany
- Elizabeth A Murray
  - Department of Entomology, Washington State University, Pullman, Washington, USA
13
|
Jiang Z, Zang W, Ericson PGP, Song G, Wu S, Feng S, Drovetski SV, Liu G, Zhang D, Saitoh T, Alström P, Edwards SV, Lei F, Qu Y. Gene flow and an anomaly zone complicate phylogenomic inference in a rapidly radiated avian family (Prunellidae). BMC Biol 2024; 22:49. [PMID: 38413944 PMCID: PMC10900574 DOI: 10.1186/s12915-024-01848-7] [Received: 11/11/2023] [Accepted: 02/15/2024] [Indexed: 02/29/2024] Open
Abstract
BACKGROUND Resolving the phylogeny of rapidly radiating lineages presents a challenge when building the Tree of Life. The Old World avian family Prunellidae (accentors) comprises twelve species that rapidly diversified at the Pliocene-Pleistocene boundary. RESULTS Here we investigate the phylogenetic relationships of all species of Prunellidae using a chromosome-level de novo assembly of Prunella strophiata and 36 high-coverage resequenced genomes. We use homologous alignments of thousands of exonic and intronic loci to build the coalescent and concatenated phylogenies and recover four different species trees. Topology tests show a large degree of gene tree-species tree discordance, but only 40-54% of intronic gene trees and 36-75% of exonic gene trees can be explained by incomplete lineage sorting and gene tree estimation errors. Estimated branch lengths for three successive internal branches in the inferred species trees suggest the existence of an empirical anomaly zone. The most common topology recovered for species in this anomaly zone was not similar to any coalescent or concatenated inference phylogenies, suggesting the presence of anomalous gene trees. However, this interpretation is complicated by the presence of gene flow because extensive introgression was detected among these species. When exploring tree topology distributions, introgression, and regional variation in recombination rate, we find that many autosomal regions contain signatures of introgression and thus may mislead phylogenetic inference. Conversely, the phylogenetic signal is concentrated in regions with a low recombination rate, such as the Z chromosome, which are also more resistant to interspecific introgression. CONCLUSIONS Collectively, our results suggest that phylogenomic inference should consider the underlying genomic architecture to maximize the consistency of phylogenomic signal.
Affiliation(s)
- Zhiyong Jiang
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Wenqing Zang
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Per G P Ericson
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, PO Box 50007, Stockholm, SE-104 05, Sweden
- Gang Song
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Shaoyuan Wu
  - Jiangsu International Joint Center of Genomics, Jiangsu Key Laboratory of Phylogenomics & Comparative Genomics, School of Life Sciences, Jiangsu Normal University, Xuzhou, 221116, Jiangsu, China
- Shaohong Feng
  - Center for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, Hangzhou, 310058, China
  - Liangzhu Laboratory, Zhejiang University, 1369 West Wenyi Road, Hangzhou, 311121, China
  - Innovation Center of Yangtze River Delta, Zhejiang University, Jiashan, 314102, China
- Sergei V Drovetski
  - National Museum of Natural History, Smithsonian Institution, Washington, DC, 20004, USA
  - Present address: U.S. Geological Survey, Eastern Ecological Science Center at Patuxent Research Refuge, Laurel, MD, 20708, USA
- Gang Liu
  - Chinese Academy of Forestry, Institute of Ecological Conservation and Restoration, Beijing, 100091, China
- Dezhi Zhang
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Takema Saitoh
  - Yamashina Institute for Ornithology, Abiko, Chiba, Japan
- Per Alström
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - Animal Ecology, Department of Ecology and Genetics, Evolutionary Biology Centre, Uppsala University, Norbyvägen 18 D, 752 36, Uppsala, Sweden
- Scott V Edwards
  - Museum of Comparative Zoology and Department of Organismic & Evolutionary Biology, Harvard University, 26 Oxford Street, Cambridge, MA, 02138, USA
- Fumin Lei
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yanhua Qu
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, PO Box 50007, Stockholm, SE-104 05, Sweden
14
|
Trost J, Haag J, Höhler D, Jacob L, Stamatakis A, Boussau B. Simulations of Sequence Evolution: How (Un)realistic They Are and Why. Mol Biol Evol 2024; 41:msad277. [PMID: 38124381 PMCID: PMC10768886 DOI: 10.1093/molbev/msad277] [Received: 07/14/2023] [Revised: 11/17/2023] [Accepted: 12/08/2023] [Indexed: 12/23/2023] Open
Abstract
MOTIVATION Simulating multiple sequence alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal to represent as faithfully as possible the sequence evolution process and thus simulate empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods are able to predict whether a given MSA is simulated or empirical. RESULTS Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition.
Affiliation(s)
- Johanna Trost
  - Biometry and Evolutionary Biology Laboratory (LBBE), University Claude Bernard Lyon 1, Lyon, France
- Julia Haag
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Dimitri Höhler
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Laurent Jacob
  - CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Sorbonne Université, Paris 75005, France
- Alexandros Stamatakis
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
  - Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology - Hellas, Heraklion, Crete, Greece
  - Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Bastien Boussau
  - Biometry and Evolutionary Biology Laboratory (LBBE), University Claude Bernard Lyon 1, Lyon, France
15
|
Kar C, Mariyambi PC, Raghavan R, Sureshkumar S. Mitochondrial phylogeny of fusilier fishes (family Caesionidae) from the Laccadive archipelago reveals a new species and two new records from the Central Indian Ocean. J Fish Biol 2023; 103:1445-1451. [PMID: 37667092 DOI: 10.1111/jfb.15553] [Received: 06/03/2023] [Revised: 08/30/2023] [Accepted: 08/31/2023] [Indexed: 09/06/2023]
Abstract
Fusiliers of the family Caesionidae comprise a group of Indo-Pacific reef fishes important in the live bait and artisanal fisheries in many parts of their range, particularly in the Indian Ocean region. Using newly generated mitochondrial COI sequences of 10 species of caesionid fishes from the Laccadive archipelago, we carried out a molecular phylogenetic analysis, which has helped improve our understanding of the diversity, distribution, and systematics of this poorly known group of fishes. The two speciose genera within Caesionidae, Caesio and Pterocaesio, were revealed to be paraphyletic, and as a result, four names earlier considered as subgenera within Caesionidae (Flavicaesio, Odontonectes, Pisinnicaesio, and Squamosicaesio) were elevated to the status of distinct genera. We also discovered the presence of a new lineage in the Central Indian Ocean, sister to Caesio caerulaurea and Caesio xanthalytos, but distinct from both in several morphological characters and by a genetic distance of between 2% and 3% in the mitochondrial COI gene. We describe this lineage as Caesio idreesi, a new species, with a distribution spanning the Laccadive Sea and the Bay of Bengal. Our genetic data also helped confirm the first records of two species, Pisinnicaesio digramma and Squamosicaesio randalli, from the Central Indian Ocean, and a new distribution record for C. xanthalytos in the Laccadive Sea. Combined, these results have helped bridge key biodiversity knowledge gaps of the family Caesionidae and form an excellent baseline for further investigations on their taxonomy, systematics, and life history.
Affiliation(s)
- Chinmay Kar
  - Department of Marine Biology, Biodiversity Laboratory, Faculty of Ocean Science and Technology, Kerala University of Fisheries and Ocean Studies (KUFOS), Kochi, India
- Puthiyara Chetta Mariyambi
  - Department of Marine Biology, Biodiversity Laboratory, Faculty of Ocean Science and Technology, Kerala University of Fisheries and Ocean Studies (KUFOS), Kochi, India
- Rajeev Raghavan
  - Department of Fisheries Resource Management, Kerala University of Fisheries and Ocean Studies (KUFOS), Kochi, India
- Sivanpillai Sureshkumar
  - Department of Marine Biology, Biodiversity Laboratory, Faculty of Ocean Science and Technology, Kerala University of Fisheries and Ocean Studies (KUFOS), Kochi, India
16
|
Smith ML, Hahn MW. Phylogenetic inference using generative adversarial networks. Bioinformatics 2023; 39:btad543. [PMID: 37669126 PMCID: PMC10500083 DOI: 10.1093/bioinformatics/btad543] [Received: 04/28/2023] [Revised: 08/25/2023] [Accepted: 09/04/2023] [Indexed: 09/07/2023] Open
Abstract
MOTIVATION The application of machine learning approaches in phylogenetics has been impeded by the vast model space associated with inference. Supervised machine learning approaches require data from across this space to train models. Because of this, previous approaches have typically been limited to inferring relationships among unrooted quartets of taxa, where there are only three possible topologies. Here, we explore the potential of generative adversarial networks (GANs) to address this limitation. GANs consist of a generator and a discriminator: at each step, the generator aims to create data that is similar to real data, while the discriminator attempts to distinguish generated and real data. By using an evolutionary model as the generator, we use GANs to make evolutionary inferences. Since a new model can be considered at each iteration, heuristic searches of complex model spaces are possible. Thus, GANs offer a potential solution to the challenges of applying machine learning in phylogenetics. RESULTS We developed phyloGAN, a GAN that infers phylogenetic relationships among species. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and infers a phylogenetic tree either considering or ignoring gene tree heterogeneity. We explored the performance of phyloGAN for up to 15 taxa in the concatenation case and 6 taxa when considering gene tree heterogeneity. Error rates are relatively low in these simple cases. However, run times are slow and performance metrics suggest issues during training. Future work should explore novel architectures that may result in more stable and efficient GANs for phylogenetics. AVAILABILITY AND IMPLEMENTATION phyloGAN is available on github: https://github.com/meganlsmith/phyloGAN/.
Affiliation(s)
- Megan L Smith
  - Department of Biology, Indiana University, 1001 E 3rd St, Bloomington, IN 47405, United States
- Matthew W Hahn
  - Department of Biology, Indiana University, 1001 E 3rd St, Bloomington, IN 47405, United States
  - Department of Computer Science, Indiana University, 700 N Woodlawn Avenue, Bloomington, IN 47408, United States
17
|
Ly-Trong N, Barca GMJ, Minh BQ. AliSim-HPC: parallel sequence simulator for phylogenetics. Bioinformatics 2023; 39:btad540. [PMID: 37656933 PMCID: PMC10534053 DOI: 10.1093/bioinformatics/btad540] [Received: 12/22/2022] [Revised: 08/16/2023] [Accepted: 08/31/2023] [Indexed: 09/03/2023] Open
Abstract
MOTIVATION Sequence simulation plays a vital role in phylogenetics with many applications, such as evaluating phylogenetic methods, testing hypotheses, and generating training data for machine-learning applications. We recently introduced a new simulator for multiple sequence alignments called AliSim, which outperformed existing tools. However, with the increasing demands of simulating large data sets, AliSim is still slow due to its sequential implementation; for example, to simulate millions of sequence alignments, AliSim took several days or weeks. Parallelization has been used for many phylogenetic inference methods but not yet for sequence simulation. RESULTS This paper introduces AliSim-HPC, which, for the first time, employs high-performance computing for phylogenetic simulations. AliSim-HPC parallelizes the simulation process at both multi-core and multi-CPU levels using the OpenMP and message passing interface (MPI) libraries, respectively. AliSim-HPC is highly efficient and scalable, which reduces the runtime to simulate 100 large gap-free alignments (30 000 sequences of one million sites) from over one day to 11 min using 256 CPU cores from a cluster with six computing nodes, a 153-fold speedup. While the OpenMP version can only simulate gap-free alignments, the MPI version supports insertion-deletion models like the sequential AliSim. AVAILABILITY AND IMPLEMENTATION AliSim-HPC is open-source and available as part of the new IQ-TREE version v2.2.3 at https://github.com/iqtree/iqtree2/releases with a user manual at http://www.iqtree.org/doc/AliSim.
Affiliation(s)
- Nhan Ly-Trong
  - School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
- Giuseppe M J Barca
  - School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
- Bui Quang Minh
  - School of Computing, College of Engineering, Computing and Cybernetics, Australian National University, Canberra, ACT 2600, Australia
18
|
Rivera-Rivera CJ, Grbic D. CastNet: a systems-level sequence evolution simulator. BMC Bioinformatics 2023; 24:247. [PMID: 37308829 DOI: 10.1186/s12859-023-05366-1] [Received: 02/09/2023] [Accepted: 05/26/2023] [Indexed: 06/14/2023] Open
Abstract
BACKGROUND Simulating DNA evolution has been done through coevolution-agnostic probabilistic frameworks for the past 3 decades. The most common implementation is by using the converse of the probabilistic approach used to infer phylogenies which, in the simplest form, simulates a single sequence at a time. However, biological systems are multi-genic, and gene products can affect each other's evolutionary paths through coevolution. These crucial evolutionary dynamics still remain to be simulated, and we believe that modelling them can lead to profound insights for comparative genomics. RESULTS Here we present CastNet, a genome evolution simulator that assumes each genome is a collection of genes with constantly evolving regulatory interactions between them. The regulatory interactions produce a phenotype in the form of gene expression profiles, upon which fitness is calculated. A genetic algorithm is then used to evolve a population of such entities through a user-defined phylogeny. Importantly, the regulatory mutations are a response to sequence mutations, thus creating a one-to-one relationship between the rate of evolution of sequences and that of regulatory parameters. This is, to our knowledge, the first time the evolution of sequences and regulation have been explicitly linked in a simulation, despite there being a multitude of sequence evolution simulators, and a handful of models to simulate Gene Regulatory Network (GRN) evolution. In our test runs, we see a coevolutionary signal among genes that are active in the GRN, and neutral evolution in genes that are not included in the network, showing that selective pressures imposed on the regulatory output of the genes are reflected in their sequences. CONCLUSION We believe that CastNet represents a substantial step for developing new tools to study genome evolution, and more broadly, coevolutionary webs and complex evolving systems. This simulator also provides a new framework to study molecular evolution where sequence coevolution has a leading role.
Affiliation(s)
- Djordje Grbic
  - IT-University of Copenhagen, Rued Langgaards Vej 7, 2300, Copenhagen, Denmark
19
|
Smith CH, Pinto BJ, Kirkpatrick M, Hillis DM, Pfeiffer JM, Havird JC. A tale of two paths: The evolution of mitochondrial recombination in bivalves with doubly uniparental inheritance. J Hered 2023; 114:199-206. [PMID: 36897956 PMCID: PMC10212130 DOI: 10.1093/jhered/esad004] [Received: 10/22/2022] [Accepted: 01/19/2023] [Indexed: 03/12/2023] Open
Abstract
In most animals, mitochondrial DNA is strictly maternally inherited and non-recombining. One exception to this pattern is called doubly uniparental inheritance (DUI), a phenomenon involving the independent transmission of female and male mitochondrial genomes. DUI is known only from the molluskan class Bivalvia. The phylogenetic distribution of male-transmitted mitochondrial DNA (M mtDNA) in bivalves is consistent with several evolutionary scenarios, including multiple independent gains, losses, and varying degrees of recombination with female-transmitted mitochondrial DNA (F mtDNA). In this study, we use phylogenetic methods to test M mtDNA origination hypotheses and infer the prevalence of mitochondrial recombination in bivalves with DUI. Phylogenetic modeling using site concordance factors supported a single origin of M mtDNA in bivalves coupled with recombination acting over long evolutionary timescales. Ongoing mitochondrial recombination is present in Mytilida and Venerida, which results in a pattern of concerted evolution of F mtDNA and M mtDNA. Mitochondrial recombination could be favored to offset the deleterious effects of asexual inheritance and maintain mitonuclear compatibility across tissues. Cardiida and Unionida have gone without recent recombination, possibly due to an extension of the COX2 gene in male mitochondrial DNA. The loss of recombination could be connected to the role of M mtDNA in sex determination or sexual development. Our results support that recombination events may occur throughout the mitochondrial genomes of DUI species. Future investigations may reveal more complex patterns of inheritance of recombinants, which could explain the retention of signal for a single origination of M mtDNA in protein-coding genes.
Affiliation(s)
- Chase H Smith
  - Department of Integrative Biology, University of Texas, Austin, TX, United States
- Brendan J Pinto
  - Center for Evolutionary Medicine & Public Health, Arizona State University, Tempe, AZ, United States
  - Department of Zoology, Milwaukee Public Museum, Milwaukee, WI, United States
- Mark Kirkpatrick
  - Department of Integrative Biology, University of Texas, Austin, TX, United States
- David M Hillis
  - Department of Integrative Biology, University of Texas, Austin, TX, United States
- John M Pfeiffer
  - Department of Invertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Washington, DC, United States
  - Department of Integrative Biology, University of Texas, Austin, TX, United States
- Justin C Havird
  - Department of Integrative Biology, University of Texas, Austin, TX, United States
20
|
Fleming JF, Struck TH. nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets. BMC Bioinformatics 2023; 24:145. [PMID: 37046225 PMCID: PMC10099917 DOI: 10.1186/s12859-023-05270-8] [Received: 12/07/2022] [Accepted: 04/04/2023] [Indexed: 04/14/2023] Open
Abstract
MOTIVATION Compositional heterogeneity (when the proportions of nucleotides and amino acids are not broadly similar across the dataset) is a cause of a great number of phylogenetic artefacts. Whilst a variety of methods can identify it post hoc, few metrics exist to quantify compositional heterogeneity prior to the computationally intensive task of phylogenetic tree reconstruction. Here we assess the efficacy of one such existing, widely used metric, Relative Composition Frequency Variability (RCFV), using both real and simulated data. RESULTS Our results show that RCFV can be biased by sequence length, the number of taxa, and the number of possible character states within the dataset. However, we also find that missing data does not appear to have an appreciable effect on RCFV. We discuss the theory behind this and its consequences for future use of the RCFV value, and propose a new metric, nRCFV, which accounts for these biases. Alongside this, we present new software that calculates both RCFV and nRCFV, called nRCFV_Reader. AVAILABILITY AND IMPLEMENTATION nRCFV has been implemented in RCFV_Reader, available at: https://github.com/JFFleming/RCFV_Reader. Both our simulation and real data are available at Datadryad: https://doi.org/10.5061/dryad.wpzgmsbpn.
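To make the metric concrete, the sketch below computes plain RCFV under its usual definition: the sum over taxa and character states of the absolute deviation of each taxon's character frequency from the across-taxon mean, divided by the number of taxa. It is an illustration only and is not the RCFV_Reader implementation; the function name `rcfv`, the dict-based alignment format, and the handling of gaps (characters outside the alphabet are ignored) are assumptions made here. The nRCFV normalisation proposed in the paper is not reproduced.

```python
from collections import Counter


def rcfv(alignment, alphabet="ACGT"):
    """Relative Composition Frequency Variability of an alignment.

    alignment: dict mapping taxon name -> sequence string (hypothetical format).
    alphabet:  character states to count; anything else (gaps, N, X) is ignored.
    """
    n_taxa = len(alignment)
    if n_taxa == 0:
        raise ValueError("alignment is empty")

    # Per-taxon relative frequency of each character state.
    freqs = []
    for seq in alignment.values():
        counts = Counter(c for c in seq.upper() if c in alphabet)
        total = sum(counts.values())
        freqs.append({c: (counts[c] / total if total else 0.0) for c in alphabet})

    # Mean frequency of each character state across all taxa.
    means = {c: sum(f[c] for f in freqs) / n_taxa for c in alphabet}

    # Sum of absolute deviations from the mean, scaled by the number of taxa.
    return sum(abs(f[c] - means[c]) for f in freqs for c in alphabet) / n_taxa


if __name__ == "__main__":
    homogeneous = {"t1": "ACGTACGT", "t2": "TGCATGCA"}
    skewed = {"t1": "AAAA", "t2": "CCCC"}
    print(rcfv(homogeneous), rcfv(skewed))
```

Because the raw sum depends on how many taxa and character states contribute terms, a value computed this way is not directly comparable between datasets of different size, which is the kind of bias the abstract describes.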
Affiliation(s)
- James F Fleming
  - University of Oslo Natural History Museum, Sars' Gata 1, Oslo, Norway
- Torsten H Struck
  - University of Oslo Natural History Museum, Sars' Gata 1, Oslo, Norway
21
|
Sukhorukov GA, Paramonov AI, Lisak OV, Kozlova IV, Bazykin GA, Neverov AD, Karan LS. The Baikal subtype of tick-borne encephalitis virus is evident of recombination between Siberian and Far-Eastern subtypes. PLoS Negl Trop Dis 2023; 17:e0011141. [PMID: 36972237 PMCID: PMC10079218 DOI: 10.1371/journal.pntd.0011141] [Received: 06/07/2022] [Revised: 04/06/2023] [Accepted: 02/06/2023] [Indexed: 03/29/2023] Open
Abstract
Tick-borne encephalitis virus (TBEV) is a flavivirus that causes an acute or sometimes chronic infection that frequently has severe neurological consequences, and is a major public health threat in Eurasia. TBEV is genetically classified into three distinct subtypes; however, at least one group of isolates, the Baikal subtype, also referred to as “886-84-like”, challenges this classification. Baikal TBEV is a persistent group that has been repeatedly isolated from ticks and small mammals in the Buryat Republic, Irkutsk and Trans-Baikal regions of Russia for several decades. One case of meningoencephalitis with a lethal outcome caused by this subtype was described in Mongolia in 2010. While recombination is frequent in Flaviviridae, its role in the evolution of TBEV has not been established. Here, we isolate and sequence four novel Baikal TBEV samples obtained in Eastern Siberia. Using a set of methods for inference of recombination events, including a newly developed phylogenetic method allowing for formal statistical testing for such events in the past, we find robust support for a difference in phylogenetic histories between genomic regions, indicating recombination at the origin of the Baikal TBEV. This finding extends our understanding of the role of recombination in the evolution of this human pathogen.
Affiliation(s)
- Grigorii A. Sukhorukov
  - Center of Life Sciences, Skolkovo Institute of Science and Technology, Skolkovo, Russia
  - * E-mail: (GAS); (GAB); (ADN)
- Alexey I. Paramonov
  - Laboratory of molecular Epidemiology and genetic diagnosis, Scientific Centre for Family Health and Human Reproduction Problems, Irkutsk, Russia
- Oksana V. Lisak
  - Laboratory of molecular Epidemiology and genetic diagnosis, Scientific Centre for Family Health and Human Reproduction Problems, Irkutsk, Russia
- Irina V. Kozlova
  - Laboratory of molecular Epidemiology and genetic diagnosis, Scientific Centre for Family Health and Human Reproduction Problems, Irkutsk, Russia
- Georgii A. Bazykin
  - Center of Life Sciences, Skolkovo Institute of Science and Technology, Skolkovo, Russia
  - Laboratory of Molecular Evolution, Kharkevich Institute for Information Transmission Problems of the RAS, Moscow, Russia
  - * E-mail: (GAS); (GAB); (ADN)
- Alexey D. Neverov
  - HSE University, Moscow, Russia
  - Department of Molecular Diagnostics, Central Research Institute for Epidemiology, Moscow, Russia
  - * E-mail: (GAS); (GAB); (ADN)
- Lyudmila S. Karan
  - Department of Molecular Diagnostics, Central Research Institute for Epidemiology, Moscow, Russia
22
|
Legall N, Salvador LCM. Selective sweep sites and SNP dense regions differentiate Mycobacterium bovis isolates across scales. Front Microbiol 2022; 13:787856. [PMID: 36160199 PMCID: PMC9489834 DOI: 10.3389/fmicb.2022.787856] [Received: 10/01/2021] [Accepted: 08/08/2022] [Indexed: 11/28/2022] Open
Abstract
Mycobacterium bovis, a bacterial zoonotic pathogen responsible for the economically and agriculturally important livestock disease bovine tuberculosis (bTB), infects a broad mammalian host range worldwide. This characteristic has led to bidirectional transmission events between livestock and wildlife species as well as the formation of wildlife reservoirs, impacting the success of bTB control measures. Next Generation Sequencing (NGS) has transformed our ability to understand disease transmission events by tracking variant sites; however, the genomic signatures related to host adaptation following spillover, alongside the role of other genomic factors in the M. bovis transmission process, remain understudied. We analyzed publicly available M. bovis datasets collected from 700 hosts across three countries with bTB endemic regions (United Kingdom, United States, and New Zealand) to investigate whether genomic regions with high SNP density and/or selective sweep sites play a role in M. bovis adaptation to new environments (e.g., at the host-species, geographical, and/or sub-population levels). A simulated M. bovis alignment was created to generate null distributions for defining genomic regions with high SNP counts and regions with evidence of selective sweeps. Random Forest (RF) models were used to investigate evolutionary metrics within the genomic regions of interest to determine which genomic processes were the best for classifying M. bovis across ecological scales. We identified 14 high SNP density regions and 132 selective sweep regions in the M. bovis genomes. Selective sweep regions were ranked as the most important in classifying M. bovis across the different scales in all RF models. SNP dense regions were found to have high importance in the badger- and cattle-specific RF models for distinguishing badger-derived isolates from livestock-derived ones. Additionally, the genes detected within these genomic regions harbor various pathogenic functions such as virulence and immunogenicity, membrane structure, host survival, and mycobactin production. The results of this study demonstrate how comparative genomics alongside machine learning approaches are useful to investigate further the nature of M. bovis host-pathogen interactions.
Affiliation(s)
- Noah Legall
  - Interdisciplinary Disease Ecology Across Scales Research Traineeship Program, University of Georgia, Athens, GA, United States
  - Institute of Bioinformatics, University of Georgia, Athens, GA, United States
  - Center for the Ecology of Infectious Diseases, University of Georgia, Athens, GA, United States
- Liliana C. M. Salvador
  - Institute of Bioinformatics, University of Georgia, Athens, GA, United States
  - Center for the Ecology of Infectious Diseases, University of Georgia, Athens, GA, United States
  - Department of Infectious Diseases, College of Veterinary Medicine, University of Georgia, Athens, GA, United States