1
|
Fogg J, Allman ES, Ané C. PhyloCoalSimulations: A Simulator for Network Multispecies Coalescent Models, Including a New Extension for the Inheritance of Gene Flow. Syst Biol 2023; 72:1171-1179. [PMID: 37254872 DOI: 10.1093/sysbio/syad030] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2023] [Revised: 05/03/2023] [Accepted: 05/15/2023] [Indexed: 06/01/2023] Open
Abstract
We consider the evolution of phylogenetic gene trees along phylogenetic species networks, according to the network multispecies coalescent process, and introduce a new network coalescent model with correlated inheritance of gene flow. This model generalizes two traditional versions of the network coalescent: with independent or common inheritance. At each reticulation, multiple lineages of a given locus are inherited from parental populations chosen at random, either independently across lineages or with positive correlation according to a Dirichlet process. This process may account for locus-specific probabilities of inheritance, for example. We implemented the simulation of gene trees under these network coalescent models in the Julia package PhyloCoalSimulations, which depends on PhyloNetworks and its powerful network manipulation tools. Input species phylogenies can be read in extended Newick format, either in numbers of generations or in coalescent units. Simulated gene trees can be written in Newick format, and in a way that preserves information about their embedding within the species network. This embedding can be used for downstream purposes, such as to simulate species-specific processes like rate variation across species, or for other scenarios as illustrated in this note. This package should be useful for simulation studies and simulation-based inference methods. The software is available open source with documentation and a tutorial at https://github.com/cecileane/PhyloCoalSimulations.jl.
Collapse
Affiliation(s)
- John Fogg
- Department of Statistics, University of Wisconsin - Madison, WI, 53706, USA
| | - Elizabeth S Allman
- Department of Mathematics and Statistics, University of Alaska - Fairbanks, AK, 99775, USA
| | - Cécile Ané
- Department of Statistics, University of Wisconsin - Madison, WI, 53706, USA
- Department of Botany, University of Wisconsin - Madison, WI, 53706, USA
| |
Collapse
|
2
|
Casanellas M, Fernández-Sánchez J, Garrote-López M, Sabaté-Vidales M. Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages. Bull Math Biol 2023; 85:68. [PMID: 37310552 DOI: 10.1007/s11538-023-01167-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2022] [Accepted: 05/15/2023] [Indexed: 06/14/2023]
Abstract
Homogeneity across lineages is a general assumption in phylogenetics according to which nucleotide substitution rates are common to all lineages. Many phylogenetic methods relax this hypothesis but keep a simple enough model to make the process of sequence evolution more tractable. On the other hand, dealing successfully with the general case (heterogeneity of rates across lineages) is one of the key features of phylogenetic reconstruction methods based on algebraic tools. The goal of this paper is twofold. First, we present a new weighting system for quartets (ASAQ) based on algebraic and semi-algebraic tools, thus especially indicated to deal with data evolving under heterogeneous rates. This method combines the weights of two previous methods by means of a test based on the positivity of the branch lengths estimated with the paralinear distance. ASAQ is statistically consistent when applied to data generated under the general Markov model, considers rate and base composition heterogeneity among lineages and does not assume stationarity nor time-reversibility. Second, we test and compare the performance of several quartet-based methods for phylogenetic tree reconstruction (namely QFM, wQFM, quartet puzzling, weight optimization and Willson's method) in combination with several systems of weights, including ASAQ weights and other weights based on algebraic and semi-algebraic methods or on the paralinear distance. These tests are applied to both simulated and real data and support weight optimization with ASAQ weights as a reliable and successful reconstruction method that improves upon the accuracy of global methods (such as neighbor-joining or maximum likelihood) in the presence of long branches or on mixtures of distributions on trees.
Collapse
Affiliation(s)
- Marta Casanellas
- Institut de Matematiques de la UPC-BarcelonaTech (IMTech), Universitat Politècnica de Catalunya and Centre de Recerca Matemàtica, Av. Diagonal 647, 08028, Barcelona, Spain.
| | - Jesús Fernández-Sánchez
- Institut de Matematiques de la UPC-BarcelonaTech (IMTech), Universitat Politècnica de Catalunya and Centre de Recerca Matemàtica, Av. Diagonal 647, 08028, Barcelona, Spain
| | | | | |
Collapse
|
3
|
Allman ES, Banos H, Rhodes JA. Testing Multispecies Coalescent Simulators Using Summary Statistics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1613-1618. [PMID: 35617176 PMCID: PMC10183998 DOI: 10.1109/tcbb.2022.3177956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]
Abstract
As genomic-scale datasets motivate research on species tree inference, simulators of the multispecies coalescent (MSC) process have become essential for the testing and evaluation of new inference methods. However, the simulators themselves must be tested to ensure that they give valid samples. This work develops methods for checking whether a collection of gene trees is in accord with the MSC model on a given species tree. When applied to well-known simulators, we find that several give flawed samples. The tests presented are capable of validating both topological and metric properties of gene tree samples, and are implemented in a freely available R package MSCsimtester so that developers and users may easily apply them.
Collapse
|
4
|
Hill M, Legried B, Roch S. Species tree estimation under joint modeling of coalescence and duplication: Sample complexity of quartet methods. ANN APPL PROBAB 2022. [DOI: 10.1214/22-aap1799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Affiliation(s)
- Max Hill
- Department of Mathematics, University of Wisconsin–Madison
| | | | - Sebastien Roch
- Department of Mathematics, University of Wisconsin–Madison
| |
Collapse
|
5
|
Dasarathy G, Mossel E, Nowak R, Roch S. A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements. J Math Biol 2022; 84:36. [PMID: 35394192 PMCID: PMC9258723 DOI: 10.1007/s00285-022-01731-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2020] [Revised: 02/15/2022] [Accepted: 02/17/2022] [Indexed: 10/18/2022]
Abstract
Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene-the gene trees-often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated to each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard-but unsatisfactory-assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error-or completely avoid gene tree estimation in the first place. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of [Formula: see text]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we further generalize this result beyond this assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with [Formula: see text] species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.
Collapse
Affiliation(s)
- Gautam Dasarathy
- School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, USA
| | - Elchanan Mossel
- Department of Mathematics and IDSS, Massachusetts Institute of Technology, Cambridge, USA
| | - Robert Nowak
- Department of Electrical and Computer Engineering, University of Wisconsin, Madison, USA
| | - Sebastien Roch
- Department of Mathematics, University of Wisconsin, Madison, USA.
| |
Collapse
|
6
|
Identifiability of species network topologies from genomic sequences using the logDet distance. J Math Biol 2022; 84:35. [PMID: 35385988 DOI: 10.1007/s00285-022-01734-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2021] [Revised: 01/12/2022] [Accepted: 03/02/2022] [Indexed: 10/18/2022]
Abstract
Inference of network-like evolutionary relationships between species from genomic data must address the interwoven signals from both gene flow and incomplete lineage sorting. The heavy computational demands of standard approaches to this problem severely limit the size of datasets that may be analyzed, in both the number of species and the number of genetic loci. Here we provide a theoretical pointer to more efficient methods, by showing that logDet distances computed from genomic-scale sequences retain sufficient information to recover network relationships in the level-1 ultrametric case. This result is obtained under the Network Multispecies Coalescent model combined with a mixture of General Time-Reversible sequence evolution models across individual gene trees. It applies to both unlinked site data, such as for SNPs, and to sequence data in which many contiguous sites may have evolved on a common tree, such as concatenated gene sequences. Thus under standard stochastic models statistically justifiable inference of network relationships from sequences can be accomplished without consideration of individual genes or gene trees.
Collapse
|
7
|
Yourdkhani S, Allman ES, Rhodes JA. Parameter Identifiability for a Profile Mixture Model of Protein Evolution. J Comput Biol 2021; 28:570-586. [PMID: 33960831 DOI: 10.1089/cmb.2020.0315] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
A profile mixture (PM) model is a model of protein evolution, describing sequence data in which sites are assumed to follow many related substitution processes on a single evolutionary tree. The processes depend, in part, on different amino acid distributions, or profiles, varying over sites in aligned sequences. A fundamental question for any stochastic model, which must be answered positively to justify model-based inference, is whether the parameters are identifiable from the probability distribution they determine. Here, using algebraic methods, we show that a PM model has identifiable parameters under circumstances in which it is likely to be used for empirical analyses. In particular, for a tree relating 9 or more taxa, both the tree topology and all numerical parameters are generically identifiable when the number of profiles is less than 74.
Collapse
Affiliation(s)
- Samaneh Yourdkhani
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, Alaska, USA
| | - Elizabeth S Allman
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, Alaska, USA
| | - John A Rhodes
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, Alaska, USA
| |
Collapse
|
8
|
Bhattacharjee A, Bayzid MS. Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices. BMC Genomics 2020; 21:497. [PMID: 32689946 PMCID: PMC7370488 DOI: 10.1186/s12864-020-06892-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2020] [Accepted: 07/07/2020] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean) require complete distance matrices without any missing data. RESULTS We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and autoencoder based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternate distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data. CONCLUSIONS This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances .
Collapse
Affiliation(s)
- Ananya Bhattacharjee
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205 Bangladesh
- Department of Computer Science and Engineering, Eastern University, Dhaka, Bangladesh
| | - Md. Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205 Bangladesh
| |
Collapse
|
9
|
Allman ES, Baños H, Rhodes JA. NANUQ: a method for inferring species networks from gene trees under the coalescent model. Algorithms Mol Biol 2019; 14:24. [PMID: 31827592 PMCID: PMC6896299 DOI: 10.1186/s13015-019-0159-2] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2019] [Accepted: 11/07/2019] [Indexed: 01/07/2023] Open
Abstract
Species networks generalize the notion of species trees to allow for hybridization or other lateral gene transfer. Under the network multispecies coalescent model, individual gene trees arising from a network can have any topology, but arise with frequencies dependent on the network structure and numerical parameters. We propose a new algorithm for statistical inference of a level-1 species network under this model, from data consisting of gene tree topologies, and provide the theoretical justification for it. The algorithm is based on an analysis of quartets displayed on gene trees, combining several statistical hypothesis tests with combinatorial ideas such as a quartet-based intertaxon distance appropriate to networks, the NeighborNet algorithm for circular split systems, and the Circular Network algorithm for constructing a splits graph.
Collapse
|
10
|
Molloy EK, Warnow T. Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge. Algorithms Mol Biol 2019; 14:14. [PMID: 31360216 PMCID: PMC6642500 DOI: 10.1186/s13015-019-0151-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2019] [Accepted: 06/13/2019] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Divide-and-conquer methods, which divide the species set into overlapping subsets, construct a tree on each subset, and then combine the subset trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of such approaches. RESULTS In this paper, we introduce a divide-and-conquer approach that does not require supertree estimation: we divide the species set into pairwise disjoint subsets, construct a tree on each subset using a base method, and then combine the subset trees using a distance matrix. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of Neighbor Joining (NJ); thus, NJMerge can be viewed either as a method for improving traditional NJ or as a method for scaling the base method to larger datasets. We prove that NJMerge can be used to create divide-and-conquer pipelines that are statistically consistent under some models of evolution. We also report the results of an extensive simulation study evaluating NJMerge on multi-locus datasets with up to 1000 species. We found that NJMerge sometimes improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRAL-III, SVDquartets, and "concatenation" using RAxML) without sacrificing accuracy. Finally, although NJMerge can fail to return a tree, in our experiments, NJMerge failed on only 11 out of 2560 test cases. CONCLUSIONS Theoretical and empirical results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on Github (http://github.com/ekmolloy/njmerge).
Collapse
Affiliation(s)
- Erin K. Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801 USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801 USA
| |
Collapse
|