1
Li Q, Chan YB, Galtier N, Scornavacca C. The Effect of Copy Number Hemiplasy on Gene Family Evolution. Syst Biol 2024; 73:355-374. [PMID: 38330161 DOI: 10.1093/sysbio/syae007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 01/24/2024] [Accepted: 02/03/2024] [Indexed: 02/10/2024]
Abstract
The evolution of gene families is complex, involving gene-level evolutionary events such as gene duplication, horizontal gene transfer, and gene loss, and other processes such as incomplete lineage sorting (ILS). Because of this, topological differences often exist between gene trees and species trees. A number of models have been recently developed to explain these discrepancies, the most realistic of which attempt to consider both gene-level events and ILS. When unified in a single model, the interaction between ILS and gene-level events can cause polymorphism in gene copy number, which we refer to as copy number hemiplasy (CNH). In this paper, we extend the Wright-Fisher process to include duplications and losses over several species, and show that the probability of CNH for this process can be significant. We study how well two unified models (multilocus multispecies coalescent [MLMSC], which models CNH, and duplication, loss, and coalescence [DLCoal], which does not) approximate the Wright-Fisher process with duplication and loss. We then study the effect of CNH on gene family evolution by comparing MLMSC and DLCoal. We generate comparable gene trees under both models, showing significant differences in various summary statistics; most importantly, CNH reduces the number of gene copies greatly. If this is not taken into account, the traditional method of estimating duplication rates (by counting the number of gene copies) becomes inaccurate. The simulated gene trees are also used for species tree inference with the summary methods ASTRAL and ASTRAL-Pro, demonstrating that their accuracy, based on CNH-unaware simulations calibrated on real data, may have been overestimated.
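To make the idea of copy number polymorphism concrete, the following is a minimal sketch, not the authors' extended Wright-Fisher process. It assumes a single haploid population of fixed size, a neutral duplicate that arises in one individual, and no further duplication or loss; the function names and parameters are hypothetical. It estimates how often the duplicate is still segregating (neither fixed nor lost) after a given number of generations, which is the state in which a speciation event could leave descendant species with different copy numbers.

```python
import random


def duplicate_still_segregating(pop_size, generations, rng):
    """Return True if a new neutral duplicate is still polymorphic after `generations`.

    Assumption: haploid Wright-Fisher drift, duplicate starts in one individual.
    """
    count = 1  # number of individuals carrying the duplicate
    for _ in range(generations):
        if count == 0 or count == pop_size:
            return False  # duplicate already lost or fixed
        freq = count / pop_size
        # Binomial resampling of the next generation, one offspring at a time.
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
    return 0 < count < pop_size


def segregating_fraction(pop_size, generations, replicates, seed=0):
    """Monte Carlo estimate of the probability that the duplicate is still polymorphic."""
    rng = random.Random(seed)
    hits = sum(
        duplicate_still_segregating(pop_size, generations, rng)
        for _ in range(replicates)
    )
    return hits / replicates


if __name__ == "__main__":
    for g in (5, 20, 200):
        print(g, segregating_fraction(pop_size=20, generations=g, replicates=500))
```

The estimate shrinks as the number of generations grows relative to the population size, which is one intuition for why short branches and large populations would make copy number polymorphism more likely to persist across a speciation.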
Affiliation(s)
- Qiuyi Li
- School of Mathematics and Statistics/Melbourne Integrative Genomics, The University of Melbourne, Melbourne 3010, Australia
- Alibaba Cloud, Hangzhou, China
- Yao-Ban Chan
- School of Mathematics and Statistics/Melbourne Integrative Genomics, The University of Melbourne, Melbourne 3010, Australia
- Nicolas Galtier
- Institut des Sciences de l'Evolution, Université Montpellier, CNRS, IRD, EPHE, Montpellier 34095, France
- Celine Scornavacca
- Institut des Sciences de l'Evolution, Université Montpellier, CNRS, IRD, EPHE, Montpellier 34095, France
2
Sánchez KI, Diaz Huesa EG, Breitman MF, Avila LJ, Sites JW, Morando M. Complex Patterns of Diversification in the Gray Zone of Speciation: Model-Based Approaches Applied to Patagonian Liolaemid Lizards (Squamata: Liolaemus kingii clade). Syst Biol 2023; 72:739-752. [PMID: 37097104 DOI: 10.1093/sysbio/syad019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 03/28/2023] [Accepted: 04/11/2023] [Indexed: 04/26/2023]
Abstract
In this study we detangled the evolutionary history of the Patagonian lizard clade Liolaemus kingii, coupling dense geographic sampling and novel computational analytical approaches. We analyzed nuclear and mitochondrial data (restriction site-associated DNA sequencing and cytochrome b) to hypothesize and evaluate species limits, phylogenetic relationships, and demographic histories. We complemented these analyses with posterior predictive simulations to assess the fit of the genomic data to the multispecies coalescent model. We also employed a novel approach to time-calibrate a phylogenetic network. Our results show several instances of mito-nuclear discordance and consistent support for a reticulated history, supporting the view that the complex evolutionary history of the kingii clade is characterized by extensive gene flow and rapid diversification events. We discuss our findings in the contexts of the "gray zone" of speciation, phylogeographic patterns in the Patagonian region, and taxonomic outcomes. [Model adequacy; multispecies coalescent; multispecies network coalescent; phylogenomics; species delimitation.].
Affiliation(s)
- Kevin I Sánchez
- Instituto Patagónico para el Estudio de los Ecosistemas Continentales, Consejo Nacional de Investigaciones Científicas y Técnicas (IPEEC-CONICET), Puerto Madryn, U9120ACD, Argentina
- Emilce G Diaz Huesa
- Instituto de Diversidad y Evolución Austral, Consejo Nacional de Investigaciones Científicas y Técnicas (IDEAus-CONICET), Puerto Madryn, U9120ACD, Argentina
- María F Breitman
- Department of Biology and Environmental Science, Auburn University at Montgomery, Montgomery, 36117, USA
- Luciano J Avila
- Instituto Patagónico para el Estudio de los Ecosistemas Continentales, Consejo Nacional de Investigaciones Científicas y Técnicas (IPEEC-CONICET), Puerto Madryn, U9120ACD, Argentina
- Jack W Sites
- Department of Biology, Austin Peay State University, Clarksville, 37044, USA
- Mariana Morando
- Instituto Patagónico para el Estudio de los Ecosistemas Continentales, Consejo Nacional de Investigaciones Científicas y Técnicas (IPEEC-CONICET), Puerto Madryn, U9120ACD, Argentina
- Universidad Nacional de la Patagonia San Juan Bosco (UNPSJB), Puerto Madryn, U9120ACD, Argentina
3
Chan YB, Li Q, Scornavacca C. The large-sample asymptotic behaviour of quartet-based summary methods for species tree inference. J Math Biol 2022; 85:22. [PMID: 35976512 PMCID: PMC9385842 DOI: 10.1007/s00285-022-01786-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2022] [Revised: 06/08/2022] [Accepted: 07/14/2022] [Indexed: 12/03/2022]
Abstract
Summary methods seek to infer a species tree from a set of gene trees. A desirable property of such methods is that of statistical consistency; that is, the probability of inferring the wrong species tree (the error probability) tends to 0 as the number of input gene trees becomes large. A popular paradigm is to infer a species tree that agrees with the maximum number of quartets from the input set of gene trees; this has been proved to be statistically consistent under several models of gene evolution. In this paper, we study the asymptotic behaviour of the error probability of such methods in this limit, and show that it decays exponentially. For a 4-taxon species tree, we derive a closed form for the asymptotic behaviour in terms of the probability that the gene evolution process produces the correct topology. We also derive bounds for the sample complexity (the number of gene trees required to infer the true species tree with a given probability), which outperform existing bounds. We then extend our results to bounds for the asymptotic behaviour of the error probability for any species tree, and compare these to the true error probability for some model species trees using simulations.
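The 4-taxon case described above can be explored numerically. The sketch below is an illustration under stated assumptions, not the closed form derived in the paper: it assumes each gene tree independently shows the true quartet topology with probability p and each of the two alternatives with probability (1 - p)/2, that the method picks the most frequent topology, and that ties count as errors. The helper `msc_match_probability` uses the standard multispecies coalescent result for an internal branch of length t in coalescent units; all function names are hypothetical.

```python
from math import comb, exp


def msc_match_probability(t):
    """P(gene tree topology matches a 4-taxon species tree) under the MSC,
    for an internal branch of length t in coalescent units."""
    return 1.0 - (2.0 / 3.0) * exp(-t)


def quartet_error_probability(n, p):
    """P(the true topology is NOT the strict plurality among n gene trees).

    Assumption: ties are counted as errors.
    """
    q = (1.0 - p) / 2.0
    correct = 0.0
    for a in range(n + 1):            # gene trees showing the true topology
        for b in range(n - a + 1):    # gene trees showing the first alternative
            c = n - a - b             # gene trees showing the second alternative
            if a > b and a > c:
                correct += comb(n, a) * comb(n - a, b) * p**a * q**b * q**c
    return 1.0 - correct


if __name__ == "__main__":
    p = msc_match_probability(0.5)
    for n in (10, 50, 100, 200):
        print(n, quartet_error_probability(n, p))
```

Printing the error for increasing n on a log scale gives a quick visual check of the roughly exponential decay that the abstract describes; the exact rate and constants are the subject of the paper and are not reproduced here.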
Affiliation(s)
- Yao-Ban Chan
- School of Mathematics and Statistics / Melbourne Integrative Genomics, The University of Melbourne, Melbourne, 3010, VIC, Australia
- Qiuyi Li
- School of Mathematics and Statistics / Melbourne Integrative Genomics, The University of Melbourne, Melbourne, 3010, VIC, Australia
- Celine Scornavacca
- Institut des Sciences de l'Evolution, Université Montpellier, CNRS, EPHE, IRD, Montpellier, 34095, France
4
Carson J, Ledda A, Ferretti L, Keeling M, Didelot X. The bounded coalescent model: Conditioning a genealogy on a minimum root date. J Theor Biol 2022; 548:111186. [PMID: 35697144 DOI: 10.1016/j.jtbi.2022.111186] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/04/2022] [Revised: 05/05/2022] [Accepted: 06/02/2022] [Indexed: 01/27/2023]
Abstract
The coalescent model represents how individuals sampled from a population may have originated from a last common ancestor. The bounded coalescent model is obtained by conditioning the coalescent model such that the last common ancestor must have existed after a certain date. This conditioned model arises in a variety of applications, such as speciation, horizontal gene transfer or transmission analysis, and yet the bounded coalescent model has not been previously analysed in detail. Here we describe a new algorithm to simulate from this model directly, without resorting to rejection sampling. We show that this direct simulation algorithm is more computationally efficient than the rejection sampling approach. We also show how to calculate the probability of the last common ancestor occurring after a given date, which is required to compute the probability density of realisations under the bounded coalescent model. Our results are applicable in both the isochronous (when all samples have the same date) and heterochronous (where samples can have different dates) settings. We explore the effect of setting a bound on the date of the last common ancestor, and show that it affects a number of properties of the resulting phylogenies. All our methods are implemented in a new R package called BoundedCoalescent which is freely available online.
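For orientation, the rejection sampling baseline that the abstract contrasts with can be sketched as follows. This is not the direct simulation algorithm of the paper and not the API of the BoundedCoalescent R package; it is a minimal sketch assuming the isochronous setting, Kingman's coalescent with a constant effective population size `ne`, and time measured backwards from the sampling date. All names are hypothetical.

```python
import random


def coalescent_times(n, ne, rng):
    """Cumulative coalescence times (backwards in time) for n isochronous samples."""
    t = 0.0
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0 / ne  # pairwise coalescence rate with k lineages
        t += rng.expovariate(rate)
        times.append(t)
    return times


def bounded_coalescent_rejection(n, ne, bound, rng, max_tries=100000):
    """Draw coalescence times conditioned on the root being no older than `bound`.

    Plain rejection sampling: simulate unconditioned genealogies and keep the
    first whose time to the most recent common ancestor is within the bound.
    Returns the accepted times and the number of attempts used.
    """
    for attempt in range(1, max_tries + 1):
        times = coalescent_times(n, ne, rng)
        if times[-1] <= bound:
            return times, attempt
    raise RuntimeError("no sample accepted; the bound may be too tight")


if __name__ == "__main__":
    rng = random.Random(42)
    times, tries = bounded_coalescent_rejection(n=5, ne=1.0, bound=2.0, rng=rng)
    print(times, tries)
```

The number of attempts grows quickly as the bound tightens, which is the inefficiency that motivates a direct sampler.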
Affiliation(s)
- Jake Carson
- Mathematics Institute, University of Warwick, United Kingdom
- Alice Ledda
- HCAI, Fungal, AMR, AMU & Sepsis Division, UK Health Security Agency, United Kingdom
- Luca Ferretti
- Big Data Institute, University of Oxford, United Kingdom
- Matt Keeling
- Mathematics Institute, University of Warwick, United Kingdom
- Xavier Didelot
- Department of Statistics and School of Life Sciences, University of Warwick, United Kingdom
5
Yan Z, Smith ML, Du P, Hahn MW, Nakhleh L. Species Tree Inference Methods Intended to Deal with Incomplete Lineage Sorting Are Robust to the Presence of Paralogs. Syst Biol 2022; 71:367-381. [PMID: 34245291 PMCID: PMC8978208 DOI: 10.1093/sysbio/syab056] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 03/24/2020] [Revised: 06/23/2021] [Accepted: 06/30/2021] [Indexed: 11/24/2022]
Abstract
Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus are assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: what happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases, the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference. [Gene duplication and loss; incomplete lineage sorting; multispecies coalescent; orthology; paralogy.]
Affiliation(s)
- Zhi Yan
- Department of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005, USA
- Megan L Smith
- Department of Biology and Department of Computer Science, Indiana University, 1001 East Third Street, Bloomington, IN 47405, USA
- Peng Du
- Department of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005, USA
- Matthew W Hahn
- Department of Biology and Department of Computer Science, Indiana University, 1001 East Third Street, Bloomington, IN 47405, USA
- Luay Nakhleh
- Department of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005, USA
- Department of BioSciences, Rice University, 6100 Main Street, Houston, TX 77005, USA
6
Morel B, Schade P, Lutteropp S, Williams TA, Szöllősi GJ, Stamatakis A. SpeciesRax: A Tool for Maximum Likelihood Species Tree Inference from Gene Family Trees under Duplication, Transfer, and Loss. Mol Biol Evol 2022; 39:msab365. [PMID: 35021210 PMCID: PMC8826479 DOI: 10.1093/molbev/msab365] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 11/17/2022]
Abstract
Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modeling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated data sets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large data sets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31,612 gene families in 1 h using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.
Affiliation(s)
- Benoit Morel
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Paul Schade
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Sarah Lutteropp
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Tom A Williams
- School of Biological Sciences, University of Bristol, Bristol, United Kingdom
- Gergely J Szöllősi
- ELTE-MTA “Lendület” Evolutionary Genomics Research Group, Budapest, Hungary
- Department of Biological Physics, Eötvös University, Budapest, Hungary
- Institute of Evolution, Centre for Ecological Research, Budapest, Hungary
- Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
7
Yan Z, Cao Z, Liu Y, Ogilvie HA, Nakhleh L. Maximum Parsimony Inference of Phylogenetic Networks in the Presence of Polyploid Complexes. Syst Biol 2021; 71:706-720. [PMID: 34605924 PMCID: PMC9017653 DOI: 10.1093/sysbio/syab081] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/02/2020] [Revised: 09/26/2021] [Accepted: 09/29/2021] [Indexed: 12/18/2022]
Abstract
Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids and so were parsimony methods that could handle polyploidy more generally yet while excluding processes such as incomplete lineage sorting. In this article, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene tree topologies as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating evolutionary hypotheses in the form of phylogenetic networks are implemented and publicly available in the PhyloNet software package. [Incomplete lineage sorting; minimizing deep coalescences; multilabeled trees; multispecies network coalescent; phylogenetic networks; polyploidy.]
Affiliation(s)
- Zhi Yan
- Department of Computer Science, Rice University, Houston, 6100 Main Street, Houston, TX 77005, USA
- Zhen Cao
- Department of Computer Science, Rice University, Houston, 6100 Main Street, Houston, TX 77005, USA
- Yushu Liu
- Department of Computer Science, Rice University, Houston, 6100 Main Street, Houston, TX 77005, USA
- Huw A Ogilvie
- Department of Computer Science, Rice University, Houston, 6100 Main Street, Houston, TX 77005, USA
- Luay Nakhleh
- Department of Computer Science, Rice University, Houston, 6100 Main Street, Houston, TX 77005, USA
- Department of Biosciences, Rice University, Houston, 6100 Main Street, Houston, TX 77005, USA
8
Zhang C, Scornavacca C, Molloy EK, Mirarab S. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Mol Biol Evol 2020; 37:3292-3307. [PMID: 32886770 PMCID: PMC7751180 DOI: 10.1093/molbev/msaa139] [Citation(s) in RCA: 80] [Impact Index Per Article: 20.0] [Indexed: 12/18/2022]
Abstract
Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programing. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, University of California San Diego, San Diego, CA
- Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL
- Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA