1
Diamantidis D, Fan WTL, Birkner M, Wakeley J. Bursts of coalescence within population pedigrees whenever big families occur. Genetics 2024; 227:iyae030. [PMID: 38408329 DOI: 10.1093/genetics/iyae030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Revised: 01/23/2024] [Accepted: 02/18/2024] [Indexed: 02/28/2024]
Abstract
We consider a simple diploid population-genetic model with potentially high variability of offspring numbers among individuals. Specifically, against a backdrop of Wright-Fisher reproduction and no selection, there is an additional probability that a big family occurs, meaning that a pair of individuals has a number of offspring on the order of the population size. We study how the pedigree of the population generated under this model affects the ancestral genetic process of a sample of size two at a single autosomal locus without recombination. Our population model is of the type for which multiple-merger coalescent processes have been described. We prove that the conditional distribution of the pairwise coalescence time given the random pedigree converges to a limit law as the population size tends to infinity. This limit law may or may not be the usual exponential distribution of the Kingman coalescent, depending on the frequency of big families. But because it includes the number and times of big families, it differs from the usual multiple-merger coalescent models. The usual multiple-merger coalescent models are seen as describing the ancestral process marginal to, or averaging over, the pedigree. In the limiting ancestral process conditional on the pedigree, the intervals between big families can be modeled using the Kingman coalescent but each big family causes a discrete jump in the probability of coalescence. Analogous results should hold for larger samples and other population models. We illustrate these results with simulations and additional analysis, highlighting their implications for inference and understanding of multilocus data.
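The limiting picture described in this abstract (Kingman coalescence between big families, plus a discrete jump in coalescence probability at each big family) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: time is measured in units where the pairwise Kingman rate is 1, and each big family is given a hypothetical probability `p` that the two sampled lineages coalesce in it.

```python
import math

def survival_given_big_families(t, big_family_events):
    """P(no coalescence by time t), conditional on the big-family times.

    big_family_events: iterable of (time, p) pairs, where p is a
    hypothetical per-event coalescence probability (an assumption
    of this sketch, not a quantity defined in the abstract).
    """
    # Kingman part: rate-1 exponential waiting time between big families.
    surv = math.exp(-t)
    for time, p in big_family_events:
        if time <= t:
            # Discrete jump: the pair escapes this big family with prob. 1 - p.
            surv *= 1.0 - p
    return surv

# Two illustrative big families at times 0.5 and 1.5.
events = [(0.5, 0.2), (1.5, 0.1)]
for t in (0.25, 0.75, 2.0):
    print(t, survival_given_big_families(t, events))
```

With an empty event list the function reduces to the usual exponential survival function of the Kingman coalescent, which matches the abstract's remark that the limit may or may not be exponential depending on how often big families occur.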
Affiliation(s)
- Wai-Tong Louis Fan
  - Department of Mathematics, Indiana University, Bloomington, IN 47405, USA
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
- Matthias Birkner
  - Institut für Mathematik, Johannes-Gutenberg-Universität, 55099 Mainz, Germany
- John Wakeley
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
2
Peng J, Swofford DL, Kubatko L. Estimation of speciation times under the multispecies coalescent. Bioinformatics 2022; 38:5182-5190. [PMID: 36227122 DOI: 10.1093/bioinformatics/btac679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2020] [Revised: 06/02/2022] [Accepted: 10/10/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION: The multispecies coalescent model is now widely accepted as an effective model for incorporating variation in the evolutionary histories of individual genes into methods for phylogenetic inference from genome-scale data. However, because model-based analysis under the coalescent can be computationally expensive for large datasets, a variety of inferential frameworks and corresponding algorithms have been proposed for estimation of species-level phylogenies and associated parameters, including speciation times and effective population sizes.
RESULTS: We consider the problem of estimating the timing of speciation events along a phylogeny in a coalescent framework. We propose a maximum a posteriori estimator based on composite likelihood (MAPCL) for inferring these speciation times under a model of DNA sequence evolution for which exact site-pattern probabilities can be computed under the assumption of a constant θ throughout the species tree. We demonstrate that the MAPCL estimates are statistically consistent and asymptotically normally distributed, and we show how this result can be used to estimate their asymptotic variance. We also provide a more computationally efficient estimator of the asymptotic variance based on the non-parametric bootstrap. We evaluate the performance of our method using simulation and by application to an empirical dataset for gibbons.
AVAILABILITY AND IMPLEMENTATION: The method has been implemented in the PAUP* program, freely available at https://paup.phylosolutions.com for Macintosh, Windows and Linux operating systems.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jing Peng
  - Division of Biostatistics, The Ohio State University, Columbus, OH 43210, USA
- David L Swofford
  - Florida Museum of Natural History, University of Florida, Gainesville, FL 32611, USA
- Laura Kubatko
  - Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
  - Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210, USA
  - Mathematical Biosciences Institute, The Ohio State University, Columbus, OH 43210, USA
3
Lappo E, Rosenberg NA. Approximations to the expectations and variances of ratios of tree properties under the coalescent. G3 (Bethesda) 2022; 12:jkac205. [PMID: 35951748 PMCID: PMC9526068 DOI: 10.1093/g3journal/jkac205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2022] [Accepted: 08/01/2022] [Indexed: 11/14/2022]
Abstract
Properties of gene genealogies such as tree height (H), total branch length (L), total lengths of external (E) and internal (I) branches, mean length of basal branches (B), and the underlying coalescence times (T) can be used to study population-genetic processes and to develop statistical tests of population-genetic models. Uses of tree features in statistical tests often rely on predictions that depend on pairwise relationships among such features. For genealogies under the coalescent, we provide exact expressions for Taylor approximations to expected values and variances of ratios X_n/Y_n, for all 15 pairs among the variables {H_n, L_n, E_n, I_n, B_n, T_k}, considering n leaves and 2 ≤ k ≤ n. For expected values of the ratios, the approximations match closely with empirical simulation-based values. The approximations to the variances are not as accurate, but they generally match simulations in their trends as n increases. Although E_n has expectation 2 and H_n has expectation 2 in the limit as n → ∞, the approximation to the limiting expectation for E_n/H_n is not 1, instead equaling π²/3 − 2 ≈ 1.28987. The new approximations augment fundamental results in coalescent theory on the shapes of genealogical trees.
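The point that a ratio of expectations differs from the expectation of a ratio can be checked numerically with a small sketch. It assumes the standard Kingman coalescent in time units where the waiting time while k lineages remain has mean 2/(k(k-1)); the function names are illustrative and do not come from the paper.

```python
import math

def expected_T(k):
    """Mean waiting time while k lineages remain: 2 / (k (k - 1))."""
    return 2.0 / (k * (k - 1))

def expected_height(n):
    """E[H_n]: sum of mean waiting times for k = 2..n, equal to 2 (1 - 1/n)."""
    return sum(expected_T(k) for k in range(2, n + 1))

def expected_total_length(n):
    """E[L_n]: each interval with k lineages contributes k * T_k."""
    return sum(k * expected_T(k) for k in range(2, n + 1))

n = 1000
print(expected_height(n))        # close to 2 * (1 - 1/n)
print(expected_total_length(n))  # grows like 2 * log(n)

# The ratio of limiting expectations E[E_n] / E[H_n] is 2 / 2 = 1, whereas
# the limiting value quoted in the abstract for the approximation to
# E[E_n / H_n] is pi^2 / 3 - 2.
print(math.pi ** 2 / 3 - 2)
```

The last printed value reproduces the constant quoted in the abstract; the code does not derive the Taylor approximation itself.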
Affiliation(s)
- Egor Lappo
  - Department of Biology, Stanford University, Stanford, CA 94305, USA
- Noah A Rosenberg
  - Department of Biology, Stanford University, Stanford, CA 94305, USA
4
Bisschop G. Graph-based algorithms for Laplace transformed coalescence time distributions. PLoS Comput Biol 2022; 18:e1010532. [PMID: 36108047 PMCID: PMC9514611 DOI: 10.1371/journal.pcbi.1010532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2022] [Revised: 09/27/2022] [Accepted: 09/01/2022] [Indexed: 11/25/2022]
Abstract
Extracting information on the selective and demographic past of populations that is contained in samples of genome sequences requires a description of the distribution of the underlying genealogies. Using the Laplace transform, this distribution can be generated with a simple recursive procedure, regardless of model complexity. Assuming an infinite-sites mutation model, the probability of observing specific configurations of linked variants within small haplotype blocks can be recovered from the Laplace transform of the joint distribution of branch lengths. However, the repeated differentiation required to compute these probabilities has proven to be a serious computational bottleneck in earlier implementations. Here, I show that the state space diagram can be turned into a computational graph, allowing efficient evaluation of the Laplace transform by means of a graph traversal algorithm. This general algorithm can, for example, be applied to tabulate the likelihoods of mutational configurations in non-recombining blocks. This work provides a crucial speed up for existing composite likelihood approaches that rely on the joint distribution of branch lengths to fit isolation with migration models and estimate the parameters of selective sweeps. The associated software is available as an open-source Python library, agemo. For simple models of idealised populations, the process that generates the observed sequences can be mathematically described. For a given number of samples, we can enumerate all possible genealogies. We can even incorporate the impact of past events like population size reductions on the observed sequence variation. However, the number of possible genealogies will become very large, very fast. So, to extract information from the observed mutations, we need mathematical tools and efficient algorithms to use the information contained within the large collection of possible genealogies. 
The Laplace transform is one such mathematical tool that allows us to recursively generate the branch length distribution of all genealogies. Here I show how the transform can be represented as a graph. Using this nonlinear data structure, I define a general procedure to efficiently evaluate the associated mathematical expressions. And I further show how this can be used to speed up existing composite likelihood approaches to fit demographic models and estimate sweep parameters. The associated software, agemo, has a well-documented Python API and has been designed with extensibility in mind, making it an ideal back-end for many other inference approaches in population genetics.
Affiliation(s)
- Gertjan Bisschop
  - University of Edinburgh, Institute of Evolution and Ecology, Edinburgh, United Kingdom
5
Mahbub S, Sawmya S, Saha A, Reaz R, Rahman MS, Bayzid MS. Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data. J Comput Biol 2022; 29:1156-1172. [PMID: 36048555 DOI: 10.1089/cmb.2022.0212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present Quartet based Gene tree Imputation using Deep Learning (QT-GILD), an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing, which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique needing no explicit modeling of the subject system or reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical datasets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in the species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. Therefore, QT-GILD advances the state-of-the-art in species tree estimation from gene trees in the face of missing data.
Affiliation(s)
- Sazan Mahbub
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
  - Department of Computer Science, University of Maryland, College Park, Maryland, USA
- Shashata Sawmya
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Arpita Saha
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Rezwana Reaz
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- M Sohel Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Md Shamsuzzoha Bayzid
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
6
Kück P, Romahn J, Meusemann K. Pitfalls of the site-concordance factor (sCF) as measure of phylogenetic branch support. NAR Genom Bioinform 2022; 4:lqac064. [PMID: 36128424 PMCID: PMC9477076 DOI: 10.1093/nargab/lqac064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2022] [Revised: 08/10/2022] [Accepted: 08/17/2022] [Indexed: 12/01/2022]
Abstract
Confidence measures of branch reliability play an important role in phylogenetics because they make it possible to identify trees or parts of a tree that are well supported by the data and thus adequate to serve as a basis for evolutionary inference of biological systems. Unreliable branch relationships in phylogenetic analyses are of concern because of their potential to represent incorrect relationships of interest among more reliable branch relationships. The site-concordance factor implemented in the IQ-TREE package is a recently introduced heuristic solution to the problem of identifying unreliable branch relationships on the basis of quartets. We test the performance of the site-concordance measure with simple examples based on simulated data and designed to study its behaviour in branch support estimates related to different degrees of branch length heterogeneity in a ten-sequence tree. Our results show that, particularly in cases of relationships with heterogeneous branch lengths, site-concordance measures may be misleading. We therefore argue that the maximum parsimony optimality criterion currently used by the site-concordance measure may sometimes be poorly suited to evaluate branch support and that the scores reported by the site-concordance factor should not be considered as reliable.
Affiliation(s)
- Patrick Kück
  - Centre for Molecular Biodiversity Research, Leibniz Institute for the Analysis of Biodiversity Change, Adenauerallee 160, 53113 Bonn, Germany
- Juliane Romahn
  - Centre for Molecular Biodiversity Research, Leibniz Institute for the Analysis of Biodiversity Change, Adenauerallee 160, 53113 Bonn, Germany
  - LOEWE Centre for Translational Biodiversity Genomics (LOEWE-TBG), Senckenberganlage 25, 60325 Frankfurt am Main, Germany
  - Senckenberg Society for Nature Research, Senckenberganlage 25, 60325 Frankfurt am Main, Germany
- Karen Meusemann
  - Directorate, Leibniz Institute for the Analysis of Biodiversity Change, Adenauerallee 160, 53113 Bonn, Germany
7
Pang XX, Zhang DY. Impact of Ghost Introgression on Coalescent-based Species Tree Inference and Estimation of Divergence Time. Syst Biol 2022; 72:35-49. [PMID: 35799362 DOI: 10.1093/sysbio/syac047] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/17/2022] [Revised: 06/25/2022] [Accepted: 07/05/2022] [Indexed: 11/15/2022]
Abstract
The species studied in any evolutionary investigation generally constitute a small proportion of all the species currently existing or that have gone extinct. It is therefore likely that introgression, which is widespread across the tree of life, involves "ghosts," i.e., unsampled, unknown, or extinct lineages. However, the impact of ghost introgression on estimations of species trees has rarely been studied and is poorly understood. Here, we use mathematical analysis and simulations to examine the robustness of species tree methods based on the multispecies coalescent model to introgression from a ghost or extant lineage. We found that many results originally obtained for introgression between extant species can easily be extended to ghost introgression, such as the strongly interactive effects of incomplete lineage sorting (ILS) and introgression on the occurrence of anomalous gene trees (AGTs). The relative performance of the summary species tree method (ASTRAL) and the full-likelihood method (*BEAST) varies under different introgression scenarios, with the former being more robust to gene flow between non-sister species, whereas the latter performs better under certain conditions of ghost introgression. When an outgroup ghost (defined as a lineage that diverged before the most basal species under investigation) acts as the donor of the introgressed genes, the time of root divergence among the investigated species was generally overestimated, whereas ingroup introgression, as commonly perceived, can only lead to underestimation. In many cases of ingroup introgression that may or may not involve ghost lineages, the stronger the ILS, the higher the accuracy achieved in estimating the time of root divergence, although the topology of the species tree is more prone to be biased by the effect of introgression.
Affiliation(s)
- Xiao-Xu Pang
  - State Key Laboratory of Earth Surface Processes and Resource Ecology and Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing 100875, China
- Da-Yong Zhang
  - State Key Laboratory of Earth Surface Processes and Resource Ecology and Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing 100875, China
8
Tabatabaee Y, Sarker K, Warnow T. Quintet Rooting: rooting species trees under the multi-species coalescent model. Bioinformatics 2022; 38:i109-i117. [PMID: 35758805 PMCID: PMC9236578 DOI: 10.1093/bioinformatics/btac224] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
Abstract
Motivation: Rooted species trees are a basic model with multiple applications throughout biology, including understanding adaptation, biodiversity, phylogeography and co-evolution. Because most species tree estimation methods produce unrooted trees, methods for rooting these trees have been developed. However, most rooting methods either rely on prior biological knowledge or assume that evolution is close to clock-like, which is not usually the case. Furthermore, most prior rooting methods do not account for biological processes that create discordance between gene trees and species trees.
Results: We present Quintet Rooting (QR), a method for rooting species trees based on a proof of identifiability of the rooted species tree under the multi-species coalescent model established by Allman, Degnan and Rhodes (J. Math. Biol., 2011). We show that QR is generally more accurate than other rooting methods, except under extreme levels of gene tree estimation error.
Availability and implementation: Quintet Rooting is available in open source form at https://github.com/ytabatabaee/Quintet-Rooting. The simulated datasets used in this study are from a prior study and are available at https://www.ideals.illinois.edu/handle/2142/55319. The biological dataset used in this study is also from a prior study and is available at http://gigadb.org/dataset/101041.
Contact: warnow@illinois.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yasamin Tabatabaee
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Kowshika Sarker
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
9
Willson J, Roddur MS, Liu B, Zaharias P, Warnow T. DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition. Syst Biol 2022; 71:610-629. [PMID: 34450658 PMCID: PMC9016570 DOI: 10.1093/sysbio/syab070] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 05/24/2021] [Revised: 08/18/2021] [Accepted: 08/23/2021] [Indexed: 11/21/2022]
Abstract
Species tree inference from gene family trees is a significant problem in computational biology. However, gene tree heterogeneity, which can be caused by several factors including gene duplication and loss, makes the estimation of species trees very challenging. While several species tree estimation methods have been introduced in recent years to specifically address gene tree heterogeneity due to gene duplication and loss (such as DupTree, FastMulRFS, ASTRAL-Pro, and SpeciesRax), many incur high cost in terms of both running time and memory. We introduce a new approach, DISCO, that decomposes the multi-copy gene family trees into many single copy trees, which allows methods previously designed for species tree inference in a single copy gene tree context to be used. We prove that using DISCO with ASTRAL (i.e., ASTRAL-DISCO) is statistically consistent under the GDL model, provided that ASTRAL-Pro correctly roots and tags each gene family tree. We evaluate DISCO paired with different methods for estimating species trees from single copy genes (e.g., ASTRAL, ASTRID, and IQ-TREE) under a wide range of model conditions, and establish that high accuracy can be obtained even when ASTRAL-Pro is not able to correctly root and tag the gene family trees. We also compare results using MI, an alternative decomposition strategy from Yang Y. and Smith S.A. (2014), and find that DISCO provides better accuracy, most likely as a result of covering more of the gene family tree leafset in the output decomposition. [Concatenation analysis; gene duplication and loss; species tree inference; summary method.]
Affiliation(s)
- James Willson
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Mrinmoy Saha Roddur
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Baqiao Liu
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Paul Zaharias
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
10
Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, Zhu S, Eldon B, Ellerman EC, Galloway JG, Gladstein AL, Gorjanc G, Guo B, Jeffery B, Kretzschumar WW, Lohse K, Matschiner M, Nelson D, Pope NS, Quinto-Cortés CD, Rodrigues MF, Saunack K, Sellinger T, Thornton K, van Kemenade H, Wohns AW, Wong Y, Gravel S, Kern AD, Koskela J, Ralph PL, Kelleher J. Efficient ancestry and mutation simulation with msprime 1.0. Genetics 2022; 220:iyab229. [PMID: 34897427 PMCID: PMC9176297 DOI: 10.1093/genetics/iyab229] [Citation(s) in RCA: 104] [Impact Index Per Article: 52.0] [Received: 09/22/2021] [Accepted: 12/03/2021] [Indexed: 11/13/2022]
Abstract
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
Affiliation(s)
- Franz Baumdicker
  - Cluster of Excellence “Controlling Microbes to Fight Infections”, Mathematical and Computational Population Genetics, University of Tübingen, 72076 Tübingen, Germany
- Gertjan Bisschop
  - Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Daniel Goldstein
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Graham Gower
  - Lundbeck GeoGenetics Centre, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark
- Aaron P Ragsdale
  - Department of Integrative Biology, University of Wisconsin–Madison, Madison, WI 53706, USA
- Georgia Tsambos
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia
- Sha Zhu
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Bjarki Eldon
  - Leibniz Institute for Evolution and Biodiversity Science, Museum für Naturkunde, Berlin 10115, Germany
- Jared G Galloway
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA
- Ariella L Gladstein
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7264, USA
  - Embark Veterinary, Inc., Boston, MA 02111, USA
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Bing Guo
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Warren W Kretzschumar
  - Center for Hematology and Regenerative Medicine, Karolinska Institute, 141 83 Huddinge, Sweden
- Konrad Lohse
  - Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Dominic Nelson
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Nathaniel S Pope
  - Department of Entomology, Pennsylvania State University, State College, PA 16802, USA
- Consuelo D Quinto-Cortés
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Unit of Advanced Genomics, CINVESTAV, Irapuato, Mexico
- Murillo F Rodrigues
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Thibaut Sellinger
  - Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, 85354 Freising, Germany
- Kevin Thornton
  - Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA
- Anthony W Wohns
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Simon Gravel
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Andrew D Kern
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Jere Koskela
  - Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
- Peter L Ralph
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
  - Department of Mathematics, University of Oregon, Eugene, OR 97403-5289, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
11
Wohns AW, Wong Y, Jeffery B, Akbari A, Mallick S, Pinhasi R, Patterson N, Reich D, Kelleher J, McVean G. A unified genealogy of modern and ancient genomes. Science 2022; 375:eabi8264. [PMID: 35201891 PMCID: PMC10027547 DOI: 10.1126/science.abi8264] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Indexed: 01/13/2023]
Abstract
The sequencing of modern and ancient genomes from around the world has revolutionized our understanding of human history and evolution. However, the problem of how best to characterize ancestral relationships from the totality of human genomic variation remains unsolved. Here, we address this challenge with nonparametric methods that enable us to infer a unified genealogy of modern and ancient humans. This compact representation of multiple datasets explores the challenges of missing and erroneous data and uses ancient samples to constrain and date relationships. We demonstrate the power of the method to recover relationships between individuals and populations as well as to identify descendants of ancient samples. Finally, we introduce a simple nonparametric estimator of the geographical location of ancestors that recapitulates key events in human history.
Collapse
Affiliation(s)
- Anthony Wilder Wohns
- Broad Institute of MIT and Harvard; Cambridge, MA 02142, USA
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford; Oxford OX3 7LF, UK
| | - Yan Wong
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford; Oxford OX3 7LF, UK
| | - Ben Jeffery
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford; Oxford OX3 7LF, UK
| | - Ali Akbari
- Broad Institute of MIT and Harvard; Cambridge, MA 02142, USA
- Department of Human Evolutionary Biology, Harvard University; Cambridge, MA 02138, USA
- Department of Genetics, Harvard Medical School; Boston, MA 02115, USA
| | - Swapan Mallick
- Broad Institute of MIT and Harvard; Cambridge, MA 02142, USA
- Howard Hughes Medical Institute; Boston, MA 02115, USA
| | - Ron Pinhasi
- Department of Evolutionary Anthropology, University of Vienna; 1090 Vienna, Austria
| | - Nick Patterson
- Broad Institute of MIT and Harvard; Cambridge, MA 02142, USA
- Department of Human Evolutionary Biology, Harvard University; Cambridge, MA 02138, USA
- Howard Hughes Medical Institute; Boston, MA 02115, USA
- Department of Genetics, Harvard Medical School; Boston, MA 02115, USA
| | - David Reich
- Broad Institute of MIT and Harvard; Cambridge, MA 02142, USA
- Department of Human Evolutionary Biology, Harvard University; Cambridge, MA 02138, USA
- Howard Hughes Medical Institute; Boston, MA 02115, USA
- Department of Genetics, Harvard Medical School; Boston, MA 02115, USA
| | - Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford; Oxford OX3 7LF, UK
| | - Gil McVean
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford; Oxford OX3 7LF, UK
- Corresponding author.
12
|
Liu B, Warnow T. Scalable Species Tree Inference with External Constraints. J Comput Biol 2022; 29:664-678. [PMID: 35196115 DOI: 10.1089/cmb.2021.0543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
Species tree inference is a basic step in biological discovery, but discordance between gene trees creates analytical challenges and large data sets create computational challenges. Although there is generally some information available about the species trees that could be used to speed up the estimation, only one species tree estimation method that addresses gene tree discordance (ASTRAL-J, a recent development in the ASTRAL family of methods) is able to use this information. Here we describe two new methods, NJst-J and FASTRAL-J, that can estimate the species tree given partial knowledge of the species tree in the form of a nonbinary unrooted constraint tree. We show that both NJst-J and FASTRAL-J are much faster than ASTRAL-J and we prove that all three methods are statistically consistent under the multispecies coalescent model subject to this constraint. Our extensive simulation study shows that both FASTRAL-J and NJst-J provide advantages over ASTRAL-J: both are faster (and NJst-J is particularly fast), and FASTRAL-J is generally at least as accurate as ASTRAL-J. An analysis of the Avian Phylogenomics Project data set with 48 species and 14,446 genes presents additional evidence of the value of FASTRAL-J over ASTRAL-J (and both over ASTRAL), with dramatic reductions in running time (20 hours for default ASTRAL, and minutes or seconds for ASTRAL-J and FASTRAL-J, respectively).
Affiliation(s)
- Baqiao Liu
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
13
|
Yan Z, Smith ML, Du P, Hahn MW, Nakhleh L. Species Tree Inference Methods Intended to Deal with Incomplete Lineage Sorting Are Robust to the Presence of Paralogs. Syst Biol 2022; 71:367-381. [PMID: 34245291 PMCID: PMC8978208 DOI: 10.1093/sysbio/syab056] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 03/24/2020] [Revised: 06/23/2021] [Accepted: 06/30/2021] [Indexed: 11/24/2022] Open
Abstract
Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus are assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: what happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases, the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference. [Gene duplication and loss; incomplete lineage sorting; multispecies coalescent; orthology; paralogy.]
Affiliation(s)
- Zhi Yan
- Department of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005, USA
| | - Megan L Smith
- Department of Biology and Department of Computer Science, Indiana University, 1001 East Third Street, Bloomington, IN 47405, USA
| | - Peng Du
- Department of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005, USA
| | - Matthew W Hahn
- Department of Biology and Department of Computer Science, Indiana University, 1001 East Third Street, Bloomington, IN 47405, USA
| | - Luay Nakhleh
- Department of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005, USA
- Department of BioSciences, Rice University, 6100 Main Street, Houston, TX 77005, USA
14
|
Hibbins MS, Hahn MW. Phylogenomic approaches to detecting and characterizing introgression. Genetics 2022; 220:iyab173. [PMID: 34788444 PMCID: PMC9208645 DOI: 10.1093/genetics/iyab173] [Citation(s) in RCA: 51] [Impact Index Per Article: 25.5] [Received: 07/16/2021] [Accepted: 10/02/2021] [Indexed: 12/26/2022] Open
Abstract
Phylogenomics has revealed the remarkable frequency with which introgression occurs across the tree of life. These discoveries have been enabled by the rapid growth of methods designed to detect and characterize introgression from whole-genome sequencing data. A large class of phylogenomic methods makes use of data across species to infer and characterize introgression based on expectations from the multispecies coalescent. These methods range from simple tests, such as the D-statistic, to model-based approaches for inferring phylogenetic networks. Here, we provide a detailed overview of the various signals that different modes of introgression are expected to leave in the genome, and how current methods are designed to detect them. We discuss the strengths and pitfalls of these approaches and identify areas for future development, highlighting the different signals of introgression, and the power of each method to detect them. We conclude with a discussion of current challenges in inferring introgression and how they could potentially be addressed.
Affiliation(s)
- Mark S Hibbins
- Department of Biology, Indiana University, Bloomington, IN 47405, USA
| | - Matthew W Hahn
- Department of Biology, Indiana University, Bloomington, IN 47405, USA
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
15
|
Jiao X, Flouri T, Yang Z. Multispecies coalescent and its applications to infer species phylogenies and cross-species gene flow. Natl Sci Rev 2022; 8:nwab127. [PMID: 34987842 PMCID: PMC8692950 DOI: 10.1093/nsr/nwab127] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 06/02/2021] [Revised: 07/10/2021] [Accepted: 07/11/2021] [Indexed: 02/06/2023] Open
Abstract
The multispecies coalescent (MSC) is the extension of the single-population coalescent model to multiple species. It integrates the phylogenetic process of species divergence and the population genetic process of coalescence, and provides a powerful framework for a number of inference problems using genomic sequence data from multiple species, including estimation of species divergence times and population sizes, estimation of species trees accommodating discordant gene trees, inference of cross-species gene flow, and species delimitation. In this review, we introduce the major features of the MSC model, discuss full-likelihood and heuristic methods of species tree estimation, and summarize recent methodological advances in inference of cross-species gene flow. We discuss the statistical and computational challenges in the field and research directions where breakthroughs may be likely in the next few years.
Affiliation(s)
- Xiyun Jiao
- Department of Genetics, Evolution and Environment, University College London, London WC1E 6BT, UK
| | - Tomáš Flouri
- Department of Genetics, Evolution and Environment, University College London, London WC1E 6BT, UK
| | - Ziheng Yang
- Department of Genetics, Evolution and Environment, University College London, London WC1E 6BT, UK
16
|
Hibbins MS, Hahn MW. The effects of introgression across thousands of quantitative traits revealed by gene expression in wild tomatoes. PLoS Genet 2021; 17:e1009892. [PMID: 34748547 PMCID: PMC8601620 DOI: 10.1371/journal.pgen.1009892] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 07/27/2021] [Revised: 11/18/2021] [Accepted: 10/18/2021] [Indexed: 01/13/2023] Open
Abstract
It is now understood that introgression can serve as a powerful evolutionary force, providing genetic variation that can shape the course of trait evolution. Introgression also induces a shared evolutionary history that is not captured by the species phylogeny, potentially complicating evolutionary analyses that use a species tree. Such analyses are often carried out on gene expression data across species, where the measurement of thousands of trait values allows for powerful inferences while controlling for shared phylogeny. Here, we present a Brownian motion model for quantitative trait evolution under the multispecies network coalescent framework, demonstrating that introgression can generate apparently convergent patterns of evolution when averaged across thousands of quantitative traits. We test our theoretical predictions using whole-transcriptome expression data from ovules in the wild tomato genus Solanum. Examining two sub-clades that both have evidence for post-speciation introgression, but that differ substantially in its magnitude, we find patterns of evolution that are consistent with histories of introgression in both the sign and magnitude of ovule gene expression. Additionally, in the sub-clade with a higher rate of introgression, we observe a correlation between local gene tree topology and expression similarity, implicating a role for introgressed cis-regulatory variation in generating these broad-scale patterns. Our results reveal a general role for introgression in shaping patterns of variation across many thousands of quantitative traits, and provide a framework for testing for these effects using simple model-informed predictions.
It is now known from studying large genetic datasets that species often hybridize and cross with each other over many generations, a phenomenon known as introgression. Introgression introduces new genetic variation into a population, and this variation can cause traits to be shared among the introgressing species. When researchers study the evolution of trait variation among species, this source of trait sharing is rarely accounted for. Here, we present a statistical model of the effects of introgression on trait variation. This model predicts that, when averaged across many thousands of traits, introgressing species are consistently more similar than expected from standard approaches. Researchers studying gene expression often consider the expression of many thousands of genes, making this a case where the expected effects of introgression are likely to manifest. We tested our model prediction using ovule gene expression data from the wild tomato genus Solanum, in two groups of species with evidence of historical introgression. We found that patterns of expression similarity in both groups are consistent with their histories of introgression and the predictions from our model. Our results highlight the importance of accounting for introgression as a source of trait variation among species.
Affiliation(s)
- Mark S. Hibbins
- Department of Biology, Indiana University, Bloomington, Indiana, United States of America
| | - Matthew W. Hahn
- Department of Biology, Indiana University, Bloomington, Indiana, United States of America
- Department of Computer Science, Indiana University, Bloomington, Indiana, United States of America
17
|
Prabh N, Tautz D. Frequent lineage-specific substitution rate changes support an episodic model for protein evolution. G3-GENES GENOMES GENETICS 2021; 11:6372692. [PMID: 34542594 PMCID: PMC8664490 DOI: 10.1093/g3journal/jkab333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2021] [Accepted: 09/13/2021] [Indexed: 12/04/2022]
Abstract
Since the inception of the molecular clock model for sequence evolution, the investigation of protein divergence has revolved around the question of a more or less constant change of amino acid sequences, with specific overall rates for each family. Although anomalies in clock-like divergence are well known, the assumption of a constant decay rate for a given protein family is usually taken as the null model for protein evolution. However, systematic tests of this null model at a genome-wide scale have lagged behind, despite the databases’ enormous growth. We focus here on divergence rate comparisons between very closely related lineages since this allows clear orthology assignments by synteny and reliable alignments, which are crucial for determining substitution rate changes. We generated a high-confidence dataset of syntenic orthologs from four ape species, including humans. We find that despite the appearance of an overall clock-like substitution pattern, several hundred protein families show lineage-specific acceleration and deceleration in divergence rates, or combinations of both in different lineages. Hence, our analysis uncovers a rather dynamic history of substitution rate changes, even between these closely related lineages, implying that one should expect that a large fraction of proteins will have had a history of episodic rate changes in deeper phylogenies. Furthermore, each of the lineages has a separate set of particularly fast diverging proteins. The genes with the highest percentage of branch-specific substitutions are ADCYAP1 in the human lineage (9.7%), CALU in chimpanzees (7.1%), SLC39A14 in the internal branch leading to humans and chimpanzees (4.1%), RNF128 in gorillas (9%), and S100Z in gibbons (15.2%). The mutational pattern in ADCYAP1 suggests a biased mutation process, possibly through asymmetric gene conversion effects. We conclude that a null model of constant change can be problematic for predicting the evolutionary trajectories of individual proteins.
Affiliation(s)
- Neel Prabh
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, August-Thienemann-Str. 2, 24306 Plön, Germany
| | - Diethard Tautz
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, August-Thienemann-Str. 2, 24306 Plön, Germany
18
|
Farah IT, Islam MM, Zinat KT, Rahman AH, Bayzid MS. Species tree estimation from gene trees by minimizing deep coalescence and maximizing quartet consistency: a comparative study and the presence of pseudo species tree terraces. Syst Biol 2021; 70:1213-1231. [PMID: 33844023 DOI: 10.1093/sysbio/syab026] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/31/2020] [Revised: 03/25/2021] [Accepted: 03/29/2021] [Indexed: 11/14/2022] Open
Abstract
Species tree estimation from multi-locus datasets is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed which estimate gene trees and then combine the gene trees to estimate a species tree by optimizing one of several optimality criteria. In this study, we have extended and adapted the concept of phylogenetic terraces to species tree estimation by "summarizing" a set of gene trees, where multiple species trees with distinct topologies may have exactly the same optimality score (e.g., quartet score or extra lineage score). We particularly investigated the presence and impacts of equally optimal trees in species tree estimation from multi-locus data using summary methods by taking ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. We present a comprehensive comparative study of these two optimality criteria. Our experiments, on a collection of datasets simulated under ILS, indicate that MDC may achieve a quartet consistency score competitive with or identical to that of MQC, but could be significantly worse than MQC in terms of tree accuracy, demonstrating the presence and impacts of equally optimal species trees. This is the first known study that provides the conditions for the datasets to have equally optimal trees in the context of phylogenomic inference using summary methods.
Affiliation(s)
- Ishrat Tanzila Farah
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
| | - Md Muktadirul Islam
- Applied Statistics and Data Science (ASDS), Department of Statistics Jahangirnagar University Dhaka-1342, Bangladesh
| | - Kazi Tasnim Zinat
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
- Department of Computer Science University of Maryland, College Park, Maryland, USA
| | - Atif Hasan Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology Dhaka-1205, Bangladesh
19
|
Allman ES, Mitchell JD, Rhodes JA. Gene tree discord, simplex plots, and statistical tests under the coalescent. Syst Biol 2021; 71:929-942. [PMID: 33560348 DOI: 10.1093/sysbio/syab008] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 02/12/2020] [Revised: 01/31/2021] [Accepted: 02/03/2021] [Indexed: 02/06/2023] Open
Abstract
A simple graphical device, the simplex plot of quartet concordance factors, is introduced to aid in the exploration of a collection of gene trees on a common set of taxa. A single plot summarizes all gene tree discord, and allows for visual comparison to the expected discord from the multispecies coalescent model (MSC) of incomplete lineage sorting on a species tree. A formal statistical procedure is described that can quantify the deviation from expectation for each subset of four taxa, suggesting when the data is not in accord with the MSC, and thus that either gene tree inference error is substantial or a more complex model such as that on a network may be required. If the collection of gene trees is in accord with the MSC, the plots reveal when substantial incomplete lineage sorting is present. Applications to both simulated and empirical multilocus data sets illustrate the insights provided.
Affiliation(s)
- Elizabeth S Allman
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99709, USA
| | - Jonathan D Mitchell
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99709, USA
- Unité Bioinformatique Evolutive, C3BI USR 3756, Institut Pasteur & CNRS, Paris, France
| | - John A Rhodes
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99709, USA
20
|
|
21
|
Zhu T, Yang Z. Complexity of the simplest species tree problem. Mol Biol Evol 2021; 38:3993-4009. [PMID: 33492385 PMCID: PMC8382899 DOI: 10.1093/molbev/msab009] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 07/18/2020] [Revised: 01/04/2021] [Accepted: 01/13/2021] [Indexed: 02/06/2023] Open
Abstract
The multispecies coalescent model provides a natural framework for species tree estimation accounting for gene-tree conflicts. Although a number of species tree methods under the multispecies coalescent have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here, we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood, and maximum likelihood. We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case, major differences exist among the methods. Full-likelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes, whereas these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full likelihood methods of species tree estimation.
Affiliation(s)
- Tianqi Zhu
- Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Ziheng Yang
- Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- Department of Genetics, University College London, Gower Street, London WC1E 6BT, UK
22
|
Truszkowski J, Scornavacca C, Pardi F. Computing the probability of gene trees concordant with the species tree in the multispecies coalescent. Theor Popul Biol 2020; 137:22-31. [PMID: 33333117 DOI: 10.1016/j.tpb.2020.12.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2020] [Revised: 12/04/2020] [Accepted: 12/08/2020] [Indexed: 10/22/2022]
Abstract
The multispecies coalescent process models the genealogical relationships of genes sampled from several species, enabling useful predictions about phenomena such as the discordance between a gene tree and the species phylogeny due to incomplete lineage sorting. Conversely, knowledge of large collections of gene trees can inform us about several aspects of the species phylogeny, such as its topology and ancestral population sizes. A fundamental open problem in this context is how to efficiently compute the probability of a gene tree topology, given the species phylogeny. Although a number of algorithms for this task have been proposed, they either produce approximate results, or, when they are exact, they do not scale to large data sets. In this paper, we present some progress towards exact and efficient computation of the probability of a gene tree topology. We provide a new algorithm that, given a species tree and the number of genes sampled for each species, calculates the probability that the gene tree topology will be concordant with the species tree. Moreover, we provide an algorithm that computes the probability of any specific gene tree topology concordant with the species tree. Both algorithms run in polynomial time and have been implemented in Python. Experiments show that they are able to analyze data sets where thousands of genes are sampled in a matter of minutes to hours.
Affiliation(s)
| | - Celine Scornavacca
- ISEM, CNRS, Université Montpellier, Montpellier, France; Institut de Biologie Computationnelle, Montpellier, France
| | - Fabio Pardi
- LIRMM, CNRS, Université Montpellier, Montpellier, France; Institut de Biologie Computationnelle, Montpellier, France.
23
|
Hibbins MS, Gibson MJS, Hahn MW. Determining the probability of hemiplasy in the presence of incomplete lineage sorting and introgression. eLife 2020; 9:e63753. [PMID: 33345772 PMCID: PMC7800383 DOI: 10.7554/elife.63753] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 10/06/2020] [Accepted: 12/18/2020] [Indexed: 12/11/2022] Open
Abstract
The incongruence of character states with phylogenetic relationships is often interpreted as evidence of convergent evolution. However, trait evolution along discordant gene trees can also generate these incongruences, a phenomenon known as hemiplasy. Classic comparative methods do not account for discordance, resulting in incorrect inferences about the number, timing, and direction of trait transitions. Biological sources of discordance include incomplete lineage sorting (ILS) and introgression, but only ILS has received theoretical consideration in the context of hemiplasy. Here, we present a model that shows introgression makes hemiplasy more likely, such that methods that account for ILS alone will be conservative. We also present a method and software (HeIST) for making statistical inferences about the probability of hemiplasy and homoplasy in large datasets that contain both ILS and introgression. We apply our methods to two empirical datasets, finding that hemiplasy is likely to contribute to the observed trait incongruences in both.
Affiliation(s)
- Mark S Hibbins
- Department of Biology, Indiana University, Bloomington, United States
| | | | - Matthew W Hahn
- Department of Biology, Indiana University, Bloomington, United States
- Department of Computer Science, Indiana University, Bloomington, United States
24
|
Koch H, DeGiorgio M. Maximum Likelihood Estimation of Species Trees from Gene Trees in the Presence of Ancestral Population Structure. Genome Biol Evol 2020; 12:3977-3995. [PMID: 32022857 PMCID: PMC7061232 DOI: 10.1093/gbe/evaa022] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Accepted: 01/23/2020] [Indexed: 11/12/2022] Open
Abstract
Though large multilocus genomic data sets have led to overall improvements in phylogenetic inference, they have posed the new challenge of addressing conflicting signals across the genome. In particular, ancestral population structure, which has been uncovered in a number of diverse species, can skew gene tree frequencies, thereby hindering the performance of species tree estimators. Here we develop a novel maximum likelihood method, termed TASTI (Taxa with Ancestral structure Species Tree Inference), that can infer phylogenies under such scenarios, and find that it has increasing accuracy with increasing numbers of input gene trees, contrasting with the relatively poor performances of methods not tailored for ancestral structure. Moreover, we propose a supertree approach that allows TASTI to scale computationally with increasing numbers of input taxa. We use genetic simulations to assess TASTI's performance in the three- and four-taxon settings and demonstrate the application of TASTI on a six-species Afrotropical mosquito data set. Finally, we have implemented TASTI in an open-source software package for ease of use by the scientific community.
Affiliation(s)
- Hillary Koch
- Department of Statistics, Pennsylvania State University
| | - Michael DeGiorgio
- Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University
25
|
Blair C, Ané C. Phylogenetic Trees and Networks Can Serve as Powerful and Complementary Approaches for Analysis of Genomic Data. Syst Biol 2020; 69:593-601. [PMID: 31432090 DOI: 10.1093/sysbio/syz056] [Citation(s) in RCA: 54] [Impact Index Per Article: 13.5] [Received: 03/03/2019] [Accepted: 08/15/2019] [Indexed: 11/14/2022] Open
Abstract
Genomic data have had a profound impact on nearly every biological discipline. In systematics and phylogenetics, the thousands of loci that are now being sequenced can be analyzed under the multispecies coalescent model (MSC) to explicitly account for gene tree discordance due to incomplete lineage sorting (ILS). However, the MSC assumes no gene flow post divergence, calling for additional methods that can accommodate this limitation. Explicit phylogenetic network methods have emerged, which can simultaneously account for ILS and gene flow by representing evolutionary history as a directed acyclic graph. In this point of view, we highlight some of the strengths and limitations of phylogenetic networks and argue that tree-based inference should not be blindly abandoned in favor of networks simply because they represent more parameter rich models. Attention should be given to model selection of reticulation complexity, and the most robust conclusions regarding evolutionary history are likely obtained when combining tree- and network-based inference.
Affiliation(s)
- Christopher Blair
- Department of Biological Sciences, New York City College of Technology, The City University of New York, 285 Jay Street, Brooklyn, NY 11201, USA
- Biology PhD Program, CUNY Graduate Center, 365 5th Ave., New York, NY 10016, USA
| | - Cécile Ané
- Department of Botany, University of Wisconsin - Madison, 1300 University Ave, Madison, WI 53706, USA
- Department of Statistics, University of Wisconsin - Madison, 1300 University Ave, Madison, WI 53706, USA
| |
|
26
|
Jiang Y, Yuan Z, Hu H, Ye X, Zheng Z, Wei Y, Zheng YL, Wang YG, Liu C. Differentiating homoploid hybridization from ancestral subdivision in evaluating the origin of the D lineage in wheat. New Phytol 2020; 228:409-414. [PMID: 32255512 DOI: 10.1111/nph.16578] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/07/2019] [Accepted: 03/19/2020] [Indexed: 06/11/2023]
Affiliation(s)
- Yunfeng Jiang
- Triticeae Research Institute, Sichuan Agricultural University, Wenjiang, Chengdu, 611130, China
- CSIRO Agriculture and Food, St Lucia, Qld, 4067, Australia
| | - Zhongwei Yuan
- Triticeae Research Institute, Sichuan Agricultural University, Wenjiang, Chengdu, 611130, China
- CSIRO Agriculture and Food, St Lucia, Qld, 4067, Australia
| | - Haiyan Hu
- CSIRO Agriculture and Food, St Lucia, Qld, 4067, Australia
- College of Life Science and Technology, Henan Institute of Science and Technology, Xinxiang, Henan, 453003, China
| | - Xueling Ye
- Triticeae Research Institute, Sichuan Agricultural University, Wenjiang, Chengdu, 611130, China
- CSIRO Agriculture and Food, St Lucia, Qld, 4067, Australia
| | - Zhi Zheng
- CSIRO Agriculture and Food, St Lucia, Qld, 4067, Australia
| | - Yuming Wei
- Triticeae Research Institute, Sichuan Agricultural University, Wenjiang, Chengdu, 611130, China
| | - You-Liang Zheng
- Triticeae Research Institute, Sichuan Agricultural University, Wenjiang, Chengdu, 611130, China
| | - You-Gan Wang
- Science and Engineering Facility, Queensland University of Technology, Brisbane, Qld, 4000, Australia
| | - Chunji Liu
- CSIRO Agriculture and Food, St Lucia, Qld, 4067, Australia
| |
|
27
|
McKenzie PF, Eaton DAR. ipcoal: an interactive Python package for simulating and analyzing genealogies and sequences on a species tree or network. Bioinformatics 2020; 36:4193-4196. [PMID: 32399564 DOI: 10.1093/bioinformatics/btaa486] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/13/2020] [Revised: 05/01/2020] [Accepted: 05/05/2020] [Indexed: 11/13/2022] Open
Abstract
SUMMARY: ipcoal is a free and open source Python package for simulating and analyzing genealogies and sequences. It automates the task of describing complex demographic models (e.g. with divergence times, effective population sizes, migration events) to the msprime coalescent simulator by parsing a user-supplied species tree or network. Genealogies, sequences and metadata are returned in tabular format allowing for easy downstream analyses. ipcoal includes phylogenetic inference tools to automate gene tree inference from simulated sequence data, and visualization tools for analyzing results and verifying model accuracy. The ipcoal package is a powerful tool for posterior predictive data analysis, for methods validation and for teaching coalescent methods in an interactive and visual environment. AVAILABILITY AND IMPLEMENTATION: Source code is available from the GitHub repository (https://github.com/pmckenz1/ipcoal/) and is distributed for packaged installation with conda. Complete documentation and interactive notebooks prepared for teaching purposes, including an empirical example, are available at https://ipcoal.readthedocs.io/. CONTACT: p.mckenzie@columbia.edu.
Affiliation(s)
- Patrick F McKenzie
- Department of Ecology, Evolution and Environmental Biology, Columbia University, New York, NY 10027, USA
| | - Deren A R Eaton
- Department of Ecology, Evolution and Environmental Biology, Columbia University, New York, NY 10027, USA
| |
|
28
|
Molecular Clocks without Rocks: New Solutions for Old Problems. Trends Genet 2020; 36:845-856. [PMID: 32709458 DOI: 10.1016/j.tig.2020.06.002] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 03/14/2020] [Revised: 06/02/2020] [Accepted: 06/11/2020] [Indexed: 02/07/2023]
Abstract
Molecular data have been used to date species divergences ever since they were described as documents of evolutionary history in the 1960s. Yet, an inadequate fossil record and discordance between gene trees and species trees are persistently problematic. We examine how, by accommodating gene tree discordance and by scaling branch lengths to absolute time using mutation rate and generation time, multispecies coalescent (MSC) methods can potentially overcome these challenges. We find that time estimates can differ, in some cases substantially, depending on whether MSC methods or traditional phylogenetic methods that apply concatenation are used, and whether the tree is calibrated with pedigree-based mutation rates or with fossils. We discuss the advantages and shortcomings of both approaches and provide practical guidance for data analysis when using these methods.
|
29
|
Wakeley J. Developments in coalescent theory from single loci to chromosomes. Theor Popul Biol 2020; 133:56-64. [DOI: 10.1016/j.tpb.2020.02.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/19/2019] [Revised: 02/19/2020] [Accepted: 02/26/2020] [Indexed: 10/24/2022]
|
30
|
Jiao X, Flouri T, Rannala B, Yang Z. The Impact of Cross-Species Gene Flow on Species Tree Estimation. Syst Biol 2020; 69:830-847. [DOI: 10.1093/sysbio/syaa001] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 06/30/2018] [Revised: 11/12/2019] [Accepted: 01/15/2020] [Indexed: 12/26/2022] Open
Abstract
Recent analyses of genomic sequence data suggest cross-species gene flow is common in both plants and animals, posing challenges to species tree estimation. We examine the levels of gene flow needed to mislead species tree estimation with three species and either episodic introgressive hybridization or continuous migration between an outgroup and one ingroup species. Several species tree estimation methods are examined, including the majority-vote method based on the most common gene tree topology (with either the true or reconstructed gene trees used), the UPGMA method based on the average sequence distances (or average coalescent times) between species, and the full-likelihood method based on multilocus sequence data. Our results suggest that the majority-vote method based on gene tree topologies is more robust to gene flow than the UPGMA method based on coalescent times and both are more robust than likelihood assuming a multispecies coalescent (MSC) model with no cross-species gene flow. Comparison of the continuous migration model with the episodic introgression model suggests that a small amount of gene flow per generation can cause drastic changes to the genetic history of the species and mislead species tree methods, especially if the species diverged through radiative speciation events. Estimates of parameters under the MSC with gene flow suggest that African mosquito species in the Anopheles gambiae species complex constitute such an example of extreme impact of gene flow on species phylogeny. [IM; introgression; migration; MSci; multispecies coalescent; species tree.]
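The majority-vote idea mentioned in this abstract can be illustrated with a minimal Python sketch. It picks the most common rooted three-species gene tree topology and, as a baseline, evaluates the standard MSC result that with one sequence per species and no gene flow a gene tree matches the species tree with probability 1 - (2/3)e^(-T), where T is the internal branch length in coalescent units. The function names and the string labels used for topologies are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def majority_vote(topologies):
    """Return the most frequent gene tree topology and its frequency.

    `topologies` is a list of hashable labels, e.g. "((A,B),C)".
    Ties are broken by first appearance in the list.
    """
    if not topologies:
        raise ValueError("need at least one gene tree topology")
    counts = Counter(topologies)
    topology, count = counts.most_common(1)[0]
    return topology, count / len(topologies)


def msc_concordance_probability(t_internal):
    """P(gene tree matches species tree) for three species, one sequence
    each, under the MSC with no gene flow (t_internal in coalescent units)."""
    if t_internal < 0:
        raise ValueError("branch length must be non-negative")
    return 1.0 - (2.0 / 3.0) * math.exp(-t_internal)


# Hypothetical example: ten gene trees, six of which support ((A,B),C).
gene_trees = ["((A,B),C)"] * 6 + ["((A,C),B)"] * 3 + ["((B,C),A)"] * 1
best, freq = majority_vote(gene_trees)
```

Under the no-gene-flow MSC the two discordant topologies are expected to be equally frequent, so a marked asymmetry between them (as in the made-up counts above) is the kind of signal that gene flow can produce.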
Affiliation(s)
- Xiyun Jiao
- Department of Genetics, University College London, Gower Street, London WC1E 6BT, UK
| | - Tomáš Flouri
- Department of Genetics, University College London, Gower Street, London WC1E 6BT, UK
| | - Bruce Rannala
- Department of Evolution and Ecology, University of California, Davis, CA 95616, USA
| | - Ziheng Yang
- Department of Genetics, University College London, Gower Street, London WC1E 6BT, UK
| |
|
31
|
Chung Y. Recent advances in Bayesian inference of isolation-with-migration models. Genomics Inform 2020; 17:e37. [PMID: 31896237 PMCID: PMC6944047 DOI: 10.5808/gi.2019.17.4.e37] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/24/2019] [Accepted: 10/23/2019] [Indexed: 12/03/2022] Open
Abstract
Isolation-with-migration (IM) models have become popular for explaining population divergence in the presence of migration. Bayesian methods are commonly used to estimate IM models, but they are limited to small data analysis or simple model inference. Recently three methods, IMa3, MIST, and AIM, resolved these limitations. Here, we describe the major problems addressed by these three software packages and compare their inference methods, which differ despite using the same standard likelihood function.
Affiliation(s)
- Yujin Chung
- Department of Applied Statistics, Kyonggi University, Suwon 16227, Korea
| |
|
32
|
Abstract
Coalescent simulation is a fundamental tool in modern population genetics. The msprime library provides unprecedented scalability in terms of both the simulations that can be performed and the efficiency with which the results can be processed. We show how coalescent models for population structure and demography can be constructed using a simple Python API, as well as how we can process the results of such simulations to efficiently calculate statistics of interest. We illustrate msprime's flexibility by implementing a simple (but functional) approximate Bayesian computation inference method in just a few tens of lines of code.
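To make the idea of approximate Bayesian computation (ABC) on top of a coalescent simulator concrete, here is a minimal rejection-ABC sketch. It does not use msprime: the simulator is a hand-written standard-library stand-in for the neutral Kingman coalescent with infinite-sites mutation, and the summary statistic (number of segregating sites), the uniform prior, the tolerance and all function names are illustrative assumptions rather than the authors' implementation.

```python
import random


def simulate_segregating_sites(n, theta, rng):
    """Simulate the number of segregating sites for n sequences under the
    standard neutral coalescent (infinite sites, no recombination)."""
    total_length = 0.0
    for k in range(n, 1, -1):
        # While k lineages remain, the waiting time to the next coalescence
        # is exponential with rate k(k-1)/2 (time in coalescent units).
        total_length += k * rng.expovariate(k * (k - 1) / 2)
    # Mutations fall on the tree as a Poisson process with rate theta/2 per
    # unit branch length; count unit-rate arrivals that occur before `mean`.
    mean = theta / 2 * total_length
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def abc_rejection(s_obs, n, n_draws=20000, theta_max=20.0, tol=1, seed=1):
    """Keep prior draws of theta whose simulated S is within tol of s_obs."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, theta_max)  # flat prior on [0, theta_max]
        if abs(simulate_segregating_sites(n, theta, rng) - s_obs) <= tol:
            accepted.append(theta)
    return accepted


# Hypothetical data: 12 segregating sites observed in a sample of 10.
posterior = abc_rejection(s_obs=12, n=10)
posterior_mean = sum(posterior) / len(posterior)
```

The accepted values approximate the posterior of theta given the observed statistic. In practice the hand-written simulator would be replaced by a full coalescent simulator, which is where a library such as msprime comes in.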
Affiliation(s)
- Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK.
| | - Konrad Lohse
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK
| |
|
33
|
Abstract
Many methods exist for detecting introgression between nonsister species, but the most commonly used require either a single sequence from four or more taxa or multiple sequences from each of three taxa. Here, we present a test for introgression that uses only a single sequence from three taxa. This test, denoted D3, uses logic similar to that of the standard D-test for introgression, but by using pairwise distances instead of site patterns it is able to detect the same signal of introgression with fewer species. We use simulations to show that D3 has statistical power almost equal to D, and demonstrate its use on a data set of wild bananas (Musa). The new test is easy to apply and easy to interpret, and should find wide use among currently available data sets.
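The abstract does not give the formula, so the following Python sketch is an assumption about how a distance-based three-taxon statistic of this kind can be computed: with A and B as sister taxa and C as the third taxon, it takes the normalized difference of the pairwise distances d(B,C) and d(A,C). The function names, the use of uncorrected proportion-of-differences distances and the sign convention are all illustrative choices.

```python
def hamming_distance(seq_x, seq_y):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(a != b for a, b in zip(seq_x, seq_y)) / len(seq_x)


def d3_statistic(seq_a, seq_b, seq_c):
    """Normalized difference of the two distances to C, assuming A and B
    are sister taxa. Values near 0 are expected without introgression."""
    d_ac = hamming_distance(seq_a, seq_c)
    d_bc = hamming_distance(seq_b, seq_c)
    if d_ac + d_bc == 0:
        raise ValueError("no differences from C; statistic undefined")
    return (d_bc - d_ac) / (d_bc + d_ac)


# Toy aligned sequences (hypothetical): B is closer to C than A is.
seq_a = "AAAAAAAAAA"
seq_b = "AAAAAAAATT"
seq_c = "TTAAAAAATT"
d3 = d3_statistic(seq_a, seq_b, seq_c)
```

With this sign convention a negative value means B is closer to C than A is, which would be consistent with exchange between B and C. A real analysis would also need a significance assessment, for example by resampling genomic blocks, which is not shown here.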
Affiliation(s)
- Matthew W Hahn
- Department of Biology, Indiana University, Bloomington, IN
- Department of Computer Science, Indiana University, Bloomington, IN
| | - Mark S Hibbins
- Department of Biology, Indiana University, Bloomington, IN
| |
|
34
|
On the unranked topology of maximally probable ranked gene tree topologies. J Math Biol 2019; 79:1205-1225. [DOI: 10.1007/s00285-019-01392-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/14/2018] [Revised: 04/05/2019] [Indexed: 10/26/2022]
|
35
|
Peyrégne S, Slon V, Mafessoni F, de Filippo C, Hajdinjak M, Nagel S, Nickel B, Essel E, Le Cabec A, Wehrberger K, Conard NJ, Kind CJ, Posth C, Krause J, Abrams G, Bonjean D, Di Modica K, Toussaint M, Kelso J, Meyer M, Pääbo S, Prüfer K. Nuclear DNA from two early Neandertals reveals 80,000 years of genetic continuity in Europe. Sci Adv 2019; 5:eaaw5873. [PMID: 31249872 PMCID: PMC6594762 DOI: 10.1126/sciadv.aaw5873] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 01/07/2019] [Accepted: 05/22/2019] [Indexed: 06/09/2023]
Abstract
Little is known about the population history of Neandertals over the hundreds of thousands of years of their existence. We retrieved nuclear genomic sequences from two Neandertals, one from Hohlenstein-Stadel Cave in Germany and the other from Scladina Cave in Belgium, who lived around 120,000 years ago. Despite the deeply divergent mitochondrial lineage present in the former individual, both Neandertals are genetically closer to later Neandertals from Europe than to a roughly contemporaneous individual from Siberia. That the Hohlenstein-Stadel and Scladina individuals lived around the time of their most recent common ancestor with later Neandertals suggests that all later Neandertals trace at least part of their ancestry back to these early European Neandertals.
Affiliation(s)
- Stéphane Peyrégne
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Viviane Slon
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Fabrizio Mafessoni
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Cesare de Filippo
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Mateja Hajdinjak
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Sarah Nagel
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Birgit Nickel
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Elena Essel
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Adeline Le Cabec
- Department of Human Evolution, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | | | - Nicholas J. Conard
- Department of Early Prehistory and Quaternary Ecology, University of Tübingen, Schloss Hohentübingen, Tübingen 72070, Germany
| | - Claus Joachim Kind
- State Office for Cultural Heritage Baden-Württemberg, Berliner Strasse 12, Esslingen 73728, Germany
| | - Cosimo Posth
- Max Planck Institute for the Science of Human History, Kahlaische Strasse 10, Jena 07745, Germany
| | - Johannes Krause
- Max Planck Institute for the Science of Human History, Kahlaische Strasse 10, Jena 07745, Germany
| | | | | | | | | | - Janet Kelso
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Matthias Meyer
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Svante Pääbo
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
| | - Kay Prüfer
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig 04103, Germany
- Max Planck Institute for the Science of Human History, Kahlaische Strasse 10, Jena 07745, Germany
| |
|
36
|
Koskela J, Wilke Berenguer M. Robust model selection between population growth and multiple merger coalescents. Math Biosci 2019; 311:1-12. [PMID: 30851276 DOI: 10.1016/j.mbs.2019.03.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 06/07/2018] [Revised: 03/05/2019] [Accepted: 03/05/2019] [Indexed: 11/24/2022]
Abstract
We study the effect of biological confounders on the model selection problem between Kingman coalescents with population growth, and Ξ-coalescents involving simultaneous multiple mergers. We use a low dimensional, computationally tractable summary statistic, dubbed the singleton-tail statistic, to carry out approximate likelihood ratio tests between these model classes. The singleton-tail statistic has been shown to distinguish between them with high power in the simple setting of neutrally evolving, panmictic populations without recombination. We extend this work by showing that cryptic recombination and selection do not diminish the power of the test, but that misspecifying population structure does. Furthermore, we demonstrate that the singleton-tail statistic can also solve the more challenging model selection problem between multiple mergers due to selective sweeps, and multiple mergers due to high fecundity with moderate power of up to 30%.
Affiliation(s)
- Jere Koskela
- Department of Statistics, University of Warwick, Coventry CV4 7AL, UK.
| | - Maite Wilke Berenguer
- Fakultät für Mathematik, Ruhr Universität Bochum, Universitätstraße 150, Bochum 44780, Germany.
| |
|
37
|
The Timing and Direction of Introgression Under the Multispecies Network Coalescent. Genetics 2019; 211:1059-1073. [PMID: 30670542 DOI: 10.1534/genetics.118.301831] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 11/29/2018] [Accepted: 01/21/2019] [Indexed: 12/26/2022] Open
Abstract
Introgression is a pervasive biological process, and many statistical methods have been developed to infer its presence from genomic data. However, many of the consequences and genomic signatures of introgression remain unexplored from a methodological standpoint. Here, we develop a model for the timing and direction of introgression based on the multispecies network coalescent, and from it suggest new approaches for testing introgression hypotheses. We suggest two new statistics, D1 and D2, which can be used in conjunction with other information to test hypotheses relating to the timing and direction of introgression, respectively. D1 may find use in evaluating cases of homoploid hybrid speciation (HHS), while D2 provides a four-taxon test for polarizing introgression. Although analytical expectations for our statistics require a number of assumptions to be met, we show how simulations can be used to test hypotheses about introgression when these assumptions are violated. We apply the D1 statistic to genomic data from the wild yeast Saccharomyces paradoxus, a proposed example of HHS, demonstrating its use as a test of this model. These methods provide new and powerful ways to address questions relating to the timing and direction of introgression.
|
38
|
Abstract
Convergent evolution provides key evidence for the action of natural selection. The process of convergence is often inferred because the same trait appears in multiple species that are not closely related. However, different parts of the genome can reveal different relationships among species, with some genes or regions uniting lineages that appear unrelated in the species tree. If changes in traits occur in these discordant regions, a false pattern of convergence can be produced (known as “hemiplasy”). Here, we provide a way to quantify the probability that hemiplasy occurs and contrast it with the probability of convergence. We find that hemiplasy is likely to explain many apparent cases of convergent evolution, even when the fraction of discordant regions is low. Convergent evolution—the appearance of the same character state in apparently unrelated organisms—is often inferred when a trait is incongruent with the species tree. However, trait incongruence can also arise from changes that occur on discordant gene trees, a process referred to as hemiplasy. Hemiplasy is rarely taken into account in studies of convergent evolution, despite the fact that phylogenomic studies have revealed rampant discordance. Here, we study the relative probabilities of homoplasy (including convergence and reversal) and hemiplasy for an incongruent trait. We derive expressions for the probabilities of the two events, showing that they depend on many of the same parameters. We find that hemiplasy is as likely—or more likely—than homoplasy for a wide range of conditions, even when levels of discordance are low. We also present a method to calculate the ratio of these two probabilities (the “hemiplasy risk factor”) along the branches of a phylogeny of arbitrary length. Such calculations can be applied to any tree to identify when and where incongruent traits may be due to hemiplasy.
|
39
|
Mendes FK, Fuentes-González JA, Schraiber JG, Hahn MW. A multispecies coalescent model for quantitative traits. eLife 2018; 7:e36482. [PMID: 29969096 PMCID: PMC6092125 DOI: 10.7554/elife.36482] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 03/07/2018] [Accepted: 07/02/2018] [Indexed: 11/13/2022] Open
Abstract
We present a multispecies coalescent model for quantitative traits that allows for evolutionary inferences at micro- and macroevolutionary scales. A major advantage of this model is its ability to incorporate genealogical discordance underlying a quantitative trait. We show that discordance causes a decrease in the expected trait covariance between more closely related species relative to more distantly related species. If unaccounted for, this outcome can lead to an overestimation of a trait's evolutionary rate, to a decrease in its phylogenetic signal, and to errors when examining shifts in mean trait values. The number of loci controlling a quantitative trait appears to be irrelevant to all trends reported, and discordance also affected discrete, threshold traits. Our model and analyses point to the conditions under which different methods should fare better or worse, in addition to indicating current and future approaches that can mitigate the effects of discordance.
Affiliation(s)
- Fábio K Mendes
- Department of Biology, Indiana University, Bloomington, United States
| | - Jesualdo A Fuentes-González
- Department of Biology, Indiana University, Bloomington, United States
- School of Life Sciences, Arizona State University, Tempe, United States
| | - Joshua G Schraiber
- Department of Biology, Temple University, Philadelphia, United States
- Center for Computational Genetics and Genomics, Temple University, Philadelphia, United States
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, United States
| | - Matthew W Hahn
- Department of Biology, Indiana University, Bloomington, United States
- Department of Computer Science, Indiana University, Bloomington, United States
| |
|
40
|
Wang RJ, Hahn MW. Speciation genes are more likely to have discordant gene trees. Evol Lett 2018; 2:281-296. [PMID: 30283682 PMCID: PMC6121824 DOI: 10.1002/evl3.77] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/29/2018] [Revised: 06/15/2018] [Accepted: 07/06/2018] [Indexed: 12/27/2022] Open
Abstract
Speciation genes are responsible for reproductive isolation between species. By directly participating in the process of speciation, the genealogies of isolating loci have been thought to more faithfully represent species trees. The unique properties of speciation genes may provide valuable evolutionary insights and help determine the true history of species divergence. Here, we formally analyze whether genealogies from loci participating in Dobzhansky-Muller (DM) incompatibilities are more likely to be concordant with the species tree under incomplete lineage sorting (ILS). Individual loci differ stochastically from the true history of divergence with a predictable frequency due to ILS, and these expectations, combined with the DM model of intrinsic reproductive isolation from epistatic interactions, can be used to examine the probability of concordance at isolating loci. Contrary to existing verbal models, we find that reproductively isolating loci that follow the DM model are often more likely to have discordant gene trees. These results are dependent on the pattern of isolation observed between three species, the time between speciation events, and the time since the last speciation event. Results supporting a higher probability of discordance are found for both derived-derived and derived-ancestral DM pairs, and regardless of whether incompatibilities are allowed or prohibited from segregating in the same population. Our overall results suggest that DM loci are unlikely to be especially useful for reconstructing species relationships, even in the presence of gene flow between incipient species, and may in fact be positively misleading.
Affiliation(s)
| | - Matthew W. Hahn
- Department of Biology, Indiana University, Bloomington, Indiana
- Department of Computer Science, Indiana University, Bloomington, Indiana
| |
|
41
|
Wu M, Kostyun JL, Hahn MW, Moyle LC. Dissecting the basis of novel trait evolution in a radiation with widespread phylogenetic discordance. Mol Ecol 2018; 27:3301-3316. [PMID: 29953708 DOI: 10.1111/mec.14780] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 10/10/2017] [Revised: 01/15/2018] [Accepted: 01/19/2018] [Indexed: 01/03/2023]
Abstract
Phylogenetic analyses of trait evolution can provide insight into the evolutionary processes that initiate and drive phenotypic diversification. However, recent phylogenomic studies have revealed extensive gene tree-species tree discordance, which can lead to incorrect inferences of trait evolution if only a single species tree is used for analysis. This phenomenon, dubbed "hemiplasy", is particularly important to consider during analyses of character evolution in rapidly radiating groups, where discordance is widespread. Here, we generate whole-transcriptome data for a phylogenetic analysis of 14 species in the plant genus Jaltomata (the sister clade to Solanum), which has experienced rapid, recent trait evolution, including in fruit and nectar colour, and flower size and shape. Consistent with other radiations, we find evidence for rampant gene tree discordance due to incomplete lineage sorting (ILS) and to introgression events among the well-supported subclades. As both ILS and introgression increase the probability of hemiplasy, we perform several analyses that take discordance into account while identifying genes that might contribute to phenotypic evolution. Despite discordance, the history of fruit colour evolution in Jaltomata can be inferred with high confidence, and we find evidence of de novo adaptive evolution at individual genes associated with fruit colour variation. In contrast, hemiplasy appears to strongly affect inferences about floral character transitions in Jaltomata, and we identify candidate loci that could arise either from multiple lineage-specific substitutions or standing ancestral polymorphisms. Our analysis provides a generalizable example of how to manage discordance when identifying loci associated with trait evolution in a radiating lineage.
Affiliation(s)
- Meng Wu
- Department of Biology, Indiana University, Bloomington, Indiana
| | - Jamie L Kostyun
- Department of Biology, Indiana University, Bloomington, Indiana
- Department of Plant Biology, University of Vermont, Burlington, Vermont
| | - Matthew W Hahn
- Department of Biology, Indiana University, Bloomington, Indiana
- Department of Computer Science, Indiana University, Bloomington, Indiana
| | - Leonie C Moyle
- Department of Biology, Indiana University, Bloomington, Indiana
| |
|
42
|
Koskela J. Multi-locus data distinguishes between population growth and multiple merger coalescents. Stat Appl Genet Mol Biol 2018; 17:sagmb-2017-0011. [DOI: 10.1515/sagmb-2017-0011] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 11/15/2022]
Abstract
We introduce a low dimensional function of the site frequency spectrum that is tailor-made for distinguishing coalescent models with multiple mergers from Kingman coalescent models with population growth, and use this function to construct a hypothesis test between these model classes. The null and alternative sampling distributions of the statistic are intractable, but its low dimensionality renders them amenable to Monte Carlo estimation. We construct kernel density estimates of the sampling distributions based on simulated data, and show that the resulting hypothesis test dramatically improves on the statistical power of a current state-of-the-art method. A key reason for this improvement is the use of multi-locus data, in particular averaging observed site frequency spectra across unlinked loci to reduce sampling variance. We also demonstrate the robustness of our method to nuisance and tuning parameters. Finally we show that the same kernel density estimates can be used to conduct parameter estimation, and argue that our method is readily generalisable for applications in model selection, parameter inference and experimental design.
Affiliation(s)
- Jere Koskela
- Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK
| |
|
43
|
Zhang W, Zhang X, Li K, Wang C, Cai L, Zhuang W, Xiang M, Liu X. Introgression and gene family contraction drive the evolution of lifestyle and host shifts of hypocrealean fungi. Mycology 2018; 9:176-188. [PMID: 30181924 PMCID: PMC6115877 DOI: 10.1080/21501203.2018.1478333] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 04/26/2018] [Accepted: 05/15/2018] [Indexed: 12/20/2022] Open
Abstract
Hypocrealean fungi (Ascomycota) are known for their diversity of lifestyles. Their vital influences on agricultural and natural ecosystems have resulted in a number of sequenced genomes, which provide essential data for genomic analysis. In total, 45 hypocrealean fungal genomes were used to construct a phylogeny. The phylogeny showed that plant pathogens in Nectriaceae diverged earliest, followed by animal pathogens in Cordycipitaceae, Ophiocordycipitaceae and Clavicipitaceae with mycoparasites in Hypocreaceae. Insect/nematode pathogens and grass endophytes in Clavicipitaceae diverged last. Gene families associated with host-derived nutrients are significantly contracted in diverged lineages compared with the ancestral species. Introgression was detected in certain lineages of hypocrealean fungi, and the genes located in the introgressed regions are mainly involved in host recognition, transcriptional regulation, stress response and cell growth regulation. These results indicate that contraction of gene families and introgression might be the main mechanisms driving lifestyle differentiation, evolution and host shifts in hypocrealean fungi.
Affiliation(s)
- Weiwei Zhang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Xiaoling Zhang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Kuan Li
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Chengshu Wang
- Key Laboratory of Insect Developmental and Evolutionary Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Lei Cai
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Wenying Zhuang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Meichun Xiang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Xingzhong Liu
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| |
|
44
|
Akita T, Takuno S, Innan H. Coalescent framework for prokaryotes undergoing interspecific homologous recombination. Heredity (Edinb) 2018; 120:474-484. [PMID: 29358726 PMCID: PMC5889408 DOI: 10.1038/s41437-017-0034-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 07/05/2017] [Revised: 10/04/2017] [Accepted: 10/23/2017] [Indexed: 12/11/2022] Open
Abstract
The coalescent process for prokaryote species is considered theoretically. Prokaryotes undergo homologous recombination with individuals of the same species (intraspecific recombination) and with individuals of other species (interspecific recombination). This work focuses on interspecific recombination because intraspecific recombination has already been well incorporated in the coalescent framework. We present a simulation framework for generating SNP (single-nucleotide polymorphism) patterns that allows integration of external DNA from other species into the host genome. Using this simulation tool, msPro, we observed that the joint processes of intra- and interspecific recombination generate complex SNP patterns. The direct effect of interspecific recombination includes increased polymorphism. Because interspecific recombination is very rare in nature, it generates regions with exceptionally high polymorphism. Following interspecific recombination, intraspecific recombination cuts the integrated external DNA into small fragments, generating a complex SNP pattern that appears as if external DNA was integrated multiple times. The insight gained from our work using the msPro simulator will be useful for understanding and evaluating the relative contributions of intra- and interspecific recombination events in generating complex SNP patterns in prokaryotes.
Affiliation(s)
- Tetsuya Akita
- Graduate University for Advanced Studies, Hayama, Kanagawa, 240-0193, Japan
- National Research Institute of Far Seas Fisheries, Fisheries Research Agency, Yokohama, Kanagawa, 236-8648, Japan
| | - Shohei Takuno
- Graduate University for Advanced Studies, Hayama, Kanagawa, 240-0193, Japan
| | - Hideki Innan
- Graduate University for Advanced Studies, Hayama, Kanagawa, 240-0193, Japan.
| |
|
45
|
Mendes FK, Hahn MW. Why Concatenation Fails Near the Anomaly Zone. Syst Biol 2018; 67:158-169. [PMID: 28973673 DOI: 10.1093/sysbio/syx063] [Citation(s) in RCA: 64] [Impact Index Per Article: 10.7] [Received: 03/13/2017] [Accepted: 06/30/2017] [Indexed: 11/12/2022] Open
Abstract
Genome-scale sequencing has been of great benefit in recovering species trees but has not provided final answers. Despite the rapid accumulation of molecular sequences, resolving short and deep branches of the tree of life has remained a challenge and has prompted the development of new strategies that can make the best use of available data. One such strategy, the concatenation of gene alignments, can be successful when coupled with many tree estimation methods, but has also been shown to fail when there are high levels of incomplete lineage sorting. Here, we focus on the failure of likelihood-based methods in retrieving a rooted, asymmetric four-taxon species tree from concatenated data when the species tree is in or near the anomaly zone, a region of parameter space where the most common gene tree does not match the species tree because of incomplete lineage sorting. First, we use coalescent theory to prove that most informative sites will support the species tree in the anomaly zone, and that as a consequence maximum-parsimony succeeds in recovering the species tree from concatenated data. We further show that maximum-likelihood tree estimation from concatenated data fails both inside and outside the anomaly zone, and that this failure cannot be easily predicted from the topology of the most common gene tree. We demonstrate that likelihood-based methods often fail in a region partially overlapping the anomaly zone, likely because of the lower relative cost of substitutions on discordant gene tree branches that are absent from the species tree. Our results confirm and extend previous reports on the performance of these methods applied to concatenated data from a rooted, asymmetric four-taxon species tree, and highlight avenues for future work improving the performance of methods aimed at recovering the species tree.
Affiliation(s)
- Fábio K Mendes
- Department of Biology, Indiana University, Bloomington, IN 47405, USA
| | - Matthew W Hahn
- Department of Biology, Indiana University, Bloomington, IN 47405, USA; Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
| |
|
46
|
Saitou N. Neutral Evolution. Introduction to Evolutionary Genomics 2018. [PMCID: PMC7121930 DOI: 10.1007/978-3-319-92642-1_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022]
Abstract
Neutral evolution is the default process of genomic changes. This is because our world is finite, and the randomness, indispensable for neutral evolution, is important when we consider the history of a finite world. The random nature of DNA propagation is discussed using branching process, coalescent process, Markov process, and diffusion process. Expected evolutionary patterns under neutrality are then discussed on fixation probability, rate of evolution, and amount of DNA variation kept in population. We then discuss various features of neutral evolution starting from evolutionary rates, synonymous and nonsynonymous substitutions, junk DNA, and pseudogenes.
Affiliation(s)
- Naruya Saitou
- Division of Population Genetics, National Institute of Genetics (NIG), Mishima, Shizuoka, Japan
| |
|
47
|
Copetti D, Búrquez A, Bustamante E, Charboneau JLM, Childs KL, Eguiarte LE, Lee S, Liu TL, McMahon MM, Whiteman NK, Wing RA, Wojciechowski MF, Sanderson MJ. Extensive gene tree discordance and hemiplasy shaped the genomes of North American columnar cacti. Proc Natl Acad Sci U S A 2017; 114:12003-12008. [PMID: 29078296 PMCID: PMC5692538 DOI: 10.1073/pnas.1706367114] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Indexed: 11/18/2022] Open
Abstract
Few clades of plants have proven as difficult to classify as cacti. One explanation may be an unusually high level of convergent and parallel evolution (homoplasy). To evaluate support for this phylogenetic hypothesis at the molecular level, we sequenced the genomes of four cacti in the especially problematic tribe Pachycereeae, which contains most of the large columnar cacti of Mexico and adjacent areas, including the iconic saguaro cactus (Carnegiea gigantea) of the Sonoran Desert. We assembled a high-coverage draft genome for saguaro and lower coverage genomes for three other genera of tribe Pachycereeae (Pachycereus, Lophocereus, and Stenocereus) and a more distant outgroup cactus, Pereskia. We used these to construct 4,436 orthologous gene alignments. Species tree inference consistently returned the same phylogeny, but gene tree discordance was high: 37% of gene trees having at least 90% bootstrap support conflicted with the species tree. Evidently, discordance is a product of long generation times and moderately large effective population sizes, leading to extensive incomplete lineage sorting (ILS). In the best supported gene trees, 58% of apparent homoplasy at amino acid sites in the species tree is due to gene tree-species tree discordance rather than parallel substitutions in the gene trees themselves, a phenomenon termed "hemiplasy." The high rate of genomic hemiplasy may contribute to apparent parallelisms in phenotypic traits, which could confound understanding of species relationships and character evolution in cacti.
Affiliation(s)
- Dario Copetti
- Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ 85721
- International Rice Research Institute, Los Baños, Laguna, Philippines
| | - Alberto Búrquez
- Instituto de Ecología, Unidad Hermosillo, Universidad Nacional Autónoma de México, Hermosillo, Sonora, Mexico
| | - Enriquena Bustamante
- Instituto de Ecología, Unidad Hermosillo, Universidad Nacional Autónoma de México, Hermosillo, Sonora, Mexico
| | - Joseph L M Charboneau
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721
| | - Kevin L Childs
- Department of Plant Biology, Michigan State University, East Lansing, MI 48824
| | - Luis E Eguiarte
- Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Ciudad de México, Mexico
| | - Seunghee Lee
- Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ 85721
| | - Tiffany L Liu
- Department of Plant Biology, Michigan State University, East Lansing, MI 48824
| | | | - Noah K Whiteman
- Department of Integrative Biology, University of California, Berkeley, CA 94720
| | - Rod A Wing
- Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ 85721
- International Rice Research Institute, Los Baños, Laguna, Philippines
| | | | - Michael J Sanderson
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721;
| |
|
48
|
Zou Z, Zhang J. Gene Tree Discordance Does Not Explain Away the Temporal Decline of Convergence in Mammalian Protein Sequence Evolution. Mol Biol Evol 2017; 34:1682-1688. [PMID: 28379570 DOI: 10.1093/molbev/msx109] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 12/28/2022] Open
Abstract
Several authors reported lower frequencies of protein sequence convergence between more distantly related evolutionary lineages and attributed this trend to epistasis, which renders the acceptable amino acids at a site more different and convergence less likely in more divergent lineages. A recent primate study, however, suggested that this trend is at least partially and potentially entirely an artifact of gene tree discordance (GTD). Here, we demonstrate in a genome-wide data set from 17 mammals that the temporal trend remains (1) upon the control of the GTD level, (2) in genes whose genealogies are concordant with the species tree, and (3) for convergent changes, which are extremely unlikely to be caused by GTD. Similar results are observed in a comparable data set of 12 fruit flies in some but not all of these tests. We conclude that, at least in some cases, the temporal decline of convergence is genuine, reflecting an impact of epistasis on protein evolution.
Affiliation(s)
- Zhengting Zou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI
| | - Jianzhi Zhang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI
| |
|
49
|
Hutama A, Dahruddin H, Busson F, Sauri S, Keith P, Hadiaty RK, Hanner R, Suryobroto B, Hubert N. Identifying spatially concordant evolutionary significant units across multiple species through DNA barcodes: Application to the conservation genetics of the freshwater fishes of Java and Bali. Glob Ecol Conserv 2017. [DOI: 10.1016/j.gecco.2017.11.005] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Indexed: 12/01/2022] Open
|
50
|
Ottenburghs J, Megens HJ, Kraus RHS, van Hooft P, van Wieren SE, Crooijmans RPMA, Ydenberg RC, Groenen MAM, Prins HHT. A history of hybrids? Genomic patterns of introgression in the True Geese. BMC Evol Biol 2017; 17:201. [PMID: 28830337 PMCID: PMC5568201 DOI: 10.1186/s12862-017-1048-2] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 04/11/2017] [Accepted: 08/11/2017] [Indexed: 12/19/2022] Open
Abstract
Background: The impacts of hybridization on the process of speciation are manifold, leading to distinct patterns across the genome. Genetic differentiation accumulates in certain genomic regions, while divergence is hampered in other regions by homogenizing gene flow, resulting in a heterogeneous genomic landscape. A consequence of this heterogeneity is that genomes are mosaics of different gene histories that can be compared to unravel complex speciation and hybridization events. However, incomplete lineage sorting (often the outcome of rapid speciation) can result in similar patterns. New statistical techniques, such as the D-statistic and hybridization networks, can be applied to disentangle the contributions of hybridization and incomplete lineage sorting. We unravel patterns of hybridization and incomplete lineage sorting during and after the diversification of the True Geese (family Anatidae, tribe Anserini, genera Anser and Branta) using an exon-based hybridization network approach and taking advantage of discordant gene tree histories by re-sequencing all taxa of this clade. In addition, we determine the timing of introgression and reconstruct historical effective population sizes for all goose species to infer which demographic or biogeographic factors might explain the observed patterns of introgression. Results: We find indications for ancient interspecific gene flow during the diversification of the True Geese and pinpoint several putative hybridization events. Specifically, in the genus Branta, both the ancestor of the White-cheeked Geese (Hawaiian Goose, Canada Goose, Cackling Goose and Barnacle Goose) and the ancestor of the Brent Goose hybridized with Red-breasted Goose. One hybridization network suggests a hybrid origin for the Red-breasted Goose, but this scenario seems unlikely and is not supported by the D-statistic analysis.
The complex, highly reticulated evolutionary history of the genus Anser hampered the estimation of ancient hybridization events by means of hybridization networks. The reconstruction of historical effective population sizes shows that most species showed a steady increase during the Pliocene and Pleistocene. These large effective population sizes might have facilitated contact between diverging goose species, resulting in the establishment of hybrid zones and consequent gene flow. Conclusions: Our analyses suggest that the evolutionary history of the True Geese is influenced by introgressive hybridization. The approach that we have used, based on genome-wide phylogenetic incongruence and network analyses, will be a useful procedure to reconstruct the complex evolutionary histories of many naturally hybridizing species groups.
Affiliation(s)
- Jente Ottenburghs
- Resource Ecology Group, Wageningen University & Research, Droevendaalsesteeg 3a, 6708 PB, Wageningen, the Netherlands.
| | - Hendrik-Jan Megens
- Animal Breeding and Genomics, Wageningen University & Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands
| | - Robert H S Kraus
- Department of Migration and Immuno-Ecology, Max Planck Institute for Ornithology, Am Obstberg 1, D-78315, Radolfzell, Germany; Department of Biology, University of Konstanz, D-78457, Constance, Germany
| | - Pim van Hooft
- Resource Ecology Group, Wageningen University & Research, Droevendaalsesteeg 3a, 6708 PB, Wageningen, the Netherlands
| | - Sipke E van Wieren
- Resource Ecology Group, Wageningen University & Research, Droevendaalsesteeg 3a, 6708 PB, Wageningen, the Netherlands
| | - Richard P M A Crooijmans
- Animal Breeding and Genomics, Wageningen University & Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands
| | - Ronald C Ydenberg
- Resource Ecology Group, Wageningen University & Research, Droevendaalsesteeg 3a, 6708 PB, Wageningen, the Netherlands; Centre for Wildlife Ecology, Simon Fraser University, V5A 1S6, Burnaby, BC, Canada
| | - Martien A M Groenen
- Animal Breeding and Genomics, Wageningen University & Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands
| | - Herbert H T Prins
- Resource Ecology Group, Wageningen University & Research, Droevendaalsesteeg 3a, 6708 PB, Wageningen, the Netherlands
| |
|