1
McBroome J, Thornlow B, Hinrichs AS, Kramer A, De Maio N, Goldman N, Haussler D, Corbett-Detig R, Turakhia Y. A Daily-Updated Database and Tools for Comprehensive SARS-CoV-2 Mutation-Annotated Trees. Mol Biol Evol 2021; 38:5819-5824. [PMID: 34469548 PMCID: PMC8662617 DOI: 10.1093/molbev/msab264] [Citation(s) in RCA: 49] [Impact Index Per Article: 16.3] [Indexed: 12/17/2022]
Abstract
The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils, a command-line utility for rapidly querying, interpreting, and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.
Affiliation(s)
- Jakob McBroome
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Bryan Thornlow
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Angie S Hinrichs
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Alexander Kramer
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, United Kingdom
- Nick Goldman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, United Kingdom
- David Haussler
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Russell Corbett-Detig
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Yatish Turakhia
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
2
Hübner L, Kozlov AM, Hespe D, Sanders P, Stamatakis A. Exploring parallel MPI fault tolerance mechanisms for phylogenetic inference with RAxML-NG. Bioinformatics 2021; 37:4056-4063. [PMID: 34037680 PMCID: PMC9502163 DOI: 10.1093/bioinformatics/btab399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2021] [Revised: 05/10/2021] [Accepted: 05/25/2021] [Indexed: 11/18/2022]
Abstract
MOTIVATION: Phylogenetic trees are now routinely inferred on large-scale high-performance computing systems with thousands of cores, as the parallel scalability of phylogenetic inference tools has improved over the past years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, we explore parallel fault tolerance mechanisms and algorithms, the software modifications required and the performance penalties induced by enabling parallel fault tolerance, using the example of RAxML-NG, the successor of the widely used RAxML tool for maximum likelihood-based phylogenetic tree inference. RESULTS: We find that the slowdown induced by the necessary additional recovery mechanisms in RAxML-NG is on average 1.00 ± 0.04. The overall slowdown from using these recovery mechanisms in conjunction with a fault-tolerant Message Passing Interface implementation amounts to on average 1.7 ± 0.6 for large empirical datasets. Via failure simulations, we show that RAxML-NG can successfully recover from multiple simultaneous failures, subsequent failures, failures during recovery and failures during checkpointing. Recoveries are automatic and transparent to the user. AVAILABILITY AND IMPLEMENTATION: The modified fault-tolerant RAxML-NG code is available under GNU GPL at https://github.com/lukashuebner/ft-raxml-ng. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lukas Hübner
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Baden-Württemberg, Germany
- Alexey M Kozlov
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Baden-Württemberg, Germany
- Demian Hespe
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany
- Peter Sanders
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany
- Alexandros Stamatakis
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Baden-Württemberg, Germany
3
Ralph P, Thornton K, Kelleher J. Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes. Genetics 2020; 215:779-797. [PMID: 32357960 PMCID: PMC7337078 DOI: 10.1534/genetics.120.303253] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 09/23/2019] [Accepted: 04/28/2020] [Indexed: 12/11/2022]
Abstract
As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics' relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
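The weight-and-summary-function idea in this abstract can be illustrated on a single tree. The sketch below is a simplified, assumption-laden illustration and not the authors' implementation: the tree encoding (`parent` and `length` dictionaries) and the function name `branch_diversity` are hypothetical. Each sample carries weight 1, the weights are accumulated up the tree, and the summary function k(n - k) is applied to the number of samples k below each branch, giving a branch-mode analogue of mean pairwise diversity.

```python
from math import comb

def branch_diversity(parent, length, samples):
    """Branch-mode mean pairwise diversity on one tree.

    parent: dict mapping each non-root node to its parent (hypothetical encoding)
    length: dict mapping each non-root node to the length of the branch above it
    samples: list of sample (leaf) nodes
    """
    n = len(samples)
    below = {}  # number of samples at or below each non-root node
    for s in samples:
        node = s
        while node in parent:  # walk towards the root, crediting every branch passed
            below[node] = below.get(node, 0) + 1
            node = parent[node]
    # A branch with k samples below it separates k * (n - k) pairs of samples.
    total = sum(length[node] * k * (n - k) for node, k in below.items())
    return total / comb(n, 2)

# Toy tree ((A:1, B:1)X:1, C:2)R with three samples.
parent = {"A": "X", "B": "X", "X": "R", "C": "R"}
length = {"A": 1.0, "B": 1.0, "X": 1.0, "C": 2.0}
diversity = branch_diversity(parent, length, ["A", "B", "C"])
print(diversity)
```

On this toy tree the result equals the average path length between pairs of samples (2, 4 and 4 for A-B, A-C and B-C), which is the quantity the site statistic would estimate if mutations fell on branches in proportion to their length.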
Affiliation(s)
- Peter Ralph
  - Institute of Evolution and Ecology, Departments of Mathematics and Biology, University of Oregon, Eugene, Oregon 97405
- Kevin Thornton
  - Department of Ecology and Evolutionary Biology, University of California, Irvine, California 92697
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, United Kingdom OX3 7LF
4
Inferring whole-genome histories in large population datasets. Nat Genet 2019; 51:1330-1338. [PMID: 31477934 PMCID: PMC6726478 DOI: 10.1038/s41588-019-0483-y] [Citation(s) in RCA: 113] [Impact Index Per Article: 22.6] [Received: 02/20/2019] [Accepted: 07/15/2019] [Indexed: 01/01/2023]
Abstract
Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources. Here we introduce an algorithm that is able to not only infer whole-genome histories with comparable accuracy to the state-of-the-art but also process four orders of magnitude more sequences. The approach also provides an 'evolutionary encoding' of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.
5
Sherwin WB, Chao A, Jost L, Smouse PE. Information Theory Broadens the Spectrum of Molecular Ecology and Evolution. Trends Ecol Evol 2017; 32:948-963. [PMID: 29126564 DOI: 10.1016/j.tree.2017.09.012] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 03/06/2017] [Revised: 09/22/2017] [Accepted: 09/26/2017] [Indexed: 01/18/2023]
Abstract
Information or entropy analysis of diversity is used extensively in community ecology, and has recently been exploited for prediction and analysis in molecular ecology and evolution. Information measures belong to a spectrum (or q profile) of measures whose contrasting properties provide a rich summary of diversity, including allelic richness (q=0), Shannon information (q=1), and heterozygosity (q=2). We present the merits of information measures for describing and forecasting molecular variation within and among groups, comparing forecasts with data, and evaluating underlying processes such as dispersal. Importantly, information measures directly link causal processes and divergence outcomes, have straightforward relationship to allele frequency differences (including monotonicity that q=2 lacks), and show additivity across hierarchical layers such as ecology, behaviour, cellular processes, and nongenetic inheritance.
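The three points on the q profile named in this abstract can be computed from a vector of allele frequencies. The following is a minimal sketch under stated assumptions: the function names are hypothetical, natural logarithms are used for Shannon information, and the unifying "effective number" form (often called a Hill number) is included as one common way to place the three measures on a single scale, not as the paper's own code.

```python
import math

def normalise(freqs):
    """Drop zero entries and rescale so the frequencies sum to 1."""
    positive = [f for f in freqs if f > 0]
    total = sum(positive)
    return [f / total for f in positive]

def allelic_richness(freqs):
    """q = 0: number of alleles present."""
    return len(normalise(freqs))

def shannon_information(freqs):
    """q = 1: Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in normalise(freqs))

def heterozygosity(freqs):
    """q = 2: expected heterozygosity, 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in normalise(freqs))

def effective_number(freqs, q):
    """Effective number of alleles of order q (q = 1 handled as the limit)."""
    p = normalise(freqs)
    if q == 1:
        return math.exp(shannon_information(p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

freqs = [0.5, 0.25, 0.25]
profile = [effective_number(freqs, q) for q in (0, 1, 2)]
print(allelic_richness(freqs), shannon_information(freqs), heterozygosity(freqs))
print(profile)
```

Because rare alleles count fully at q = 0 and barely at q = 2, the effective number falls as q rises whenever the frequencies are unequal, which is the contrast among measures that the abstract describes.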
Affiliation(s)
- W B Sherwin
  - Evolution and Ecology Research Centre, School of Biological Earth and Environmental Science, University of New South Wales, Sydney, NSW 2052, Australia
  - Murdoch University Cetacean Research Unit, Murdoch University, South Road, Murdoch, WA 6150, Australia
- A Chao
  - Institute of Statistics, National Tsing Hua University, Hsin-Chu 30043, Taiwan
- L Jost
  - EcoMinga Foundation, Via a Runtun, Baños, Tungurahua, Ecuador
- P E Smouse
  - Department of Ecology, Evolution and Natural Resources, School of Environmental and Biological Sciences, Rutgers University, New Brunswick, NJ 08901-8551, USA
6
Sherwin WB. Genes are information, so information theory is coming to the aid of evolutionary biology. Mol Ecol Resour 2016; 15:1259-61. [PMID: 26452559 DOI: 10.1111/1755-0998.12458] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 08/03/2015] [Accepted: 08/17/2015] [Indexed: 11/28/2022]
Abstract
Speciation is central to evolutionary biology, and to elucidate it, we need to catch the early genetic changes that set nascent taxa on their path to species status (Via 2009). That challenge is difficult, of course, for two chief reasons: (i) serendipity is required to catch speciation in the act; and (ii) after a short time span with lingering gene flow, differentiation may be low and/or embodied only in rare alleles that are difficult to sample. In this issue of Molecular Ecology Resources, Smouse et al. (2015) have noted that optimal assessment of differentiation within and between nascent species should be robust to these challenges, and they identified a measure based on Shannon's information theory that has many advantages for this and numerous other tasks. The Shannon measure exhibits complete additivity of information at different levels of subdivision. Of all the family of diversity measures ('0' or allele counts, '1' or Shannon, '2' or heterozygosity, F(ST) and related metrics), Shannon's measure comes closest to weighting alleles by their frequencies. For the Shannon measure, rare alleles that represent early signals of nascent speciation are neither down-weighted to the point of irrelevance, as for level 2 measures, nor up-weighted to overpowering importance, as for level 0 measures (Chao et al. 2010, 2015). Shannon measures have a long history in population genetics, dating back to Shannon's PhD thesis in 1940 (Crow 2001), but have received only sporadic attention, until a resurgence of interest in the last ten years, as reviewed briefly by Smouse et al. (2015).
Affiliation(s)
- William B Sherwin
  - Evolution and Ecology Research Centre, University of NSW, Sydney, NSW, 2052, Australia
  - Murdoch University Cetacean Research Unit, Murdoch University, South Road, Murdoch, WA, 6150, Australia
7
Implementing and testing the multispecies coalescent model: A valuable paradigm for phylogenomics. Mol Phylogenet Evol 2016; 94:447-62. [DOI: 10.1016/j.ympev.2015.10.027] [Citation(s) in RCA: 265] [Impact Index Per Article: 33.1] [Indexed: 01/19/2023]
8
Cohen AR, Vitányi PM. Normalized Compression Distance of Multisets with Applications. IEEE Trans Pattern Anal Mach Intell 2015; 37:1602-14. [PMID: 26352998 PMCID: PMC4566858 DOI: 10.1109/tpami.2014.2375175] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 05/23/2023]
Abstract
Pairwise normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free, similarity metric based on compression. We propose an NCD of multisets that is also metric. Previously, attempts to obtain such an NCD failed. For classification purposes it is superior to the pairwise NCD in accuracy and implementation complexity. We cover the entire trajectory from theoretical underpinning to feasible practice. It is applied to biological (stem cell, organelle transport) and OCR classification questions that were earlier treated with the pairwise NCD. With the new method we achieved significantly better results. The theoretic foundation is Kolmogorov complexity.
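For orientation, the pairwise NCD that this abstract builds on is usually written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length under a real compressor standing in for Kolmogorov complexity. The sketch below covers only that pairwise form, not the multiset extension the paper proposes. The choice of zlib as the compressor and the helper names are assumptions made for illustration.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for C)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Pairwise normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# A highly repetitive sequence compared with itself and with pseudo-random bytes.
repetitive = b"ACGT" * 200
rng = random.Random(0)  # fixed seed so the example is reproducible
noise = bytes(rng.randrange(256) for _ in range(800))

near = ncd(repetitive, repetitive)
far = ncd(repetitive, noise)
print(near, far)
```

Values close to 0 indicate that one input adds little new information given the other, and values close to 1 indicate that the inputs share almost nothing the compressor can exploit. With real compressors the values are approximate and can slightly exceed 1.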
Affiliation(s)
- Andrew R. Cohen
  - Department of Electrical and Computer Engineering, Drexel University. Address: A.R. Cohen, 3120–40 Market Street, Suite 313, Philadelphia, PA 19104, USA
- Paul M.B. Vitányi
  - National research center for mathematics and computer science in the Netherlands (CWI), and the University of Amsterdam. Address: CWI, Science Park 123, 1098XG Amsterdam, The Netherlands
9
Stenz NWM, Larget B, Baum DA, Ané C. Exploring Tree-Like and Non-Tree-Like Patterns Using Genome Sequences: An Example Using the Inbreeding Plant Species Arabidopsis thaliana (L.) Heynh. Syst Biol 2015; 64:809-23. [DOI: 10.1093/sysbio/syv039] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Received: 11/11/2014] [Accepted: 06/04/2015] [Indexed: 11/14/2022]
10
McTavish EJ, Hinchliff CE, Allman JF, Brown JW, Cranston KA, Holder MT, Rees JA, Smith SA. Phylesystem: a git-based data store for community-curated phylogenetic estimates. Bioinformatics 2015; 31:2794-800. [PMID: 25940563 PMCID: PMC4547614 DOI: 10.1093/bioinformatics/btv276] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 01/15/2015] [Accepted: 04/27/2015] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Phylogenetic estimates from published studies can be archived using general platforms like Dryad (Vision, 2010) or TreeBASE (Sanderson et al., 1994). Such services fulfill a crucial role in ensuring transparency and reproducibility in phylogenetic research. However, digital tree data files often require some editing (e.g. rerooting) to improve the accuracy and reusability of the phylogenetic statements. Furthermore, establishing the mapping between tip labels used in a tree and taxa in a single common taxonomy dramatically improves the ability of other researchers to reuse phylogenetic estimates. As the process of curating a published phylogenetic estimate is not error-free, retaining a full record of the provenance of edits to a tree is crucial for openness, allowing editors to receive credit for their work and making errors introduced during curation easier to correct. RESULTS: Here, we report the development of software infrastructure to support the open curation of phylogenetic data by the community of biologists. The backend of the system provides an interface for the standard database operations of creating, reading, updating and deleting records by making commits to a git repository. The record of the history of edits to a tree is preserved by git's version control features. Hosting this data store on GitHub (http://github.com/) provides open access to the data store using tools familiar to many developers. We have deployed a server running the 'phylesystem-api', which wraps the interactions with git and GitHub. The Open Tree of Life project has also developed and deployed a JavaScript application that uses the phylesystem-api and other web services to enable input and curation of published phylogenetic statements. AVAILABILITY AND IMPLEMENTATION: Source code for the web service layer is available at https://github.com/OpenTreeOfLife/phylesystem-api. The data store can be cloned from: https://github.com/OpenTreeOfLife/phylesystem. A web application that uses the phylesystem web services is deployed at http://tree.opentreeoflife.org/curator. Code for that tool is available from https://github.com/OpenTreeOfLife/opentree. CONTACT: mtholder@gmail.com.
Affiliation(s)
- Emily Jane McTavish
  - Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS, USA
  - Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
- Cody E Hinchliff
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
- Joseph W Brown
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
- Karen A Cranston
  - National Evolutionary Synthesis Center, Duke University, Durham, NC, USA
- Mark T Holder
  - Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS, USA
  - Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
- Jonathan A Rees
  - National Evolutionary Synthesis Center, Duke University, Durham, NC, USA
- Stephen A Smith
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
11
McMahon MM, Deepak A, Fernández-Baca D, Boss D, Sanderson MJ. STBase: one million species trees for comparative biology. PLoS One 2015; 10:e0117987. [PMID: 25679219 PMCID: PMC4332655 DOI: 10.1371/journal.pone.0117987] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 06/13/2014] [Accepted: 01/05/2015] [Indexed: 11/29/2022]
Abstract
Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user's query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. 
It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.
Affiliation(s)
- Michelle M. McMahon
  - School of Plant Sciences, University of Arizona, Tucson, AZ, 85721, United States of America
- Akshay Deepak
  - Department of Computer Science, Iowa State University, Ames, IA, 50011, United States of America
- David Fernández-Baca
  - Department of Computer Science, Iowa State University, Ames, IA, 50011, United States of America
- Darren Boss
  - Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, 85721, United States of America
- Michael J. Sanderson
  - Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, 85721, United States of America
12
Morales-Cazan A, Albert JS. Monophyly of Heterandriini (Teleostei: Poeciliidae) revisited: a critical review of the data. Neotrop Ichthyol 2012. [DOI: 10.1590/s1679-62252012000100003] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/22/2022]
Abstract
The systematics and taxonomy of poeciliid fishes (guppies and allies) remain poorly understood despite the relative importance of these species as model systems in the biological sciences. This study focuses on testing the monophyly of the nominal poeciliine tribe Heterandriini and the genus Heterandria, through examination of the morphological characters on which the current classification is based. These characters include aspects of body shape (morphometrics), scale and fin-ray counts (meristics), pigmentation, the cephalic laterosensory system, and osteological features of the neurocranium, oral jaws and suspensorium, branchial basket, pectoral girdle, and the gonopodium and its supports. A Maximum Parsimony analysis was conducted of 150 characters coded for 56 poeciliid and outgroup species, including 22 of 45 heterandriin species (from the accounted in Parenti & Rauchenberger, 1989), or seven of nine heterandriin species (from the accounted in Lucinda & Reis, 2005). Multistate characters were analyzed as both unordered and ordered, and iterative a posteriori weighting was used to improve tree resolution. Tree topologies obtained from these analyses support the monophyly of the Middle American species of "Heterandria," which based on available phylogenetic information, are herein reassigned to the genus Pseudoxiphophorus. None of the characters used in previous studies to characterize the nominal taxon Heterandriini are found to be unambiguously diagnostic. Some of these characters are shared with species in other poeciliid tribes, and others are reversed within the Heterandriini. These results support the hypothesis that Pseudoxiphophorus is monophyletic, and that this clade is not the closest relative of H. formosa (the type species) from southeastern North America. Available morphological data are not sufficient to assess the phylogenetic relationships of H. formosa with respect to other members of the Heterandriini. 
The results further suggest that most tribe-level taxa of the Poeciliinae are not monophyletic, and that further work remains to resolve the evolutionary relationships of this group.
13
Escobar JS, Scornavacca C, Cenci A, Guilhaumon C, Santoni S, Douzery EJP, Ranwez V, Glémin S, David J. Multigenic phylogeny and analysis of tree incongruences in Triticeae (Poaceae). BMC Evol Biol 2011; 11:181. [PMID: 21702931 PMCID: PMC3142523 DOI: 10.1186/1471-2148-11-181] [Citation(s) in RCA: 66] [Impact Index Per Article: 5.1] [Received: 01/25/2011] [Accepted: 06/24/2011] [Indexed: 11/30/2022]
Abstract
Background Introgressive events (e.g., hybridization, gene flow, horizontal gene transfer) and incomplete lineage sorting of ancestral polymorphisms are a challenge for phylogenetic analyses since different genes may exhibit conflicting genealogical histories. Grasses of the Triticeae tribe provide a particularly striking example of incongruence among gene trees. Previous phylogenies, mostly inferred with one gene, are in conflict for several taxon positions. Therefore, obtaining a resolved picture of relationships among genera and species of this tribe has been a challenging task. Here, we obtain the most comprehensive molecular dataset to date in Triticeae, including one chloroplastic and 26 nuclear genes. We aim to test whether it is possible to infer phylogenetic relationships in the face of (potentially) large-scale introgressive events and/or incomplete lineage sorting; to identify parts of the evolutionary history that have not evolved in a tree-like manner; and to decipher the biological causes of gene-tree conflicts in this tribe. Results We obtain resolved phylogenetic hypotheses using the supermatrix and Bayesian Concordance Factors (BCF) approaches despite numerous incongruences among gene trees. These phylogenies suggest the existence of 4-5 major clades within Triticeae, with Psathyrostachys and Hordeum being the deepest genera. In addition, we construct a multigenic network that highlights parts of the Triticeae history that have not evolved in a tree-like manner. Dasypyrum, Heteranthelium and genera of clade V, grouping Secale, Taeniatherum, Triticum and Aegilops, have evolved in a reticulated manner. Their relationships are thus better represented by the multigenic network than by the supermatrix or BCF trees. Noteworthy, we demonstrate that gene-tree incongruences increase with genetic distance and are greater in telomeric than centromeric genes. 
Together, our results suggest that recombination is the main factor decoupling gene trees from multigenic trees. Conclusions Our study is the first to propose a comprehensive, multigenic phylogeny of Triticeae. It clarifies several aspects of the relationships among genera and species of this tribe, and pinpoints biological groups with likely reticulate evolution. Importantly, this study extends previous results obtained in Drosophila by demonstrating that recombination can exacerbate gene-tree conflicts in phylogenetic reconstructions.
Affiliation(s)
- Juan S Escobar
  - Institut National de la Recherche Agronomique, Centre de Montpellier, UMR Diversité et Adaptation des Plantes Cultivées, Domaine de Melgueil, 34130 Mauguio, France
14
Ané C. Detecting phylogenetic breakpoints and discordance from genome-wide alignments for species tree reconstruction. Genome Biol Evol 2011; 3:246-58. [PMID: 21362638 PMCID: PMC3070431 DOI: 10.1093/gbe/evr013] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022]
Abstract
With the easy acquisition of sequence data, it is now possible to obtain and align whole genomes across multiple related species or populations. In this work, I assess the performance of a statistical method to reconstruct the whole distribution of phylogenetic trees along the genome, estimate the proportion of the genome for which a given clade is true, and infer a concordance tree that summarizes the dominant vertical inheritance pattern. There are two main issues when dealing with whole-genome alignments, as opposed to multiple genes: the size of the data and the detection of recombination breakpoints. These breakpoints partition the genomic alignment into phylogenetically homogeneous loci, where sites within a given locus all share the same phylogenetic tree topology. To delimitate these loci, I describe here a method based on the minimum description length (MDL) principle, implemented with dynamic programming for computational efficiency. Simulations show that combining MDL partitioning with Bayesian concordance analysis provides an efficient and robust way to estimate both the vertical inheritance signal and the horizontal phylogenetic signal. The method performed well both in the presence of incomplete lineage sorting and in the presence of horizontal gene transfer. A high level of systematic bias was found here, highlighting the need for good individual tree building methods, which form the basis for more elaborate gene tree/species tree reconciliation methods.
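The general shape of a penalised partition solved by dynamic programming, as mentioned in this abstract, can be sketched with a toy cost. Everything specific here is an assumption made for illustration and is not the paper's description-length criterion: each site is given a hypothetical "preferred topology" label, the data cost of a segment is the number of sites that disagree with the segment's majority label, and the model cost is a fixed penalty per segment.

```python
def mdl_partition(labels, penalty):
    """Split `labels` into contiguous segments minimising
    (mismatches against each segment's majority label) + penalty * (number of segments).

    Returns a list of (start, end) index pairs with `end` exclusive.
    """
    n = len(labels)

    def data_cost(i, j):
        # Toy stand-in for the cost of encoding sites i..j-1 under one tree.
        segment = labels[i:j]
        return len(segment) - max(segment.count(v) for v in set(segment))

    best = [0.0] + [float("inf")] * n  # best[j]: minimal total cost of labels[:j]
    back = [0] * (n + 1)               # back[j]: start of the last segment in that optimum
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + data_cost(i, j) + penalty
            if cost < best[j]:
                best[j], back[j] = cost, i

    segments, j = [], n
    while j > 0:  # follow back-pointers from the end to recover the breakpoints
        segments.append((back[j], j))
        j = back[j]
    return segments[::-1]

sites = list("AAAAABBBBB")
two_loci = mdl_partition(sites, penalty=1.5)
one_locus = mdl_partition(sites, penalty=10.0)
print(two_loci, one_locus)
```

A small penalty lets the breakpoint between the two runs of labels pay for itself, while a large penalty makes a single locus cheaper. This mirrors the trade-off between model complexity and fit that the MDL principle formalises. The double loop is quadratic in the number of sites, so this is a sketch of the recurrence rather than a genome-scale implementation.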
Affiliation(s)
- Cécile Ané
- Departments of Statistics and Botany, University of Wisconsin-Madison, USA.
|
15
|
Entropy and Information Approaches to Genetic Diversity and its Expression: Genomic Geography. Entropy 2010. [DOI: 10.3390/e12071765] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 01/03/2023]
|
16
|
White MA, Ané C, Dewey CN, Larget BR, Payseur BA. Fine-scale phylogenetic discordance across the house mouse genome. PLoS Genet 2009; 5:e1000729. [PMID: 19936022 PMCID: PMC2770633 DOI: 10.1371/journal.pgen.1000729] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.4] [Received: 04/08/2009] [Accepted: 10/19/2009] [Indexed: 11/18/2022]
Abstract
Population genetic theory predicts discordance in the true phylogeny of different genomic regions when studying recently diverged species. Despite this expectation, genome-wide discordance in young species groups has rarely been statistically quantified. The house mouse subspecies group provides a model system for examining phylogenetic discordance. House mouse subspecies are recently derived, suggesting that even if there has been a simple tree-like population history, gene trees could disagree with the population history due to incomplete lineage sorting. Subspecies of house mice also hybridize in nature, raising the possibility that recent introgression might lead to additional phylogenetic discordance. Single-locus approaches have revealed support for conflicting topologies, resulting in a subspecies tree often summarized as a polytomy. To analyze phylogenetic histories on a genomic scale, we applied a recently developed method, Bayesian concordance analysis, to dense SNP data from three closely related subspecies of house mice: Mus musculus musculus, M. m. castaneus, and M. m. domesticus. We documented substantial variation in phylogenetic history across the genome. Although each of the three possible topologies was strongly supported by a large number of loci, there was statistical evidence for a primary phylogenetic history in which M. m. musculus and M. m. castaneus are sister subspecies. These results underscore the importance of measuring phylogenetic discordance in other recently diverged groups using methods such as Bayesian concordance analysis, which are designed for this purpose.
Affiliation(s)
- Michael A. White
- Laboratory of Genetics, University of Wisconsin, Madison, Wisconsin, United States of America
- Cécile Ané
- Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America
- Department of Botany, University of Wisconsin, Madison, Wisconsin, United States of America
- Colin N. Dewey
- Department of Biostatistics, University of Wisconsin, Madison, Wisconsin, United States of America
- Department of Medical Informatics, University of Wisconsin, Madison, Wisconsin, United States of America
- Department of Computer Sciences, University of Wisconsin, Madison, Wisconsin, United States of America
- Bret R. Larget
- Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America
- Department of Botany, University of Wisconsin, Madison, Wisconsin, United States of America
- Bret A. Payseur
- Laboratory of Genetics, University of Wisconsin, Madison, Wisconsin, United States of America
|
17
|
Abstract
Because species names play an important role in scientific communication, it is more important that species be understood to be taxa than that they be equated with functional ecological or evolutionary entities. Although most biologists would agree that taxa are composed of organisms that share a unique common history, 2 major challenges remain in developing a species-as-taxa concept. First, grouping: in the face of genealogical discordance at all levels in the taxonomic hierarchy, how can we understand the nature of taxa? Second, ranking: what criteria should be used to designate certain taxa in a nested series as being species? The grouping problem can be solved by viewing taxa as exclusive groups of organisms, that is, sets of organisms that form a clade for a plurality of the genome (more than any conflicting set). However, no single objective criterion of species rank can be proposed. Instead, the species rank should be assigned by practitioners based on the semisubjective application of a set of species-ranking criteria. Although these criteria can be designed to yield species taxa that approximately match the ecological, evolutionary, and morphological entities that taxonomists have traditionally associated with the species rank, such a correspondence cannot be enforced without undermining the assumption that species are taxa. The challenge and art of monography is to use genealogical and other kinds of data to assign all organisms to one and only one species-ranked taxon. Various implications of the species-as-ranked-taxa view are discussed, including the synchronic nature of taxa, fossil species, the treatment of hybrids, and species nomenclature. I conclude that, although challenges remain, adopting the view that species are ranked taxa will facilitate a much-needed revolution in taxonomy that will allow it to better serve the biodiversity informatic needs of the 21st century.
Affiliation(s)
- David A Baum
- Department of Botany, University of Wisconsin-Madison, 430 Lincoln Drive, Madison, WI 53706, USA.
|
18
|
Treangen TJ, Darling AE, Achaz G, Ragan MA, Messeguer X, Rocha EPC. A novel heuristic for local multiple alignment of interspersed DNA repeats. IEEE/ACM Trans Comput Biol Bioinform 2009; 6:180-189. [PMID: 19407343 DOI: 10.1109/tcbb.2009.9] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 05/27/2023]
Abstract
Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from: http://wwwabi.snv.jussieu.fr/public/Repeatoire.
|
19
|
Chen D, Burleigh GJ, Fernández-Baca D. Spectral partitioning of phylogenetic data sets based on compatibility. Syst Biol 2007; 56:623-32. [PMID: 17654366 DOI: 10.1080/10635150701499571] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Abstract
We describe two new methods to partition phylogenetic data sets of discrete characters based on pairwise compatibility. The partitioning methods make no assumptions regarding the phylogeny, model of evolution, or characteristics of the data. The methods first build a compatibility graph, in which each node represents a character in the data set. Edges in the compatibility graph may represent strict compatibility of characters or they may be weighted based on a fractional compatibility scoring procedure that measures how close the characters are to being compatible. Given the desired number of partitions, the partitioning methods then seek to cluster the characters with the highest average pairwise compatibility, so that characters in each cluster are more compatible with each other than they are with characters in the other cluster(s). Partitioning according to these criteria is computationally intractable (NP-hard); however, spectral methods can quickly provide high-quality solutions. We demonstrate that the spectral partitioning effectively identifies characters with different evolutionary histories in simulated data sets, and it is better at highlighting phylogenetic conflict within empirical data sets than previously used partitioning methods.
Affiliation(s)
- Duhong Chen
- Department of Computer Science, Iowa State University, Ames, IA 50011, USA
|
20
|
Darling AE, Treangen TJ, Zhang L, Kuiken C, Messeguer X, Perna NT. Procrastination Leads to Efficient Filtration for Local Multiple Alignment. Lecture Notes in Computer Science 2006. [DOI: 10.1007/11851561_12] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 01/28/2023]
|