1
De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets. PLoS Comput Biol 2022; 18:e1010056. [PMID: 35486906 PMCID: PMC9094560 DOI: 10.1371/journal.pcbi.1010056] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/24/2021] [Revised: 05/11/2022] [Accepted: 03/25/2022] [Indexed: 11/26/2022]
Abstract
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is, however, testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
Affiliation(s)
- Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- William Boulton
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- Lukas Weilguny
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- Conor R. Walker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
- Department of Genetics, University of Cambridge, Cambridge, United Kingdom
- Yatish Turakhia
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, California, United States of America
- Russell Corbett-Detig
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Genomics Institute, University of California Santa Cruz, Santa Cruz, California, United States of America
- Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United Kingdom
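The Gillespie approach named in the phastSim abstract can be illustrated with a minimal sketch. This is not phastSim's code: it assumes a JC69-like model in which every site mutates at the same rate, so the mutating site can be drawn uniformly, and it omits the multi-layered search tree that the abstract describes. The function name `gillespie_branch` and its parameters are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def gillespie_branch(seq, branch_length, rate=1.0, rng=None):
    """Evolve `seq` along one branch with a Gillespie-style event loop.

    Assumption (not from the paper): all sites share the same total
    substitution rate, and the three other nucleotides are equally likely.
    Returns the evolved sequence and a list of (time, site, old, new) events.
    """
    rng = rng or random.Random()
    seq = list(seq)
    total_rate = rate * len(seq)  # stays constant because rates are uniform
    events = []
    t = 0.0
    while total_rate > 0:
        # Waiting time until the next mutation anywhere in the sequence.
        t += rng.expovariate(total_rate)
        if t > branch_length:
            break
        # With uniform rates, the mutating site is drawn uniformly.
        site = rng.randrange(len(seq))
        old = seq[site]
        new = rng.choice([n for n in NUCLEOTIDES if n != old])
        seq[site] = new
        events.append((t, site, old, new))
    return "".join(seq), events
```

The loop does work proportional to the number of mutation events rather than to the sequence length, which is why the approach suits short branches. When rates differ between sites, the uniform draw would have to be replaced by a rate-weighted search, which is presumably the role of the search tree structure mentioned in the abstract.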
2
De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. phastSim: efficient simulation of sequence evolution for pandemic-scale datasets. bioRxiv 2021:2021.03.15.435416. [PMID: 33758852 PMCID: PMC7987011 DOI: 10.1101/2021.03.15.435416] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/30/2022]
Abstract
Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is, however, testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
Affiliation(s)
- Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- William Boulton
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Lukas Weilguny
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Conor R. Walker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, UK
- Yatish Turakhia
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA
- Russell Corbett-Detig
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
3
Davín AA, Tricou T, Tannier E, de Vienne DM, Szöllősi GJ. Zombi: a phylogenetic simulator of trees, genomes and sequences that accounts for dead linages. Bioinformatics 2020; 36:1286-1288. [PMID: 31566657 PMCID: PMC7031779 DOI: 10.1093/bioinformatics/btz710] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 04/11/2019] [Revised: 09/09/2019] [Accepted: 09/26/2019] [Indexed: 11/14/2022]
Abstract
Summary: Here we present Zombi, a tool to simulate the evolution of species, genomes and sequences in silico, that considers for the first time the evolution of genomes in extinct lineages. It also incorporates various features that have not to date been combined in a single simulator, such as the possibility of generating species trees with a pre-defined variation of speciation and extinction rates through time, simulating explicitly intergenic sequences of variable length and outputting gene tree/species tree reconciliations. Availability and implementation: Source code and manual are freely available at https://github.com/AADavin/ZOMBI/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Adrián A Davín
- MTA-ELTE Lendület Evolutionary Genomics Research Group, Budapest, Hungary; Department of Biological Physics, Eötvös Loránd, Budapest, Hungary
- Théo Tricou
- Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR5558, Villeurbanne F-69622, France
- Eric Tannier
- Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR5558, Villeurbanne F-69622, France; INRIA Grenoble Rhône-Alpes, Montbonnot-Saint-Martin F-38334, France
- Damien M de Vienne
- Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR5558, Villeurbanne F-69622, France
- Gergely J Szöllősi
- MTA-ELTE Lendület Evolutionary Genomics Research Group, Budapest, Hungary; Department of Biological Physics, Eötvös Loránd, Budapest, Hungary; Evolutionary Systems Research Group, Centre for Ecological Research, Hungarian Academy of Sciences, Tihany H-8237, Hungary
4
Bonnici V, Maresi E, Giugno R. Challenges in gene-oriented approaches for pangenome content discovery. Brief Bioinform 2020; 22:5901976. [PMID: 32893299 DOI: 10.1093/bib/bbaa198] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/28/2020] [Revised: 05/14/2020] [Accepted: 08/04/2020] [Indexed: 01/17/2023]
Abstract
Given a group of genomes, represented as the sets of genes that belong to them, the discovery of the pangenomic content is based on the search for genetic homology among the genes in order to cluster them into families. Thus, pangenomic analyses investigate the membership of the families to the given genomes. This approach is referred to as the gene-oriented approach, in contrast to other definitions of the problem that take into account different genomic features. In the past years, several tools have been developed to discover and analyse pangenomic contents. Because of the hardness of the problem, each tool applies a different strategy for discovering the pangenomic content. As a result, the performance of each tool depends on the composition of the input genomes. This review reports the main analysis instruments provided by the current state-of-the-art tools for the discovery of pangenomic contents. Moreover, unlike previous works, the presented study compares pangenomic tools from a methodological perspective, analysing the causes that lead a given methodology to outperform other tools. The analysis is performed by taking into account different bacterial populations, which are synthetically generated by changing evolutionary parameters. The benchmarks used to compare the pangenomic tools, in addition to the computational pipeline developed for this purpose, are available at https://github.com/InfOmics/pangenes-review. Contact: V. Bonnici, R. Giugno. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
Affiliation(s)
- Emiliano Maresi
- The Microsoft Research, University of Trento Centre for Computational and Systems Biology
- Rosalba Giugno
- Computer Science and Bioinformatics, referent of the Master Degree in Medical Bioinformatics
5
Bobay LM. CoreSimul: a forward-in-time simulator of genome evolution for prokaryotes modeling homologous recombination. BMC Bioinformatics 2020; 21:264. [PMID: 32580695 PMCID: PMC7315543 DOI: 10.1186/s12859-020-03619-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/02/2020] [Accepted: 06/19/2020] [Indexed: 12/26/2022]
Abstract
Background: Prokaryotes are asexual, but these organisms frequently engage in homologous recombination, a process that differs from meiotic recombination in sexual organisms. Most tools developed to simulate genome evolution either assume sexual reproduction or the complete absence of DNA flux in the population. As a result, very few simulators are adapted to model prokaryotic genome evolution while accounting for recombination. Moreover, many simulators are based on the coalescent, which assumes a neutral model of genomic evolution, and those are best suited for organisms evolving under weak selective pressures, such as animals and plants. In contrast, prokaryotes are thought to be evolving under much stronger selective pressures, suggesting that forward-in-time simulators are better suited for these organisms. Results: Here, I present CoreSimul, a forward-in-time simulator of core genome evolution for prokaryotes modeling homologous recombination. Simulations are guided by a phylogenetic tree and incorporate different substitution models, including models of codon selection. Conclusions: CoreSimul is a flexible forward-in-time simulator that constitutes a significant addition to the limited list of available simulators applicable to prokaryote genome evolution.
Affiliation(s)
- Louis-Marie Bobay
- Department of Biology, University of North Carolina Greensboro, 321 McIver Street, PO Box 26170, Greensboro, NC, 27402, USA
6
Kundu S, Bansal MS. SaGePhy: an improved phylogenetic simulation framework for gene and subgene evolution. Bioinformatics 2019; 35:3496-3498. [PMID: 30715213 DOI: 10.1093/bioinformatics/btz081] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 10/22/2018] [Revised: 01/21/2019] [Accepted: 01/31/2019] [Indexed: 11/14/2022]
Abstract
SUMMARY: SaGePhy is a software package for improved phylogenetic simulation of gene and subgene evolution. SaGePhy can be used to generate species trees, gene trees and subgene or (protein) domain trees using a probabilistic birth-death process that allows for gene and subgene duplication, horizontal gene and subgene transfer and gene and subgene loss. SaGePhy implements a range of important features not found in other phylogenetic simulation frameworks/software. These include (i) simulation of subgene or domain level evolution inside one or more gene trees, (ii) simultaneous simulation of both additive and replacing horizontal gene/subgene transfers and (iii) probabilistic sampling of species tree and gene tree nodes, respectively, for gene- and domain-family birth. SaGePhy is open-source, platform independent and written in Java and Python. AVAILABILITY AND IMPLEMENTATION: Executables, source code (open-source under the revised BSD license) and a detailed manual are freely available from http://compbio.engr.uconn.edu/software/sagephy/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Soumya Kundu
- Department of Computer Science & Engineering, Storrs, CT, USA
- Mukul S Bansal
- Department of Computer Science & Engineering, Storrs, CT, USA; The Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
7
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform 2019; 20:426-435. [PMID: 28673025 PMCID: PMC6433738 DOI: 10.1093/bib/bbx067] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 02/08/2017] [Revised: 05/04/2017] [Indexed: 11/22/2022]
Abstract
We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.
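As a rough illustration of what a k-mer-based alignment-free comparison looks like, the sketch below counts overlapping k-mers in two sequences and compares the count profiles with a cosine distance. The choice of cosine distance, the default `k=3`, and the function names are assumptions made for illustration; the review covers many different k-mer statistics and this is not a specific method taken from it.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k):
    """Count every overlapping substring of length k in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_distance(a, b, k=3):
    """Return 1 - cosine similarity of the k-mer count profiles of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Only k-mers present in both profiles contribute to the dot product.
    dot = sum(ca[m] * cb[m] for m in ca.keys() & cb.keys())
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # a sequence shorter than k has no k-mers to compare
    return 1.0 - dot / (norm_a * norm_b)
```

No alignment is computed at any point: each sequence is reduced to a count vector once, so pairwise distances for many genomes can be obtained from the stored profiles and passed to a distance-based tree-building method.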
8
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019; 20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 101] [Impact Index Per Article: 20.2] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
- Hani Z Girgis
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
- Chris-Andre Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Kujin Tang
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Anna Katharina Lau
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Sophie Röhling
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Jae Jin Choi
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Michael S Waterman
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
- Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
- Sung-Hou Kim
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Susana Vinga
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas S Almeida
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NIH/NCI), Bethesda, USA
- Cheong Xin Chan
- Institute for Molecular Bioscience, and School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
- Benjamin T James
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
- Fengzhu Sun
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
- Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
9
Anselmetti Y, Luhmann N, Bérard S, Tannier E, Chauve C. Comparative Methods for Reconstructing Ancient Genome Organization. Methods Mol Biol 2018; 1704:343-362. [PMID: 29277873 DOI: 10.1007/978-1-4939-7463-4_13] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/03/2022]
Abstract
Comparative genomics concerns the detection of similarities and differences between extant genomes and, based on more or less formalized hypotheses regarding the evolutionary processes involved, the inference of ancestral states that explain the similarities and of an evolutionary history that explains the differences. In this chapter, we focus on the reconstruction of the organization of ancient genomes into chromosomes. We review different methodological approaches and software, applied to a wide range of datasets from different kingdoms of life and at different evolutionary depths. We discuss relations with genome assembly, and potential approaches to validate computational predictions on ancient genomes that are almost always only accessible through these predictions.
Affiliation(s)
- Yoann Anselmetti
- Institut des Sciences de l'Évolution, Université Montpellier 2, Montpellier, France
- Nina Luhmann
- Faculty of Technology, Bielefeld University, Bielefeld, Germany; Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany; International Research Training Group 1906, Bielefeld University, Bielefeld, Germany
- Sèverine Bérard
- Institut des Sciences de l'Évolution, Université Montpellier 2, Montpellier, France
- Eric Tannier
- UMR CNRS 5558 - LBBE "Biométrie et Biologie Évolutive", Inria Grenoble Rhône-Alpes and University of Lyon, Lyon, France
- Cedric Chauve
- Department of Mathematics, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada, V5A 1S6
10
Bernard G, Chan CX, Ragan MA. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer. Sci Rep 2016; 6:28970. [PMID: 27363362 PMCID: PMC4929450 DOI: 10.1038/srep28970] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 03/16/2016] [Accepted: 06/13/2016] [Indexed: 12/22/2022]
Abstract
Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution.
Affiliation(s)
- Guillaume Bernard
- Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, QLD 4072, Australia
- Cheong Xin Chan
- Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, QLD 4072, Australia
- Mark A. Ragan
- Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, QLD 4072, Australia
11
12
Mallo D, De Oliveira Martins L, Posada D. SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees. Syst Biol 2015; 65:334-44. [PMID: 26526427 PMCID: PMC4748750 DOI: 10.1093/sysbio/syv082] [Citation(s) in RCA: 76] [Impact Index Per Article: 8.4] [Received: 06/30/2015] [Accepted: 10/20/2015] [Indexed: 11/14/2022]
Abstract
We present a fast and flexible software package, SimPhy, for the simulation of multiple gene families evolving under incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer (all three potentially leading to species tree/gene tree discordance), as well as gene conversion. SimPhy implements a hierarchical phylogenetic model in which the evolution of species, locus, and gene trees is governed by global and local parameters (e.g., genome-wide, species-specific, locus-specific) that can be fixed or sampled from a priori statistical distributions. SimPhy also incorporates comprehensive models of substitution rate variation among lineages (uncorrelated relaxed clocks) and the capability of simulating partitioned nucleotide, codon, and protein multilocus sequence alignments under a plethora of substitution models using the program INDELible. We validate SimPhy's output using theoretical expectations and other programs, and show that it scales extremely well with complex models and/or large trees, being an order of magnitude faster than the most similar program (DLCoal-Sim). In addition, we demonstrate how SimPhy can be useful to understand interactions among different evolutionary processes, conducting a simulation study to characterize the systematic overestimation of the duplication time when using standard reconciliation methods. SimPhy is available at https://github.com/adamallo/SimPhy, where users can find the source code, precompiled executables, a detailed manual and example cases.
Affiliation(s)
- Diego Mallo
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain
- David Posada
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain
13
Spielman SJ, Wilke CO. Pyvolve: A Flexible Python Module for Simulating Sequences along Phylogenies. PLoS One 2015; 10:e0139047. [PMID: 26397960 PMCID: PMC4580465 DOI: 10.1371/journal.pone.0139047] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.9] [Received: 06/15/2015] [Accepted: 09/07/2015] [Indexed: 11/19/2022]
Abstract
We introduce Pyvolve, a flexible Python module for simulating genetic data along a phylogeny using continuous-time Markov models of sequence evolution. Easily incorporated into Python bioinformatics pipelines, Pyvolve can simulate sequences according to most standard models of nucleotide, amino-acid, and codon sequence evolution. All model parameters are fully customizable. Users can additionally specify custom evolutionary models, with custom rate matrices and/or states to evolve. This flexibility makes Pyvolve a convenient framework not only for simulating sequences under a wide variety of conditions, but also for developing and testing new evolutionary models. Pyvolve is an open-source project under a FreeBSD license, and it is available for download, along with a detailed user-manual and example scripts, from http://github.com/sjspielman/pyvolve.
Affiliation(s)
- Stephanie J. Spielman
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute of Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, United States of America
- Claus O. Wilke
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute of Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, United States of America
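To make the phrase "continuous-time Markov models of sequence evolution" in the Pyvolve abstract concrete, here is a minimal sketch of the simplest such model, JC69, using its closed-form transition probabilities. It deliberately does not use or imitate Pyvolve's API; the helper names `jc69_probs` and `evolve_site` are hypothetical, and branch length `t` is assumed to be measured in expected substitutions per site.

```python
import math
import random

def jc69_probs(t):
    """Return (P(same state), P(one specific other state)) after time t.

    Under JC69 with t in expected substitutions per site:
        P_same = 1/4 + 3/4 * exp(-4t/3)
        P_diff = 1/4 - 1/4 * exp(-4t/3)
    """
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e

def evolve_site(state, t, rng):
    """Sample the nucleotide at the end of a branch of length t."""
    p_same, _ = jc69_probs(t)
    if rng.random() < p_same:
        return state
    # The three alternative nucleotides are equally likely under JC69.
    return rng.choice([n for n in "ACGT" if n != state])
```

A full simulator would apply `evolve_site` to every site on every branch, walking the tree from root to tips. Richer models replace the closed form with a matrix exponential of a user-specified rate matrix, which is the kind of customization the abstract refers to.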
14
Abstract
Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and then assessments were performed collectively after all the submissions were received. Three data sets were used: two were simulated and based on primate and mammalian phylogenies, and one comprised 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.
15
Whidden C, Zeh N, Beiko RG. Supertrees Based on the Subtree Prune-and-Regraft Distance. Syst Biol 2014; 63:566-81. [PMID: 24695589 PMCID: PMC4055872 DOI: 10.1093/sysbio/syu023] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Received: 05/10/2013] [Accepted: 03/18/2014] [Indexed: 11/14/2022]
Abstract
Supertree methods reconcile a set of phylogenetic trees into a single structure that is often interpreted as a branching history of species. A key challenge is combining conflicting evolutionary histories that are due to artifacts of phylogenetic reconstruction and phenomena such as lateral gene transfer (LGT). Many supertree approaches use optimality criteria that do not reflect underlying processes, have known biases, and may be unduly influenced by LGT. We present the first method to construct supertrees by using the subtree prune-and-regraft (SPR) distance as an optimality criterion. Although calculating the rooted SPR distance between a pair of trees is NP-hard, our new maximum agreement forest-based methods can reconcile trees with hundreds of taxa and >50 transfers in fractions of a second, which enables repeated calculations during the course of an iterative search. Our approach can accommodate trees in which uncertain relationships have been collapsed to multifurcating nodes. Using a series of benchmark datasets simulated under plausible rates of LGT, we show that SPR supertrees are more similar to correct species histories than supertrees based on parsimony or Robinson-Foulds distance criteria. We successfully constructed an SPR supertree from a phylogenomic dataset of 40,631 gene trees that covered 244 genomes representing several major bacterial phyla. Our SPR-based approach also allowed direct inference of highways of gene transfer between bacterial classes and genera. A small number of these highways connect genera in different phyla and can highlight specific genes implicated in long-distance LGT. [Lateral gene transfer; matrix representation with parsimony; phylogenomics; prokaryotic phylogeny; Robinson-Foulds; subtree prune-and-regraft; supertrees.]
Affiliation(s)
- Christopher Whidden
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, PO Box 15000, Halifax, Nova Scotia, Canada B3H 4R2
| | - Norbert Zeh
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, PO Box 15000, Halifax, Nova Scotia, Canada B3H 4R2
| | - Robert G Beiko
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, PO Box 15000, Halifax, Nova Scotia, Canada B3H 4R2
|
16
|
Arenas M, Posada D. Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol Biol Evol 2014; 31:1295-301. [PMID: 24557445 PMCID: PMC3995339 DOI: 10.1093/molbev/msu078] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Indexed: 01/28/2023] Open
Abstract
Genomic evolution can be highly heterogeneous. Here, we introduce a new framework to simulate genome-wide sequence evolution under a variety of substitution models that may change along the genome and the phylogeny, following complex multispecies coalescent histories that can include recombination, demographics, longitudinal sampling, population subdivision/species history, and migration. A key aspect of our simulation strategy is that the heterogeneity of the whole evolutionary process can be parameterized according to statistical prior distributions specified by the user. We used this framework to carry out a study of the impact of variable codon frequencies across genomic regions on the estimation of the genome-wide nonsynonymous/synonymous ratio. We found that both variable codon frequencies across genes and rate variation among sites and regions can lead to severe underestimation of the global dN/dS values. The program SGWE—Simulation of Genome-Wide Evolution—is freely available from http://code.google.com/p/sgwe-project/, including extensive documentation and detailed examples.
Affiliation(s)
- Miguel Arenas
- Centre for Molecular Biology "Severo Ochoa," Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
|
17
|
Lapierre P, Lasek-Nesselquist E, Gogarten JP. The impact of HGT on phylogenomic reconstruction methods. Brief Bioinform 2012; 15:79-90. [DOI: 10.1093/bib/bbs050] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 01/13/2023] Open
|
18
|
Affiliation(s)
- Miguel Arenas
- Computational and Molecular Population Genetics Lab-CMPG, Institute of Ecology and Evolution, University of Bern, Bern, Switzerland.
|
19
|
Bhardwaj G, Ko KD, Hong Y, Zhang Z, Ho NL, Chintapalli SV, Kline LA, Gotlin M, Hartranft DN, Patterson ME, Dave F, Smith EJ, Holmes EC, Patterson RL, van Rossum DB. PHYRN: a robust method for phylogenetic analysis of highly divergent sequences. PLoS One 2012; 7:e34261. [PMID: 22514627 PMCID: PMC3325999 DOI: 10.1371/journal.pone.0034261] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 11/30/2011] [Accepted: 02/24/2012] [Indexed: 11/19/2022] Open
Abstract
Both multiple sequence alignment and phylogenetic analysis are problematic in the "twilight zone" of sequence similarity (≤ 25% amino acid identity). Herein we explore the accuracy of phylogenetic inference at extreme sequence divergence using a variety of simulated data sets. We evaluate four leading multiple sequence alignment (MSA) methods (MAFFT, T-COFFEE, CLUSTAL, and MUSCLE) and six commonly used programs of tree estimation (Distance-based: Neighbor-Joining; Character-based: PhyML, RAxML, GARLI, Maximum Parsimony, and Bayesian) against a novel MSA-independent method (PHYRN) described here. Strikingly, at "midnight zone" genetic distances (~7% pairwise identity and 4.0 gaps per position), PHYRN returns high-resolution phylogenies that outperform traditional approaches. We reason this is due to PHYRN's capability to amplify informative positions, even at the most extreme levels of sequence divergence. We also assess the applicability of the PHYRN algorithm for inferring deep evolutionary relationships in the divergent DANGER protein superfamily, for which PHYRN infers a more robust tree compared to MSA-based approaches. Taken together, these results demonstrate that PHYRN represents a powerful mechanism for mapping uncharted frontiers in highly divergent protein sequence data sets.
Affiliation(s)
- Gaurav Bhardwaj
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biochemistry and Molecular Medicine, School of Medicine, University of California Davis, Davis, California, United States of America
- Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America
| | - Kyung Dae Ko
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Yoojin Hong
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Zhenhai Zhang
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Ngai Lam Ho
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Sree V. Chintapalli
- Department of Physiology and Membrane Biology, School of Medicine, University of California Davis, Davis, California, United States of America
- Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America
| | - Lindsay A. Kline
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Matthew Gotlin
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - David Nicholas Hartranft
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Morgen E. Patterson
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Foram Dave
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Evan J. Smith
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Edward C. Holmes
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Randen L. Patterson
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biochemistry and Molecular Medicine, School of Medicine, University of California Davis, Davis, California, United States of America
- Department of Physiology and Membrane Biology, School of Medicine, University of California Davis, Davis, California, United States of America
- Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America
| | - Damian B. van Rossum
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Center for Translational Bioscience and Computing, University of California Davis, Davis, California, United States of America
|
20
|
Dalquen DA, Anisimova M, Gonnet GH, Dessimoz C. ALF--a simulation framework for genome evolution. Mol Biol Evol 2011; 29:1115-23. [PMID: 22160766 PMCID: PMC3341827 DOI: 10.1093/molbev/msr268] [Citation(s) in RCA: 111] [Impact Index Per Article: 8.5] [Indexed: 01/21/2023] Open
Abstract
In computational evolutionary biology, verification and benchmarking is a challenging task because the evolutionary history of studied biological entities is usually not known. Computer programs for simulating sequence evolution in silico have shown to be viable test beds for the verification of newly developed methods and to compare different algorithms. However, current simulation packages tend to focus either on gene-level aspects of genome evolution such as character substitutions and insertions and deletions (indels) or on genome-level aspects such as genome rearrangement and speciation events. Here, we introduce Artificial Life Framework (ALF), which aims at simulating the entire range of evolutionary forces that act on genomes: nucleotide, codon, or amino acid substitution (under simple or mixture models), indels, GC-content amelioration, gene duplication, gene loss, gene fusion, gene fission, genome rearrangement, lateral gene transfer (LGT), or speciation. The other distinctive feature of ALF is its user-friendly yet powerful web interface. We illustrate the utility of ALF with two possible applications: 1) we reanalyze data from a study of selection after globin gene duplication and test the statistical significance of the original conclusions and 2) we demonstrate that LGT can dramatically decrease the accuracy of two well-established orthology inference methods. ALF is available as a stand-alone application or via a web interface at http://www.cbrg.ethz.ch/alf.
Affiliation(s)
- Daniel A Dalquen
- Computational Biochemistry Research Group, Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, Switzerland.
|
21
|
Holloway C, Beiko RG. Assembling networks of microbial genomes using linear programming. BMC Evol Biol 2010; 10:360. [PMID: 21092133 PMCID: PMC3224671 DOI: 10.1186/1471-2148-10-360] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/19/2010] [Accepted: 11/20/2010] [Indexed: 01/04/2023] Open
Abstract
Background: Microbial genomes exhibit complex sets of genetic affinities due to lateral genetic transfer. Assessing the relative contributions of parent-to-offspring inheritance and gene sharing is a vital step in understanding the evolutionary origins and modern-day function of an organism, but recovering and showing these relationships is a challenging problem. Results: We have developed a new approach that uses linear programming to find between-genome relationships, by treating tables of genetic affinities (here, represented by transformed BLAST e-values) as an optimization problem. Validation trials on simulated data demonstrate the effectiveness of the approach in recovering and representing vertical and lateral relationships among genomes. Application of the technique to a set comprising Aquifex aeolicus and 75 other thermophiles showed an important role for large genomes as 'hubs' in the gene sharing network, and suggested that genes are preferentially shared between organisms with similar optimal growth temperatures. We were also able to discover distinct and common genetic contributors to each sequenced representative of genus Pseudomonas. Conclusions: The linear programming approach we have developed can serve as an effective inference tool in its own right, and can be an efficient first step in a more-intensive phylogenomic analysis.
Affiliation(s)
- Catherine Holloway
- Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, Nova Scotia B3H 1W5, Canada
|
22
|
Barker MS, Dlugosch KM, Dinh L, Challa RS, Kane NC, King MG, Rieseberg LH. EvoPipes.net: Bioinformatic Tools for Ecological and Evolutionary Genomics. Evol Bioinform Online 2010; 6:143-9. [PMID: 21079755 PMCID: PMC2978936 DOI: 10.4137/ebo.s5861] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.1] [Indexed: 11/18/2022] Open
Abstract
Recent increases in the production of genomic data are yielding new opportunities and challenges for biologists. Among the chief problems posed by next-generation sequencing are assembly and analyses of these large data sets. Here we present an online server, http://EvoPipes.net, that provides access to a wide range of tools for bioinformatic analyses of genomic data oriented for ecological and evolutionary biologists. The EvoPipes.net server includes a basic tool kit for analyses of genomic data including a next-generation sequence cleaning pipeline (SnoWhite), scaffolded assembly software (SCARF), a reciprocal best-blast hit ortholog pipeline (RBH Orthologs), a pipeline for reference protein-based translation and identification of reading frame in transcriptome and genomic DNA (TransPipe), a pipeline to identify gene families and summarize the history of gene duplications (DupPipe), and a tool for developing SSRs or microsatellites from a transcriptome or genomic coding sequence collection (findSSR). EvoPipes.net also provides links to other software developed for evolutionary and ecological genomics, including chromEvol and NU-IN, as well as a forum for discussions of issues relating to genomic analyses and interpretation of results. Overall, these applications provide a basic bioinformatic tool kit that will enable ecologists and evolutionary biologists with relatively little experience and computational resources to take advantage of the opportunities provided by next-generation sequencing in their systems.
Affiliation(s)
- Michael S Barker
- The Biodiversity Research Centre and Department of Botany, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
|
23
|
Dlugosch KM, Barker MS, Rieseberg LH. NU-IN: Nucleotide evolution and input module for the EvolSimulator genome simulation platform. BMC Res Notes 2010; 3:217. [PMID: 20678216 PMCID: PMC3161368 DOI: 10.1186/1756-0500-3-217] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/24/2010] [Accepted: 08/02/2010] [Indexed: 11/21/2022] Open
Abstract
Background: There is increasing demand to test hypotheses that contrast the evolution of genes and gene families among genomes, using simulations that work across these levels of organization. The EvolSimulator program was developed recently to provide a highly flexible platform for forward simulations of amino acid evolution in multiple related lineages of haploid genomes, permitting copy number variation and lateral gene transfer. Synonymous nucleotide evolution is not currently supported, however, and would be highly advantageous for comparisons to full genome, transcriptome, and single nucleotide polymorphism (SNP) datasets. In addition, EvolSimulator creates new genomes for each simulation, and does not allow the input of user-specified sequences and gene family information, limiting the incorporation of further biological realism and/or user manipulations of the data. Findings: We present modified C++ source code for the EvolSimulator platform, which we provide as the extension module NU-IN. With NU-IN, synonymous and non-synonymous nucleotide evolution is fully implemented, and the user has the ability to use real or previously-simulated sequence data to initiate a simulation of one or more lineages. Gene family membership can be optionally specified, as well as gene retention probabilities that model biased gene retention. We provide PERL scripts to assist the user in deriving this information from previous simulations. We demonstrate the features of NU-IN by simulating genome duplication (polyploidy) in the presence of ongoing copy number variation in an evolving lineage. This example is initiated with real genomic data, and produces output that we analyse directly with existing bioinformatic pipelines. Conclusions: The NU-IN extension module is a publicly available open source software (GNU GPLv3 license) extension to EvolSimulator. With the NU-IN module, users are now able to simulate both drift and selection at the nucleotide, amino acid, copy number, and gene family levels across sets of related genomes, for user-specified starting sequences and associated parameters. These features can be used to generate simulated genomic datasets under an extremely broad array of conditions, and with a high degree of biological realism.
Affiliation(s)
- Katrina M Dlugosch
- Department of Botany, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
|
24
|
Carvajal-Rodríguez A. Simulation of genes and genomes forward in time. Curr Genomics 2010; 11:58-61. [PMID: 20808525 PMCID: PMC2851118 DOI: 10.2174/138920210790218007] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.3] [Received: 06/21/2009] [Revised: 08/06/2009] [Accepted: 08/11/2009] [Indexed: 11/22/2022] Open
Abstract
The importance of simulation software in current and future evolutionary and genomic studies is just confirmed by the recent publication of several new simulation tools. The forward-in-time simulation strategy has, therefore, re-emerged as a complement of coalescent simulation. Additionally, more efficient coalescent algorithms, the same as new ideas about the combined use of backward and forward strategies have recently appeared. In the present work, a previous review is updated to include some new forward simulation tools. When simulating at the genome-scale the conflict between efficiency (i.e. execution speed and memory usage) and flexibility (i.e. complex modeling capabilities) emerges. This is the pivot around which simulation of evolutionary processes should improve. In addition, some effort should be made to consider the process of developing simulation tools from the point of view of the software engineering theory. Finally, some new ideas and technologies as general purpose graphic processing units are commented.
|
25
|
Jermiin LS, Ho JWK, Lau KW, Jayaswal V. SeqVis: a tool for detecting compositional heterogeneity among aligned nucleotide sequences. Methods Mol Biol 2009; 537:65-91. [PMID: 19378140 DOI: 10.1007/978-1-59745-251-9_4] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 02/19/2023]
Abstract
Compositional heterogeneity is a poorly appreciated attribute of aligned nucleotide and amino acid sequences. It is a common property of molecular phylogenetic data, and it has been found to occur across sequences and/or across sites. Most molecular phylogenetic methods assume that the sequences have evolved under globally stationary, reversible, and homogeneous conditions, implying that the sequences should be compositionally homogeneous. The presence of the above-mentioned compositional heterogeneity implies that the sequences must have evolved under more general conditions than is commonly assumed. Consequently, there is a need for reliable methods to detect under what conditions alignments of nucleotides or amino acids may have evolved. In this chapter, we describe one such program. SeqVis is designed to survey aligned nucleotide sequences. We discuss pros-et-cons of this program in the context of other methods to detect compositional heterogeneity and violated phylogenetic assumptions. The benefits provided by SeqVis are demonstrated in two studies of alignments of nucleotides, one of which contained 7542 nucleotides from 53 species.
Affiliation(s)
- Lars Sommer Jermiin
- School of Biological Sciences, Centre for Mathematical Biology and Sydney Bioinformatics, University of Sydney, Sydney, Australia
|
26
|
Abstract
This chapter discusses the pros and cons of the existing computational methods for the detection of horizontal (or lateral) gene transfer and highlights the genome-wide studies utilizing these methods. The impact of horizontal gene transfer (HGT) on prokaryote genome evolution is discussed.
|
27
|
Beiko RG, Ragan MA. Untangling hybrid phylogenetic signals: horizontal gene transfer and artifacts of phylogenetic reconstruction. Methods Mol Biol 2009; 532:241-256. [PMID: 19271189 DOI: 10.1007/978-1-60327-853-9_14] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 05/27/2023]
Abstract
Phylogenomic methods can be used to investigate the tangled evolutionary relationships among genomes. Building 'all the trees of all the genes' can potentially identify common pathways of horizontal gene transfer (HGT) among taxa at varying levels of phylogenetic depth. Phylogenetic affinities can be aggregated and merged with the information about genetic linkage and biochemical function to examine hypotheses of adaptive evolution via HGT. Additionally, the use of many genetic data sets increases the power of statistical tests for phylogenetic artifacts. However, large-scale phylogenetic analyses pose several challenges, including the necessary abandonment of manual validation techniques, the need to translate inferred phylogenetic discordance into inferred HGT events, and the challenges involved in aggregating results from search-based inference methods. In this chapter we describe a tree search procedure to recover the most parsimonious pathways of HGT, and examine some of the assumptions that are made by this method.
Affiliation(s)
- Robert G Beiko
- Department of Computer Science, Dalhousie University, Halifax, NS, Canada
|
28
|
Abstract
The subject of this chapter is to describe the methodology for assessing the power of phylogenetic HGT detection methods. Detection power is defined in the framework of hypothesis testing. Rates of false positives and false negatives can be estimated by testing HGT detection methods on HGT-free orthologous sets, and on the same sets with in silico simulated HGT events. The whole process can be divided into three steps: obtaining HGT-free orthologous sets, in silico simulation of HGT events in the same set, and submitting both sets for evaluation by any of the tested methods. Phylogenetic methods of HGT detection can be roughly divided into three types: likelihood-based tests of topologies (Kishino-Hasegawa (KH), Shimodaira-Hasegawa (SH), and Approximately Unbiased (AU) tests), tree distance methods (symmetrical difference of Robinson and Foulds (RF), and Subtree Pruning and Regrafting (SPR) distances), and genome spectral approaches (bipartition and quartet decomposition analysis). Restrictions that are inherent to phylogenetic methods of HGT detection in general and the power and precision of each method are discussed and comparative analyses of different approaches are provided, as well as some examples of assessing the power of phylogenetic HGT detection methods from a case study of orthologous sets from gamma-proteobacteria (Poptsova and Gogarten, BMC Evol Biol 7, 45, 2007) and cyanobacteria (Zhaxybayeva et al., Genome Res 16, 1099-108, 2006).
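The error-rate estimation described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the chapter's procedure: the function name `error_rates` and the boolean call lists are hypothetical, and each list is assumed to hold one detection call per orthologous set (True meaning the method reported HGT).

```python
def error_rates(calls_on_hgt_free, calls_on_simulated_hgt):
    """Estimate error rates of an HGT detection method (illustrative sketch).

    calls_on_hgt_free: calls made on HGT-free orthologous sets.
    calls_on_simulated_hgt: calls made on the same sets after in silico HGT.
    Both are non-empty sequences of booleans (True = HGT reported).
    """
    # Any HGT reported on an HGT-free set is a false positive.
    false_positive_rate = sum(calls_on_hgt_free) / len(calls_on_hgt_free)
    # Any simulated HGT that goes unreported is a false negative.
    missed = sum(not call for call in calls_on_simulated_hgt)
    false_negative_rate = missed / len(calls_on_simulated_hgt)
    return {
        "false_positive_rate": false_positive_rate,
        "false_negative_rate": false_negative_rate,
        "power": 1 - false_negative_rate,
    }


# Hypothetical calls for four orthologous sets.
rates = error_rates(
    calls_on_hgt_free=[False, False, True, False],
    calls_on_simulated_hgt=[True, True, False, True],
)
print(rates)
```

Power is taken here as one minus the false negative rate, which is the usual hypothesis-testing definition; the chapter may refine it further.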
Affiliation(s)
- Maria Poptsova
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
|
29
|
Beiko RG, Doolittle WF, Charlebois RL. The Impact of Reticulate Evolution on Genome Phylogeny. Syst Biol 2008; 57:844-56. [DOI: 10.1080/10635150802559265] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.4] [Indexed: 10/21/2022] Open
Affiliation(s)
- Robert G. Beiko
- Faculty of Computer Science, Dalhousie University, and Institute for Molecular Bioscience/ARC Centre for Bioinformatics, Brisbane, Australia
| | - W. Ford Doolittle
- Genome Atlantic, Department of Biochemistry & Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
| | - Robert L. Charlebois
- Genome Atlantic, Department of Biochemistry & Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
|
30
|
Abstract
When the stop codons TGA, TAA, and TAG are found in the second and third reading frames of a protein-encoding gene, they are considered premature stop codons (PSC). Deinococcus radiodurans disproportionately favored TGA more than the other two triplets as a PSC. The TGA triplet was also found more often in noncoding regions and as a stop codon, though the bias was less pronounced. We investigated this phenomenon in 72 bacterial species with widely differing chromosomal GC contents. Although TGA and TAG were compositionally similar, we found a great variation in use of TGA but a very limited range of use of TAG. The frequency of use of TGA in the gene sequences generally increased with the GC content of the chromosome, while the frequency of use of TAG, like that of TAA, was inversely proportional to the GC content of the chromosome. The patterns of use of TAA, TGA and TAG as real stop codons were less biased and less influenced by the GC content of the chromosome. Bacteria with higher chromosomal GC contents often contained fewer PSC trimers in their genes. Phylogenetically related bacteria often exhibited similar PSC ratios. In addition, metabolically versatile bacteria have significantly fewer PSC trimers in their genes. The bias toward TGA but against TAG as a PSC could not be explained either by the preferential usage of specific codons or by the GC contents of individual chromosomes. We proposed that the quantity and the quality of the PSC in the genome might be important in bacterial evolution.
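The definition at the start of this abstract can be illustrated with a short sketch that counts TGA, TAA, and TAG triplets in the second and third reading frames of a coding sequence. The function name `count_off_frame_stops` and the toy sequence are hypothetical, only the forward strand is considered, and this is not the authors' analysis pipeline.

```python
from collections import Counter

STOPS = ("TAA", "TAG", "TGA")


def count_off_frame_stops(cds: str) -> dict:
    """Count stop triplets in reading frames 2 and 3 of a coding sequence.

    Frame 1 (offset 0) is the real reading frame and is skipped, so the
    gene's own terminal stop codon is not counted.
    """
    seq = cds.upper()
    counts = Counter()
    for offset in (1, 2):  # offsets 1 and 2 are the second and third frames
        # Step through complete triplets only.
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                counts[codon] += 1
    return {stop: counts[stop] for stop in STOPS}


# Toy gene: ATG ACT GAT AGC TAA in frame 1.
toy_gene = "ATGACTGATAGCTAA"
print(count_off_frame_stops(toy_gene))
```

Dividing each count by the number of triplets scanned would give frequencies that can be compared across genomes with different GC contents.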
|