1
|
Vieira Mourato B, Tsers I, Denker S, Klötzl F, Haubold B. Marker discovery in the large. BIOINFORMATICS ADVANCES 2024; 4:vbae113. [PMID: 39132289 PMCID: PMC11310107 DOI: 10.1093/bioadv/vbae113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/16/2024] [Revised: 07/06/2024] [Accepted: 07/26/2024] [Indexed: 08/13/2024]
Abstract
Motivation Markers for diagnostic polymerase chain reactions are routinely constructed by taking regions common to the genomes of a target organism and subtracting the regions found in the targets' closest relatives, their neighbors. This approach is implemented in the published package Fur, which originally required memory proportional to the number of nucleotides in the neighborhood. This does not scale well. Results Here, we describe a new version of Fur that only requires memory proportional to the longest neighbor. In spite of its greater memory efficiency, the new Fur remains fast and is accurate. We demonstrate this by applying it to simulated sequences and comparing it to an efficient alternative. Then we use the new Fur to extract markers from 120 reference bacteria. To make this feasible, we also introduce software for automatically finding target and neighbor genomes and for assessing markers. We pick the best primers from the 10 most sequenced reference bacteria and show their excellent in silico sensitivity and specificity. Availability and implementation Fur is available from github.com/evolbioinf/fur, in the Docker image hub.docker.com/r/beatrizvm/mapro, and in the Code Ocean capsule 10.24433/CO.7955947.v1.
Collapse
Affiliation(s)
- Beatriz Vieira Mourato
- Research Group Bioinformatics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Schleswig-Holstein, Germany
| | - Ivan Tsers
- Research Group Bioinformatics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Schleswig-Holstein, Germany
| | - Svenja Denker
- Research Group Bioinformatics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Schleswig-Holstein, Germany
- Universität zu Lübeck, Lübeck, Schleswig-Holstein, Germany
| | | | - Bernhard Haubold
- Research Group Bioinformatics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Schleswig-Holstein, Germany
| |
Collapse
|
2
|
Prusokiene A, Boonham N, Fox A, Howard TP. Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent. PLoS One 2024; 19:e0298834. [PMID: 38512939 PMCID: PMC10956839 DOI: 10.1371/journal.pone.0298834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2023] [Accepted: 01/30/2024] [Indexed: 03/23/2024] Open
Abstract
Current tools for estimating the substitution distance between two related sequences struggle to remain accurate at a high divergence. Difficulties at distant homologies, such as false seeding and over-alignment, create a high barrier for the development of a stable estimator. This is especially true for viral genomes, which carry a high rate of mutation, small size, and sparse taxonomy. Developing an accurate substitution distance measure would help to elucidate the relationship between highly divergent sequences, interrogate their evolutionary history, and better facilitate the discovery of new viral genomes. To tackle these problems, we propose an approach that uses short-read mappers to create whole-genome maps, and gradient descent to isolate the homologous fraction and calculate the final distance value. We implement this approach as Mottle. With the use of simulated and biological sequences, Mottle was able to remain stable to 0.66-0.96 substitutions per base pair and identify viral outgroup genomes with 95% accuracy at the family-order level. Our results indicate that Mottle performs as well as existing programs in identifying taxonomic relationships, with more accurate numerical estimation of genomic distance over greater divergences. By contrast, one limitation is a reduced numerical accuracy at low divergences, and on genomes where insertions and deletions are uncommon, when compared to alternative approaches. We propose that Mottle may therefore be of particular interest in the study of viruses, viral relationships, and notably for viral discovery platforms, helping in benchmarking of homology search tools and defining the limits of taxonomic classification methods. The code for Mottle is available at https://github.com/tphoward/Mottle_Repo.
Collapse
Affiliation(s)
- Alisa Prusokiene
- Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
| | - Neil Boonham
- Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
| | - Adrian Fox
- Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
- Fera Ltd., Biotech Campus, York, United Kingdom
| | - Thomas P. Howard
- Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
| |
Collapse
|
3
|
Anjum N, Nabil RL, Rafi RI, Bayzid MS, Rahman MS. CD-MAWS: An Alignment-Free Phylogeny Estimation Method Using Cosine Distance on Minimal Absent Word Sets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:196-205. [PMID: 34928803 DOI: 10.1109/tcbb.2021.3136792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Multiple sequence alignment has been the traditional and well established approach of sequence analysis and comparison, though it is time and memory consuming. As the scale of sequencing data is increasing day by day, the importance of faster yet accurate alignment-free methods is on the rise. Several alignment-free sequence analysis methods have been established in the literature in recent years, which extract numerical features from genomic data to analyze sequences and also to estimate phylogenetic relationship among genes and species. Minimal Absent Word (MAW) is an effective concept for representing characteristics of a sequence in an alignment-free manner. In this study, we present CD-MAWS, a distance measure based on cosine of the angle between composition vectors constructed using minimal absent words, for sequence analysis in a computationally inexpensive manner. We have benchmarked CD-MAWS using several AFProject datasets, such as Fish mtDNA, E.coli, Plants, Shigella and Yersinia datasets, and found it to perform quite well. Applied on several other biological datasets such as mammal mtDNA, bacterial genomes and viral genomes, CD-MAWS resolved phylogenetic relationships similar to or better than state-of-the-art alignment-free methods such as Mash, Skmer, Co-phylog and kSNP3.
Collapse
|
4
|
Genome Sequence of the Diploid Yeast Debaryomyces hansenii TMW 3.1188. Microbiol Resour Announc 2022; 11:e0064922. [PMID: 36287019 PMCID: PMC9670972 DOI: 10.1128/mra.00649-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open
Abstract
Debaryomyces hansenii TMW 3.1188 is a halotolerant diploid yeast that was isolated from lupine moromi fermentation. Here, we report on the 24.77-Mbp genome of a diploid strain of the species D. hansenii.
Collapse
|
5
|
Chen J, Yang L, Li L, Goodison S, Sun Y. Alignment-free comparison of metagenomics sequences via approximate string matching. BIOINFORMATICS ADVANCES 2022; 2:vbac077. [PMID: 36388153 PMCID: PMC9645238 DOI: 10.1093/bioadv/vbac077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/04/2022] [Revised: 09/16/2022] [Accepted: 10/19/2022] [Indexed: 11/11/2022]
Abstract
Summary Quantifying pairwise sequence similarities is a key step in metagenomics studies. Alignment-free methods provide a computationally efficient alternative to alignment-based methods for large-scale sequence analysis. Several neural network-based methods have recently been developed for this purpose. However, existing methods do not perform well on sequences of varying lengths and are sensitive to the presence of insertions and deletions. In this article, we describe the development of a new method, referred to as AsMac that addresses the aforementioned issues. We proposed a novel neural network structure for approximate string matching for the extraction of pertinent information from biological sequences and developed an efficient gradient computation algorithm for training the constructed neural network. We performed a large-scale benchmark study using real-world data that demonstrated the effectiveness and potential utility of the proposed method. Availability and implementation The open-source software for the proposed method and trained neural-network models for some commonly used metagenomics marker genes were developed and are freely available at www.acsu.buffalo.edu/~yijunsun/lab/AsMac.html. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | - Lu Li
- Department of Oral Biology, University at Buffalo, Buffalo, NY 14215, USA
| | - Steve Goodison
- Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL 32224, USA
| | - Yijun Sun
- To whom correspondence should be addressed.
| |
Collapse
|
6
|
Cunial F, Denas O, Belazzougui D. Fast and compact matching statistics analytics. Bioinformatics 2022; 38:1838-1845. [PMID: 35134833 PMCID: PMC9665870 DOI: 10.1093/bioinformatics/btac064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2021] [Revised: 01/08/2022] [Accepted: 01/31/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences. RESULTS We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state-of-the-art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. And we provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics. AVAILABILITY AND IMPLEMENTATION Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0. The data underlying this article are available in NCBI Genome at https://www.ncbi.nlm.nih.gov/genome and in the International Genome Sample Resource (IGSR) at https://www.internationalgenome.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Fabio Cunial
- Max Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG and CSBD), Dresden 01307, Germany,To whom correspondence should be addressed.
| | | | - Djamal Belazzougui
- CAPA, DTISI, Centre de Recherche sur l’Information Scientifique et Techique, Algiers, Algeria
| |
Collapse
|
7
|
Blanca A, Harris RS, Koslicki D, Medvedev P. The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches. J Comput Biol 2022; 29:155-168. [PMID: 35108101 DOI: 10.1089/cmb.2021.0431] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open
Abstract
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
Collapse
Affiliation(s)
- Antonio Blanca
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - Robert S Harris
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - David Koslicki
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.,Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.,Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.,Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA.,Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, USA
| |
Collapse
|
8
|
Sequence Comparison Without Alignment: The SpaM Approaches. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2021; 2231:121-134. [PMID: 33289890 DOI: 10.1007/978-1-0716-1036-7_8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
Abstract
Sequence alignment is at the heart of DNA and protein sequence analysis. For the data volumes that are nowadays produced by massively parallel sequencing technologies, however, pairwise and multiple alignment methods are often too slow. Therefore, fast alignment-free approaches to sequence comparison have become popular in recent years. Most of these approaches are based on word frequencies, for words of a fixed length, or on word-matching statistics. Other approaches are using the length of maximal word matches. While these methods are very fast, most of them rely on ad hoc measures of sequences similarity or dissimilarity that are hard to interpret. In this chapter, I describe a number of alignment-free methods that we developed in recent years. Our approaches are based on spaced-word matches ("SpaM"), i.e. on inexact word matches, that are allowed to contain mismatches at certain pre-defined positions. Unlike most previous alignment-free approaches, our approaches are able to accurately estimate phylogenetic distances between DNA or protein sequences using a stochastic model of molecular evolution.
Collapse
|
9
|
Girgis HZ, James BT, Luczak BB. Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models. NAR Genom Bioinform 2021; 3:lqab001. [PMID: 33554117 PMCID: PMC7850047 DOI: 10.1093/nargab/lqab001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2019] [Revised: 12/07/2020] [Accepted: 01/08/2021] [Indexed: 11/12/2022] Open
Abstract
Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms are quadratic-slow especially on long sequences. In many applications that involve large sequence datasets, all what is needed is calculating the identity scores (percentage of identical nucleotides in an optimal alignment-including gaps-of two sequences); there is no need for visualizing how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being faster than BLAST, Mash, MUMmer4 and USEARCH by 2-80 times. Identity was the best performing tool when searching for low-identity matches. While constructing phylogenetic trees from about 6000 transcripts, the tree due to the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity is capable of producing pairwise identity scores of millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
Collapse
Affiliation(s)
- Hani Z Girgis
- Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, 700 University Boulevard, Kingsville, TX 78363, USA
| | - Benjamin T James
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA
| | - Brian B Luczak
- Department of Mathematics, Vanderbilt University, 1326 Stevenson Center Lane, Nashville, TN 3721, USA
| |
Collapse
|
10
|
Criscuolo A. On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference. F1000Res 2020; 9:1309. [PMID: 33335719 PMCID: PMC7713896 DOI: 10.12688/f1000research.26930.1] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 10/12/2020] [Indexed: 12/29/2022] Open
Abstract
Recently developed MinHash-based techniques were proven successful in quickly estimating the level of similarity between large nucleotide sequences. This article discusses their usage and limitations in practice to approximating uncorrected distances between genomes, and transforming these pairwise dissimilarities into proper evolutionary distances. It is notably shown that complex distance measures can be easily approximated using simple transformation formulae based on few parameters. MinHash-based techniques can therefore be very useful for implementing fast yet accurate alignment-free phylogenetic reconstruction procedures from large sets of genomes. This last point of view is assessed with a simulation study using a dedicated bioinformatics tool.
Collapse
Affiliation(s)
- Alexis Criscuolo
- Hub de Bioinformatique et Biostatistique - Département Biologie Computationnelle, Institut Pasteur, USR 3756, CNRS, 75015 Paris, France
| |
Collapse
|
11
|
David-Palma M, Libkind D, Brito PH, Silva M, Bellora N, Coelho MA, Heitman J, Gonçalves P, Sampaio JP. The Untapped Australasian Diversity of Astaxanthin-Producing Yeasts with Biotechnological Potential- Phaffia australis sp. nov. and Phaffia tasmanica sp. nov. Microorganisms 2020; 8:E1651. [PMID: 33114402 PMCID: PMC7692969 DOI: 10.3390/microorganisms8111651] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2020] [Revised: 10/18/2020] [Accepted: 10/21/2020] [Indexed: 01/28/2023] Open
Abstract
Phaffia is an orange-colored basidiomycetous yeast genus of the order Cystofilobasidiales that contains a single species, P. rhodozyma. This species is the only fungus known to produce the economically relevant carotenoid astaxanthin. Although Phaffia was originally found in the Northern hemisphere, its diversity in the southern part of the globe has been shown to be much greater. Here we analyze the genomes of two Australasian lineages that are markedly distinct from P. rhodozyma. The two divergent lineages were investigated within a comprehensive phylogenomic study of representatives of the Cystofilobasidiales that supported the recognition of two novel Phaffia species, for which we propose the names of P. australis sp. nov. and P. tasmanica sp. nov. Comparative genomics and other analyses confirmed that the two new species have the typical Phaffia hallmark-the six genes necessary for the biosynthesis of astaxanthin could be retrieved from the draft genome sequences, and this carotenoid was detected in culture extracts. In addition, the organization of the mating-type (MAT) loci is similar to that of P. rhodozyma, with synteny throughout most regions. Moreover, cases of trans-specific polymorphism involving pheromone receptor genes and pheromone precursor proteins in the three Phaffia species, together with their shared homothallism, provide additional support for their classification in a single genus.
Collapse
Affiliation(s)
- Márcia David-Palma
- UCIBIO, Departamento de Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal; (M.D.-P.); (P.H.B.); (M.S.); (M.A.C.); (P.G.)
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA;
| | - Diego Libkind
- Centro de Referencia en Levaduras y Tecnología Cervecera (CRELTEC), Instituto Andino Patagónico de Tecnologías Biológicas y Geoambientales (IPATEC)—CONICET/Universidad Nacional del Comahue, Bariloche, Rio Negro 8400, Argentina; (D.L.); (N.B.)
| | - Patrícia H. Brito
- UCIBIO, Departamento de Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal; (M.D.-P.); (P.H.B.); (M.S.); (M.A.C.); (P.G.)
| | - Margarida Silva
- UCIBIO, Departamento de Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal; (M.D.-P.); (P.H.B.); (M.S.); (M.A.C.); (P.G.)
| | - Nicolás Bellora
- Centro de Referencia en Levaduras y Tecnología Cervecera (CRELTEC), Instituto Andino Patagónico de Tecnologías Biológicas y Geoambientales (IPATEC)—CONICET/Universidad Nacional del Comahue, Bariloche, Rio Negro 8400, Argentina; (D.L.); (N.B.)
| | - Marco A. Coelho
- UCIBIO, Departamento de Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal; (M.D.-P.); (P.H.B.); (M.S.); (M.A.C.); (P.G.)
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA;
| | - Joseph Heitman
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA;
| | - Paula Gonçalves
- UCIBIO, Departamento de Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal; (M.D.-P.); (P.H.B.); (M.S.); (M.A.C.); (P.G.)
| | - José Paulo Sampaio
- UCIBIO, Departamento de Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal; (M.D.-P.); (P.H.B.); (M.S.); (M.A.C.); (P.G.)
| |
Collapse
|
12
|
Klötzl F, Haubold B. Phylonium: fast estimation of evolutionary distances from large samples of similar genomes. Bioinformatics 2020; 36:2040-2046. [PMID: 31790149 PMCID: PMC7141870 DOI: 10.1093/bioinformatics/btz903] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2019] [Revised: 11/01/2019] [Accepted: 11/28/2019] [Indexed: 11/13/2022] Open
Abstract
Motivation Tracking disease outbreaks by whole-genome sequencing leads to the collection of large samples of closely related sequences. Five years ago, we published a method to accurately compute all pairwise distances for such samples by indexing each sequence. Since indexing is slow, we now ask whether it is possible to achieve similar accuracy when indexing only a single sequence. Results We have implemented this idea in the program phylonium and show that it is as accurate as its predecessor and roughly 100 times faster when applied to all 2678 Escherichia coli genomes contained in ENSEMBL. One of the best published programs for rapidly computing pairwise distances, mash, analyzes the same dataset four times faster but, with default settings, it is less accurate than phylonium. Availability and implementation Phylonium runs under the UNIX command line; its C++ sources and documentation are available from github.com/evolbioinf/phylonium. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Fabian Klötzl
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
| | - Bernhard Haubold
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
| |
Collapse
|
13
|
Libkind D, Čadež N, Opulente DA, Langdon QK, Rosa CA, Sampaio JP, Gonçalves P, Hittinger CT, Lachance MA. Towards yeast taxogenomics: lessons from novel species descriptions based on complete genome sequences. FEMS Yeast Res 2020; 20:5876348. [DOI: 10.1093/femsyr/foaa042] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2020] [Accepted: 07/23/2020] [Indexed: 01/23/2023] Open
Abstract
ABSTRACT
In recent years, ‘multi-omic’ sciences have affected all aspects of fundamental and applied biological research. Yeast taxonomists, though somewhat timidly, have begun to incorporate complete genomic sequences into the description of novel taxa, taking advantage of these powerful data to calculate more reliable genetic distances, construct more robust phylogenies, correlate genotype with phenotype and even reveal cryptic sexual behaviors. However, the use of genomic data in formal yeast species descriptions is far from widespread. The present review examines published examples of genome-based species descriptions of yeasts, highlights relevant bioinformatic approaches, provides recommendations for new users and discusses some of the challenges facing the genome-based systematics of yeasts.
Collapse
Affiliation(s)
- D Libkind
- Centro de Referencia en Levaduras y Tecnología Cervecera (CRELTEC), Instituto Andino Patagónico de Tecnologías Biológicas y Geoambientales (IPATEC) – CONICET / Universidad Nacional del Comahue, Bariloche, Argentina
| | - N Čadež
- Biotechnical Faculty, University of Ljubljana, Jamnikarjeva 101, 1000 Ljubljana, Slovenia
| | - D A Opulente
- Laboratory of Genetics, Wisconsin Energy Institute, J. F. Crow Institute for the Study of Evolution, Center for Genomic Science Innovation, University of Wisconsin-Madison, Madison, WI, USA
- DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, USA
| | - Q K Langdon
- Laboratory of Genetics, Wisconsin Energy Institute, J. F. Crow Institute for the Study of Evolution, Center for Genomic Science Innovation, University of Wisconsin-Madison, Madison, WI, USA
| | - C A Rosa
- Departamento de Microbiologia, ICB, C.P. 486, Universidade Federal de Minas Gerais, Belo Horizonte, MG, 31270–901, Brazil
| | - J P Sampaio
- UCIBIO, Departamento de Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
| | - P Gonçalves
- UCIBIO, Departamento de Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
| | - C T Hittinger
- Laboratory of Genetics, Wisconsin Energy Institute, J. F. Crow Institute for the Study of Evolution, Center for Genomic Science Innovation, University of Wisconsin-Madison, Madison, WI, USA
- DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, USA
| | - M A Lachance
- Department of Biology, University of Western Ontario, London N6A 5B7, Ontario, Canada
| |
Collapse
|
14
|
Bohmann K, Mirarab S, Bafna V, Gilbert MTP. Beyond DNA barcoding: The unrealized potential of genome skim data in sample identification. Mol Ecol 2020; 29:2521-2534. [PMID: 32542933 PMCID: PMC7496323 DOI: 10.1111/mec.15507] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2020] [Revised: 06/03/2020] [Accepted: 06/05/2020] [Indexed: 02/06/2023]
Abstract
Genetic tools are increasingly used to identify and discriminate between species. One key transition in this process was the recognition of the potential of the ca 658bp fragment of the organelle cytochrome c oxidase I (COI) as a barcode region, which revolutionized animal bioidentification and lead, among others, to the instigation of the Barcode of Life Database (BOLD), containing currently barcodes from >7.9 million specimens. Following this discovery, suggestions for other organellar regions and markers, and the primers with which to amplify them, have been continuously proposed. Most recently, the field has taken the leap from PCR-based generation of DNA references into shotgun sequencing-based "genome skimming" alternatives, with the ultimate goal of assembling organellar reference genomes. Unfortunately, in genome skimming approaches, much of the nuclear genome (as much as 99% of the sequence data) is discarded, which is not only wasteful, but can also limit the power of discrimination at, or below, the species level. Here, we advocate that the full shotgun sequence data can be used to assign an identity (that we term for convenience its "DNA-mark") for both voucher and query samples, without requiring any computationally intensive pretreatment (e.g. assembly) of reads. We argue that if reference databases are populated with such "DNA-marks," it will enable future DNA-based taxonomic identification to complement, or even replace PCR of barcodes with genome skimming, and we discuss how such methodology ultimately could enable identification to population, or even individual, level.
Collapse
Affiliation(s)
- Kristine Bohmann
- Section for Evolutionary GenomicsThe GLOBE InstituteUniversity of CopenhagenCopenhagenDenmark
| | - Siavash Mirarab
- Department of Electrical and Computer EngineeringUniversity of CaliforniaSan DiegoCAUSA
| | - Vineet Bafna
- Department of Computer Science and EngineeringUniversity of CaliforniaSan DiegoCAUSA
| | - M. Thomas P. Gilbert
- Section for Evolutionary GenomicsThe GLOBE InstituteUniversity of CopenhagenCopenhagenDenmark
- Center for Evolutionary HologenomicsThe GLOBE InstituteUniversity of CopenhagenCopenhagenDenmark
- NTNU University MuseumTrondheimNorway
| |
Collapse
|
15
|
Pirogov A, Pfaffelhuber P, Börsch-Haubold A, Haubold B. High-complexity regions in mammalian genomes are enriched for developmental genes. Bioinformatics 2020; 35:1813-1819. [PMID: 30395202 PMCID: PMC6546125 DOI: 10.1093/bioinformatics/bty922] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2018] [Revised: 09/17/2018] [Accepted: 11/02/2018] [Indexed: 12/01/2022] Open
Abstract
Motivation Unique sequence regions are associated with genetic function in vertebrate genomes. However, measuring uniqueness, or absence of long repeats, along a genome is conceptually and computationally difficult. Here we use a variant of the Lempel-Ziv complexity, the match complexity, Cm, and augment it by deriving its null distribution for random sequences. We then apply Cm to the human and mouse genomes to investigate the relationship between sequence complexity and function. Results We implemented Cm in the program macle and show through simulation that the newly derived null distribution of Cm is accurate. This allows us to delineate high-complexity regions in the human and mouse genomes. Using our program macle2go, we find that these regions are twofold enriched for genes. Moreover, the genes contained in these regions are more than 10-fold enriched for developmental functions. Availability and implementation Source code for macle and macle2go is available from www.github.com/evolbioinf/macle and www.github.com/evolbioinf/macle2go, respectively; Cm browser tracks from guanine.evolbio.mgp.de/complexity. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Anton Pirogov
- Lehrstuhl für Informatik, RWTH Aachen University, Max-Planck-Institute for Evolutionary Biology, Plön, Germany.,Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
| | | | | | - Bernhard Haubold
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
| |
Collapse
|
16
|
Miller JB, McKinnon LM, Whiting MF, Kauwe JSK, Ridge PG. Codon Pairs are Phylogenetically Conserved: A comprehensive analysis of codon pairing conservation across the Tree of Life. PLoS One 2020; 15:e0232260. [PMID: 32401752 PMCID: PMC7219770 DOI: 10.1371/journal.pone.0232260] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2019] [Accepted: 04/10/2020] [Indexed: 11/27/2022] Open
Abstract
Identical codon pairing and co-tRNA codon pairing increase translational efficiency within genes when two codons that encode the same amino acid are translated by the same tRNA before it diffuses from the ribosome. We examine the phylogenetic signal in both identical and co-tRNA codon pairing across 23 428 species using alignment-free and parsimony methods. We determined that conserved codon pairing typically has a smaller window size than the length of a ribosome, and codon pairing tracks phylogenies across various taxonomic groups. We report a comprehensive analysis of codon pairing, including the extent to which each codon pairs. Our parsimony method generally recovers phylogenies that are more congruent with the established phylogenies than our alignment-free method. However, four of the ten taxonomic groups did not have sufficient orthologous codon pairings and were therefore analyzed using only the alignment-free methods. Since the recovered phylogenies using only codon pairing largely match phylogenies from the Open Tree of Life and the NCBI taxonomy, and are comparable to trees recovered by other algorithms, we propose that codon pairing biases are phylogenetically conserved and should be considered in conjunction with other phylogenomic techniques.
Collapse
Affiliation(s)
- Justin B. Miller
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| | - Lauren M. McKinnon
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| | - Michael F. Whiting
- Department of Biology, Brigham Young University, Provo, UT, United States of America
- M.L. Bean Museum, Brigham Young University, Provo, UT, United States of America
| | - John S. K. Kauwe
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| | - Perry G. Ridge
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| |
Collapse
|
17
|
Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B. 'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. NAR Genom Bioinform 2020; 2:lqz013. [PMID: 33575565 PMCID: PMC7671388 DOI: 10.1093/nargab/lqz013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Revised: 07/31/2019] [Accepted: 10/13/2019] [Indexed: 02/03/2023] Open
Abstract
Word-based or 'alignment-free' methods for phylogeny inference have become popular in recent years. These methods are much faster than traditional, alignment-based approaches, but they are generally less accurate. Most alignment-free methods calculate 'pairwise' distances between nucleic-acid or protein sequences; these distance values can then be used as input for tree-reconstruction programs such as neighbor-joining. In this paper, we propose the first word-based phylogeny approach that is based on 'multiple' sequence comparison and 'maximum likelihood'. Our algorithm first samples small, gap-free alignments involving four taxa each. For each of these alignments, it then calculates a quartet tree and, finally, the program 'Quartet MaxCut' is used to infer a super tree for the full set of input taxa from the calculated quartet trees. Experimental results show that trees produced with our approach are of high quality.
Collapse
Affiliation(s)
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Chris-André Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Michael Gerth
- Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK
| | - Christoph Bleidorn
- Department of Animal Evolution and Biodiversity, Universität Göttingen, Untere Karspüle 2, 37073 Göttingen, Germany
- Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain
| | - Sagi Snir
- Institute of Evolution, Department of Evolutionary and Environmental Biology, University of Haifa, 199 Aba Khoushy Ave. Mount Carmel, Haifa, Israel
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
| |
Collapse
|
18
|
Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One 2020; 15:e0228070. [PMID: 32040534 PMCID: PMC7010260 DOI: 10.1371/journal.pone.0228070] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2020] [Accepted: 01/08/2020] [Indexed: 12/14/2022] Open
Abstract
We study the number Nk of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences-i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor-can be estimated from the slope of a function F that depends on Nk and that is affine-linear within a certain range of k. Integers kmin and kmax can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(kmin) and F(kmax). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
Collapse
Affiliation(s)
- Sophie Röhling
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Alexander Linne
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | | | - Thomas Dencker
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
| |
Collapse
|
19
|
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform 2019; 20:426-435. [PMID: 28673025 PMCID: PMC6433738 DOI: 10.1093/bib/bbx067] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2017] [Revised: 05/04/2017] [Indexed: 11/22/2022] Open
Abstract
We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.
Collapse
|
20
|
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019; 20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 101] [Impact Index Per Article: 20.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Collapse
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
| | - Hani Z Girgis
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
| | | | - Chris-Andre Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Kujin Tang
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
| | - Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Anna Katharina Lau
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Sophie Röhling
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Jae Jin Choi
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Michael S Waterman
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Sung-Hou Kim
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Susana Vinga
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
| | - Jonas S Almeida
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NIH/NCI), Bethesda, USA
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, and School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Benjamin T James
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
| | - Fengzhu Sun
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland.
| |
Collapse
|
21
|
Criscuolo A. A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. RESEARCH IDEAS AND OUTCOMES 2019. [DOI: 10.3897/rio.5.e36178] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open
Abstract
This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
Collapse
|
22
|
Miller JB, McKinnon LM, Whiting MF, Ridge PG. CAM: an alignment-free method to recover phylogenies using codon aversion motifs. PeerJ 2019; 7:e6984. [PMID: 31198636 PMCID: PMC6555396 DOI: 10.7717/peerj.6984] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2018] [Accepted: 04/17/2019] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Common phylogenomic approaches for recovering phylogenies are often time-consuming and require annotations for orthologous gene relationships that are not always available. In contrast, alignment-free phylogenomic approaches typically use structure and oligomer frequencies to calculate pairwise distances between species. We have developed an approach to quickly calculate distances between species based on codon aversion. METHODS Utilizing a novel alignment-free character state, we present CAM, an alignment-free approach to recover phylogenies by comparing differences in codon aversion motifs (i.e., the set of unused codons within each gene) across all genes within a species. Synonymous codon usage is non-random and differs between organisms, between genes, and even within a single gene, and many genes do not use all possible codons. We report a comprehensive analysis of codon aversion within 229,742,339 genes from 23,428 species across all kingdoms of life, and we provide an alignment-free framework for its use in a phylogenetic construct. For each species, we first construct a set of codon aversion motifs spanning all genes within that species. We define the pairwise distance between two species, A and B, as one minus the number of shared codon aversion motifs divided by the total codon aversion motifs of the species, A or B, containing the fewest motifs. This approach allows us to calculate pairwise distances even when substantial differences in the number of genes or a high rate of divergence between species exists. Finally, we use neighbor-joining to recover phylogenies. RESULTS Using the Open Tree of Life and NCBI Taxonomy Database as expected phylogenies, our approach compares well, recovering phylogenies that largely match expected trees and are comparable to trees recovered using maximum likelihood and other alignment-free approaches. Our technique is much faster than maximum likelihood and similar in accuracy to other alignment-free approaches. Therefore, we propose that codon aversion be considered a phylogenetically conserved character that may be used in future phylogenomic studies. AVAILABILITY CAM, documentation, and test files are freely available on GitHub at https://github.com/ridgelab/cam.
Collapse
Affiliation(s)
- Justin B. Miller
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| | - Lauren M. McKinnon
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| | - Michael F. Whiting
- Department of Biology, Brigham Young University, Provo, UT, United States of America
- Brigham Young University, M.L. Bean Museum, Provo, UT, United States of America
| | - Perry G. Ridge
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| |
Collapse
|
23
|
Farkaš T, Sitarčík J, Brejová B, Lucká M. SWSPM: A Novel Alignment-Free DNA Comparison Method Based on Signal Processing Approaches. Evol Bioinform Online 2019; 15:1176934319849071. [PMID: 31210725 PMCID: PMC6545658 DOI: 10.1177/1176934319849071] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2019] [Accepted: 04/12/2019] [Indexed: 11/16/2022] Open
Abstract
Computing similarity between 2 nucleotide sequences is one of the fundamental problems in bioinformatics. Current methods are based mainly on 2 major approaches: (1) sequence alignment, which is computationally expensive, and (2) faster, but less accurate, alignment-free methods based on various statistical summaries, for example, short word counts. We propose a new distance measure based on mathematical transforms from the domain of signal processing. To tolerate large-scale rearrangements in the sequences, the transform is computed across sliding windows. We compare our method on several data sets with current state-of-art alignment-free methods. Our method compares favorably in terms of accuracy and outperforms other methods in running time and memory requirements. In addition, it is massively scalable up to dozens of processing units without the loss of performance due to communication overhead. Source files and sample data are available at https://bitbucket.org/fiitstubioinfo/swspm/src.
Collapse
Affiliation(s)
- Tomáš Farkaš
- Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Bratislava, Slovakia
| | - Jozef Sitarčík
- Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Bratislava, Slovakia
| | - Broňa Brejová
- Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
| | - Mária Lucká
- Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Bratislava, Slovakia
| |
Collapse
|
24
|
Leimeister CA, Schellhorn J, Dörrer S, Gerth M, Bleidorn C, Morgenstern B. Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences. Gigascience 2019; 8:giy148. [PMID: 30535314 PMCID: PMC6436989 DOI: 10.1093/gigascience/giy148] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2018] [Revised: 09/10/2018] [Accepted: 11/20/2018] [Indexed: 11/20/2022] Open
Abstract
Word-based or 'alignment-free' sequence comparison has become an active research area in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through Github: https://github.com/jschellh/ProtSpaM.
Collapse
Affiliation(s)
- Chris-Andre Leimeister
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Svenja Dörrer
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Michael Gerth
- Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK
| | - Christoph Bleidorn
- University of Göttingen, Department of Animal Evolution and Biodiversity, Untere Karspüle 2, 37073 Göttingen, Germany
- Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain
| | - Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen
| |
Collapse
|
25
|
Sarmashghi S, Bohmann K, P. Gilbert MT, Bafna V, Mirarab S. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol 2019; 20:34. [PMID: 30760303 PMCID: PMC6374904 DOI: 10.1186/s13059-019-1632-4] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2018] [Accepted: 01/16/2019] [Indexed: 01/10/2023] Open
Abstract
The ability to inexpensively describe taxonomic diversity is critical in this era of rapid climate and biodiversity changes. The recent genome-skimming approach extends current barcoding practices beyond short markers by applying low-pass sequencing and recovering whole organelle genomes computationally. This approach discards the nuclear DNA, which constitutes the vast majority of the data. In contrast, we suggest using all unassembled reads. We introduce an assembly-free and alignment-free tool, Skmer, to compute genomic distances between the query and reference genome skims. Skmer shows excellent accuracy in estimating distances and identifying the closest match in reference datasets.
Collapse
Affiliation(s)
- Shahab Sarmashghi
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, 92093 CA USA
| | - Kristine Bohmann
- Evolutionary Genomics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
- School of Biological Sciences, University of East Anglia, Norwich, Norfolk UK
| | - M. Thomas P. Gilbert
- Evolutionary Genomics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
- Norwegian University of Science and Technology, Trondheim, 7491 Norway
| | - Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, 92093 CA USA
| | - Siavash Mirarab
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, 92093 CA USA
| |
Collapse
|
26
|
Han GB, Cho DH. Genome classification improvements based on k-mer intervals in sequences. Genomics 2018; 111:1574-1582. [PMID: 30439480 DOI: 10.1016/j.ygeno.2018.11.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2018] [Revised: 10/13/2018] [Accepted: 11/05/2018] [Indexed: 10/27/2022]
Abstract
Given the vast amount of genomic data, alignment-free sequence comparison methods are required due to their low computational complexity. k-mer based methods can improve comparison accuracy by extracting an effective feature of the genome sequences. The aim of this paper is to extract k-mer intervals of a sequence as a feature of a genome for high comparison accuracy. In the proposed method, we calculated the distance between genome sequences by comparing the distribution of k-mer intervals. Then, we identified the classification results using phylogenetic trees. We used viral, mitochondrial (MT), microbial and mammalian genome sequences to perform classification for various genome sets. We confirmed that the proposed method provides a better classification result than other k-mer based methods. Furthermore, the proposed method could efficiently be applied to long sequences such as human and mouse genomes.
Collapse
Affiliation(s)
- Gyu-Bum Han
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
| | - Dong-Ho Cho
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea.
| |
Collapse
|
27
|
Zheng W, Yang L, Genco RJ, Wactawski-Wende J, Buck M, Sun Y. SENSE: Siamese neural network for sequence embedding and alignment-free comparison. Bioinformatics 2018; 35:1820-1828. [PMID: 30346493 PMCID: PMC7963080 DOI: 10.1093/bioinformatics/bty887] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2018] [Revised: 09/04/2018] [Accepted: 10/18/2018] [Indexed: 02/06/2023] Open
Abstract
MOTIVATION Sequence analysis is arguably a foundation of modern biology. Classic approaches to sequence analysis are based on sequence alignment, which is limited when dealing with large-scale sequence data. A dozen of alignment-free approaches have been developed to provide computationally efficient alternatives to alignment-based approaches. However, existing methods define sequence similarity based on various heuristics and can only provide rough approximations to alignment distances. RESULTS In this article, we developed a new approach, referred to as SENSE (SiamEse Neural network for Sequence Embedding), for efficient and accurate alignment-free sequence comparison. The basic idea is to use a deep neural network to learn an explicit embedding function based on a small training dataset to project sequences into an embedding space so that the mean square error between alignment distances and pairwise distances defined in the embedding space is minimized. To the best of our knowledge, this is the first attempt to use deep learning for alignment-free sequence analysis. A large-scale experiment was performed that demonstrated that our method significantly outperformed the state-of-the-art alignment-free methods in terms of both efficiency and accuracy. AVAILABILITY AND IMPLEMENTATION Open-source software for the proposed method is developed and freely available at https://www.acsu.buffalo.edu/∼yijunsun/lab/SENSE.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | - Robert J Genco
- Department of Oral Biology, University at Buffalo, The State University of New York, Buffalo, NY, USA,Department of Microbiology and Immunology, University at Buffalo, The State University of New York, Buffalo, NY, USA
| | - Jean Wactawski-Wende
- Department of Epidemiology and Environmental Health, University at Buffalo, The State University of New York, Buffalo, NY, USA
| | - Michael Buck
- Department of Biochemistry, University at Buffalo, The State University of New York, Buffalo, NY, USA
| | - Yijun Sun
- To whom correspondence should be addressed.
| |
Collapse
|
28
|
Abstract
Evolutionary processes have been described not only in biology but also for a wide range of human cultural activities including languages and law. In contrast to the evolution of DNA or protein sequences, the detailed mechanisms giving rise to the observed evolution-like processes are not or only partially known. The absence of a mechanistic model of evolution implies that it remains unknown how the distances between different taxa have to be quantified. Considering distortions of metric distances, we first show that poor choices of the distance measure can lead to incorrect phylogenetic trees. Based on the well-known fact that phylogenetic inference requires additive metrics, we then show that the correct phylogeny can be computed from a distance matrix \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${\mathbf {D}}$$\end{document}D if there is a monotonic, subadditive function \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\zeta$$\end{document}ζ such that \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\zeta ^{-1}({\mathbf {D}})$$\end{document}ζ-1(D) is additive. The required metric-preserving transformation \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\zeta$$\end{document}ζ can be computed as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process remains elusive.
Collapse
Affiliation(s)
- Nancy Retzlaff
- Max-Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103, Leipzig, Germany.,Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstrasse 16-18, 04107, Leipzig, Germany
| | - Peter F Stadler
- Max-Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103, Leipzig, Germany. .,Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstrasse 16-18, 04107, Leipzig, Germany. .,Bioinformatics Group, Department of Computer Science and Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstrasse 16-18, 04107, Leipzig, Germany. .,Fraunhofer Institut für Zelltherapie und Immunologie - IZI, Perlickstraße 1, 04103, Leipzig, Germany. .,Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, 1090, Vienna, Austria. .,Center for Non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, 1870, Frederiksberg C, Denmark. .,Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM, 87501, USA.
| |
Collapse
|
29
|
Morgenstern B, Schöbel S, Leimeister CA. Phylogeny reconstruction based on the length distribution of k-mismatch common substrings. Algorithms Mol Biol 2017; 12:27. [PMID: 29238399 PMCID: PMC5724348 DOI: 10.1186/s13015-017-0118-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2017] [Accepted: 11/28/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences. Haubold et al. (J Comput Biol 16:1487-1500, 2009) showed how the average number of substitutions per position between two DNA sequences can be estimated based on the average length of exact common substrings. RESULTS In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position can be accurately estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.
Collapse
Affiliation(s)
- Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Svenja Schöbel
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Chris-André Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| |
Collapse
|
30
|
Zhang Y, Alekseyenko AV. Phylogenic inference using alignment-free methods for applications in microbial community surveys using 16s rRNA gene. PLoS One 2017; 12:e0187940. [PMID: 29136663 PMCID: PMC5685621 DOI: 10.1371/journal.pone.0187940] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2017] [Accepted: 10/27/2017] [Indexed: 02/01/2023] Open
Abstract
The diversity of microbiota is best explored by understanding the phylogenetic structure of the microbial communities. Traditionally, sequence alignment has been used for phylogenetic inference. However, alignment-based approaches come with significant challenges and limitations when massive amounts of data are analyzed. In the recent decade, alignment-free approaches have enabled genome-scale phylogenetic inference. Here we evaluate three alignment-free methods: ACS, CVTree, and Kr for phylogenetic inference with 16s rRNA gene data. We use a taxonomic gold standard to compare the accuracy of alignment-free phylogenetic inference with that of common microbiome-wide phylogenetic inference pipelines based on PyNAST and MUSCLE alignments with FastTree and RAxML. We re-simulate fecal communities from Human Microbiome Project data to evaluate the performance of the methods on datasets with properties of real data. Our comparisons show that alignment-free methods are not inferior to alignment-based methods in giving accurate and robust phylogenic trees. Moreover, consensus ensembles of alignment-free phylogenies are superior to those built from alignment-based methods in their ability to highlight community differences in low power settings. In addition, the overall running times of alignment-based and alignment-free phylogenetic inference are comparable. Taken together our empirical results suggest that alignment-free methods provide a viable approach for microbiome-wide phylogenetic inference.
Collapse
Affiliation(s)
- Yifei Zhang
- Department of Medicine, New York University School of Medicine, New York, NY, United States of America
| | - Alexander V. Alekseyenko
- Department of Medicine, New York University School of Medicine, New York, NY, United States of America
- Biomedical Informatics Center, Departments of Public Health Sciences and Oral Health Sciences, Program for Human Microbiome Research, Medical University of South Carolina, Charleston, SC, United States of America
- * E-mail:
| |
Collapse
|
31
|
Chica C, Louis A, Roest Crollius H, Colot V, Roudier F. Comparative epigenomics in the Brassicaceae reveals two evolutionarily conserved modes of PRC2-mediated gene regulation. Genome Biol 2017; 18:207. [PMID: 29084582 PMCID: PMC5663038 DOI: 10.1186/s13059-017-1333-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2017] [Accepted: 10/03/2017] [Indexed: 01/05/2023] Open
Abstract
Background Polycomb Repressive Complexes 2 (PRC2) are multi-protein chromatin modifiers that are evolutionarily conserved among eukaryotes and play key roles in the regulation of gene expression, notably through the trimethylation of lysine 27 of histone H3 (H3K27me3). Although PRC2-mediated gene regulation has been studied in many organisms, few studies have explored in depth the evolutionary conservation of PRC2 targets. Results Here, we compare the H3K27me3 epigenomic profiles for the two closely related species Arabidopsis thaliana and Arabidopsis lyrata and the more distant species Arabis alpina, three Brassicaceae that diverged from each other within the past 24 million years. Using a robust set of gene orthologs present in the three species, we identify two classes of evolutionarily conserved PRC2 targets, which are characterized by either developmentally plastic or developmentally constrained H3K27me3 marking across species. Constrained H3K27me3 marking is associated with higher conservation of promoter sequence information content and higher nucleosome occupancy compared to plastic H3K27me3 marking. Moreover, gene orthologs with constrained H3K27me3 marking exhibit a higher degree of tissue specificity and tend to be involved in developmental functions, whereas gene orthologs with plastic H3K27me3 marking preferentially encode proteins associated with metabolism and stress responses. In addition, gene orthologs with constrained H3K27me3 marking are the predominant contributors to higher-order chromosome organization. Conclusions Our findings indicate that developmentally plastic and constrained H3K27me3 marking define two evolutionarily conserved modes of PRC2-mediated gene regulation that are associated with distinct selective pressures operating at multiple scales, from DNA sequence to gene function and chromosome architecture. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1333-9) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Claudia Chica
- Institut de Biologie de l'Ecole Normale Supérieure (IBENS), Ecole Normale Supérieure, Centre National de la Recherche Scientifique (CNRS), Institut National de la Santé et de la Recherche Médicale (INSERM), Paris, F-75005, France.,Present address: Institut Pasteur, Bioinformatics and Biostatistics Hub, C3BI, USR 3756 IP CNRS, Paris, France
| | - Alexandra Louis
- Institut de Biologie de l'Ecole Normale Supérieure (IBENS), Ecole Normale Supérieure, Centre National de la Recherche Scientifique (CNRS), Institut National de la Santé et de la Recherche Médicale (INSERM), Paris, F-75005, France
| | - Hugues Roest Crollius
- Institut de Biologie de l'Ecole Normale Supérieure (IBENS), Ecole Normale Supérieure, Centre National de la Recherche Scientifique (CNRS), Institut National de la Santé et de la Recherche Médicale (INSERM), Paris, F-75005, France
| | - Vincent Colot
- Institut de Biologie de l'Ecole Normale Supérieure (IBENS), Ecole Normale Supérieure, Centre National de la Recherche Scientifique (CNRS), Institut National de la Santé et de la Recherche Médicale (INSERM), Paris, F-75005, France.
| | - François Roudier
- Institut de Biologie de l'Ecole Normale Supérieure (IBENS), Ecole Normale Supérieure, Centre National de la Recherche Scientifique (CNRS), Institut National de la Santé et de la Recherche Médicale (INSERM), Paris, F-75005, France. .,Present address: Laboratoire Reproduction et Développement des Plantes, Univ Lyon, ENS de Lyon, UCB Lyon 1, CNRS, INRA, F-69342, Lyon, France.
| |
Collapse
|
32
|
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 248] [Impact Index Per Article: 35.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023] Open
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use for their research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
Collapse
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
| | - Susana Vinga
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
| | - Jonas Almeida
- Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
| | - Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland.
| |
Collapse
|
33
|
Thankachan SV, Chockalingam SP, Liu Y, Krishnan A, Aluru S. A greedy alignment-free distance estimator for phylogenetic inference. BMC Bioinformatics 2017; 18:238. [PMID: 28617225 PMCID: PMC5471951 DOI: 10.1186/s12859-017-1658-0] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mismatches in the common substrings. RESULTS We present ALFRED-G, a greedy alignment-free distance estimator for phylogenetic tree reconstruction based on the concept of the generalized ACS approach. In this algorithm, we have investigated a new heuristic to efficiently compute the lengths of common strings with mismatches allowed, and have further applied this heuristic to phylogeny reconstruction. Performance evaluation using real sequence datasets shows that our heuristic is able to reconstruct comparable, or even more accurate, phylogenetic tree topologies than the kmacs heuristic algorithm at highly competitive speed. CONCLUSIONS ALFRED-G is an alignment-free heuristic for evolutionary distance estimation between two biological sequences. This algorithm is implemented in C++ and has been incorporated into our open-source ALFRED software package ( http://alurulab.cc.gatech.edu/phylo ).
Collapse
Affiliation(s)
- Sharma V. Thankachan
- Department of Computer Science, University of Central Florida, Orlando, 32816 FL USA
| | - Sriram P. Chockalingam
- Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, 30332 GA USA
| | - Yongchao Liu
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332 GA USA
| | - Ambujam Krishnan
- School of Electrical Engineering and Computer Science, Louisiana State University, Baton Rouge, 70703 LA USA
| | - Srinivas Aluru
- Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, 30332 GA USA
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332 GA USA
| |
Collapse
|
34
|
Leimeister CA, Sohrabi-Jahromi S, Morgenstern B. Fast and accurate phylogeny reconstruction using filtered spaced-word matches. Bioinformatics 2017; 33:971-979. [PMID: 28073754 PMCID: PMC5409309 DOI: 10.1093/bioinformatics/btw776] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2016] [Accepted: 12/02/2016] [Indexed: 11/13/2022] Open
Abstract
Motivation Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. Results We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. Availability and Implementation The program source code for FSWM including a documentation, as well as the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/ Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chris-André Leimeister
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077?Göttingen, Germany
| | - Salma Sohrabi-Jahromi
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077?Göttingen, Germany
| | - Burkhard Morgenstern
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany.,University of Göttingen, Center for Computational Sciences, Goldschmidtstr. 1, 37077 Göttingen, Germany
| |
Collapse
|
35
|
Yang L, Zhang W. A Multiresolution Graphical Representation for Similarity Relationship and Multiresolution Clustering for Biological Sequences. J Comput Biol 2017; 24:299-310. [DOI: 10.1089/cmb.2016.0030] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- Lianping Yang
- College of Sciences, Northeastern University, Shenyang, China
| | - Weilin Zhang
- Department of Mathematics, New York University Shanghai, Shanghai, China
| |
Collapse
|
36
|
Domazet-Lošo M, Domazet-Lošo T. gmos: Rapid Detection of Genome Mosaicism over Short Evolutionary Distances. PLoS One 2016; 11:e0166602. [PMID: 27846272 PMCID: PMC5112998 DOI: 10.1371/journal.pone.0166602] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2016] [Accepted: 11/01/2016] [Indexed: 12/12/2022] Open
Abstract
Prokaryotic and viral genomes are often altered by recombination and horizontal gene transfer. The existing methods for detecting recombination are primarily aimed at viral genomes or sets of loci, since the expensive computation of underlying statistical models often hinders the comparison of complete prokaryotic genomes. As an alternative, alignment-free solutions are more efficient, but cannot map (align) a query to subject genomes. To address this problem, we have developed gmos (Genome MOsaic Structure), a new program that determines the mosaic structure of query genomes when compared to a set of closely related subject genomes. The program first computes local alignments between query and subject genomes and then reconstructs the query mosaic structure by choosing the best local alignment for each query region. To accomplish the analysis quickly, the program mostly relies on pairwise alignments and constructs multiple sequence alignments over short overlapping subject regions only when necessary. This fine-tuned implementation achieves an efficiency comparable to an alignment-free tool. The program performs well for simulated and real data sets of closely related genomes and can be used for fast recombination detection; for instance, when a new prokaryotic pathogen is discovered. As an example, gmos was used to detect genome mosaicism in a pathogenic Enterococcus faecium strain compared to seven closely related genomes. The analysis took less than two minutes on a single 2.1 GHz processor. The output is available in fasta format and can be visualized using an accessory program, gmosDraw (freely available with gmos).
Collapse
Affiliation(s)
- Mirjana Domazet-Lošo
- Department of Applied Computing, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
- * E-mail:
| | - Tomislav Domazet-Lošo
- Laboratory of Evolutionary Genetics, Ruđer Bošković Institute, Zagreb, Croatia
- Catholic University of Croatia, Zagreb, Croatia
| |
Collapse
|
37
|
Ultra Large Gene Families: A Matter of Adaptation or Genomic Parasites? Life (Basel) 2016; 6:life6030032. [PMID: 27509525 PMCID: PMC5041008 DOI: 10.3390/life6030032] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2016] [Revised: 06/27/2016] [Accepted: 07/20/2016] [Indexed: 01/17/2023] Open
Abstract
Gene duplication is an important mechanism of molecular evolution. It offers a fast track to modification, diversification, redundancy or rescue of gene function. However, duplication may also be neutral or (slightly) deleterious, and often ends in pseudo-geneisation. Here, we investigate the phylogenetic distribution of ultra large gene families on long and short evolutionary time scales. In particular, we focus on a family of NACHT-domain and leucine-rich-repeat-containing (NLR)-genes, which we previously found in large numbers to occupy one chromosome arm of the zebrafish genome. We were interested to see whether such a tight clustering is characteristic for ultra large gene families. Our data reconfirm that most gene family inflations are lineage-specific, but we can only identify very few gene clusters. Based on our observations we hypothesise that, beyond a certain size threshold, ultra large gene families continue to proliferate in a mechanism we term “run-away evolution”. This process might ultimately lead to the failure of genomic integrity and drive species to extinction.
Collapse
|
38
|
Bernard G, Chan CX, Ragan MA. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer. Sci Rep 2016; 6:28970. [PMID: 27363362 PMCID: PMC4929450 DOI: 10.1038/srep28970] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2016] [Accepted: 06/13/2016] [Indexed: 12/22/2022] Open
Abstract
Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution.
Collapse
Affiliation(s)
- Guillaume Bernard
- Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, QLD 4072, Australia
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, QLD 4072, Australia
| | - Mark A. Ragan
- Institute for Molecular Bioscience, and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, QLD 4072, Australia
| |
Collapse
|
39
|
Bromberg R, Grishin NV, Otwinowski Z. Phylogeny Reconstruction with Alignment-Free Method That Corrects for Horizontal Gene Transfer. PLoS Comput Biol 2016; 12:e1004985. [PMID: 27336403 PMCID: PMC4918981 DOI: 10.1371/journal.pcbi.1004985] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2015] [Accepted: 05/10/2016] [Indexed: 01/20/2023] Open
Abstract
Advances in sequencing have generated a large number of complete genomes. Traditionally, phylogenetic analysis relies on alignments of orthologs, but defining orthologs and separating them from paralogs is a complex task that may not always be suited to the large datasets of the future. An alternative to traditional, alignment-based approaches are whole-genome, alignment-free methods. These methods are scalable and require minimal manual intervention. We developed SlopeTree, a new alignment-free method that estimates evolutionary distances by measuring the decay of exact substring matches as a function of match length. SlopeTree corrects for horizontal gene transfer, for composition variation and low complexity sequences, and for branch-length nonlinearity caused by multiple mutations at the same site. We tested SlopeTree on 495 bacteria, 73 archaea, and 72 strains of Escherichia coli and Shigella. We compared our trees to the NCBI taxonomy, to trees based on concatenated alignments, and to trees produced by other alignment-free methods. The results were consistent with current knowledge about prokaryotic evolution. We assessed differences in tree topology over different methods and settings and found that the majority of bacteria and archaea have a core set of proteins that evolves by descent. In trees built from complete genomes rather than sets of core genes, we observed some grouping by phenotype rather than phylogeny, for instance with a cluster of sulfur-reducing thermophilic bacteria coming together irrespective of their phyla. The source-code for SlopeTree is available at: http://prodata.swmed.edu/download/pub/slopetree_v1/slopetree.tar.gz. Due to their lack of distinct morphological features, bacteria and archaea were extremely difficult to classify until technology was developed to obtain their DNA sequences; these sequences could then be compared to estimate evolutionary relationships. Now, due to technological advances, there is a flood of available sequences from a wide variety of organisms. These advances have spurred the development of algorithms which can estimate evolutionary relationships using whole genomes, in contrast to the more traditional methods which used single genes earlier and now typically use groups of conserved genes. However, there are many challenges when attempting to infer evolutionary relationships, in particular horizontal gene transfer, where DNA is transferred from one organism to another, resulting in an organism’s genome containing DNA that does not reflect its evolution by descent. We developed a new whole-genome method for estimating evolutionary distances which identifies and corrects for horizontal transfer. We found that for SlopeTree and all other whole-genome methods we applied, horizontal transfer causes some evolutionary distances to be grossly underestimated, and that our correction corrects for this.
Collapse
Affiliation(s)
- Raquel Bromberg
- Department of Biophysics and Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
| | - Nick V. Grishin
- Department of Biophysics and Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
| | - Zbyszek Otwinowski
- Department of Biophysics and Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
- * E-mail:
| |
Collapse
|
40
|
Thankachan SV, Chockalingam SP, Liu Y, Apostolico A, Aluru S. ALFRED: A Practical Method for Alignment-Free Distance Computation. J Comput Biol 2016; 23:452-60. [PMID: 27138275 DOI: 10.1089/cmb.2015.0217] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged to be one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Albeit having a good worst-case time complexity [Formula: see text], the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performances with an expected run-time of [Formula: see text]. By applying our program to phylogenetic inference as a case study, we find that our program facilitates to exactly reconstruct the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in C++ programming language and the source code is freely available online.
Collapse
Affiliation(s)
| | - Sriram P Chockalingam
- 2 Department of Computer Science and Engineering, Indian Institute of Technology Bombay , Mumbai, India
| | - Yongchao Liu
- 1 College of Computing, Georgia Institute of Technology , Atlanta, Georgia
| | - Alberto Apostolico
- 1 College of Computing, Georgia Institute of Technology , Atlanta, Georgia
| | - Srinivas Aluru
- 1 College of Computing, Georgia Institute of Technology , Atlanta, Georgia
| |
Collapse
|
41
|
Pizzi C. MissMax: alignment-free sequence comparison with mismatches through filtering and heuristics. Algorithms Mol Biol 2016; 11:6. [PMID: 27103940 PMCID: PMC4839165 DOI: 10.1186/s13015-016-0072-x] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2015] [Accepted: 01/08/2016] [Indexed: 11/11/2022] Open
Abstract
Background Measuring sequence similarity is central for many problems in bioinformatics. In several contexts alignment-free techniques based on exact occurrences of substrings are faster, but also less accurate, than alignment-based approaches. Recently, several studies attempted to bridge the accuracy gap with the introduction of approximate matches in the definition of composition-based similarity measures. Results In this work we present MissMax, an exact algorithm for the computation of the longest common substring with mismatches between each suffix of a sequence x and a sequence y. This collection of statistics is useful for the computation of two similarity measures: the longest and the average common substring with k mismatches. As a further contribution we provide a “relaxed” version of MissMax that does not guarantee the exact solution, but it is faster in practice and still very precise.
Collapse
|
42
|
An estimator for local analysis of genome based on the minimal absent word. J Theor Biol 2016; 395:23-30. [PMID: 26829314 DOI: 10.1016/j.jtbi.2016.01.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2015] [Revised: 01/17/2016] [Accepted: 01/19/2016] [Indexed: 11/22/2022]
Abstract
This study presents an alternative alignment-free relative feature analysis method based on the minimal absent word, which has potential advantages over the local alignment method in local analysis. Smooth-local-analysis-curve and similarity-distribution are constructed for a fast, efficient, and visual comparison. Moreover, when the multi-sequence-comparison is needed, the local-analysis-curves can illustrate some interesting zones.
Collapse
|
43
|
Yang WF, Yu ZG, Anh V. Whole genome/proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation. Mol Phylogenet Evol 2015; 96:102-111. [PMID: 26724405 DOI: 10.1016/j.ympev.2015.12.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2015] [Revised: 12/17/2015] [Accepted: 12/18/2015] [Indexed: 01/18/2023]
Abstract
UNLABELLED Traditional methods for sequence comparison and phylogeny reconstruction rely on pair wise and multiple sequence alignments. But alignment could not be directly applied to whole genome/proteome comparison and phylogenomic studies due to their high computational complexity. Hence alignment-free methods became popular in recent years. Here we propose a fast alignment-free method for whole genome/proteome comparison and phylogeny reconstruction using higher order Markov model and chaos game representation. In the present method, we use the transition matrices of higher order Markov models to characterize amino acid or DNA sequences for their comparison. The order of the Markov model is uniquely identified by maximizing the average Shannon entropy of conditional probability distributions. Using one-dimensional chaos game representation and linked list, this method can reduce large memory and time consumption which is due to the large-scale conditional probability distributions. To illustrate the effectiveness of our method, we employ it for fast phylogeny reconstruction based on genome/proteome sequences of two species data sets used in previous published papers. Our results demonstrate that the present method is useful and efficient. AVAILABILITY AND IMPLEMENTATION The source codes for our algorithm to get the distance matrix and genome/proteome sequences can be downloaded from ftp://121.199.20.25/. The software Phylip and EvolView we used to construct phylogenetic trees can be referred from their websites.
Collapse
Affiliation(s)
- Wei-Feng Yang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; Department of Mathematics and Physics, Hunan Institute of Engineering, Hunan 411104, PR China.
| | - Zu-Guo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
| | - Vo Anh
- School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
| |
Collapse
|
44
|
Xie XH, Yu ZG, Han GS, Yang WF, Anh V. Whole-proteome based phylogenetic tree construction with inter-amino-acid distances and the conditional geometric distribution profiles. Mol Phylogenet Evol 2015; 89:37-45. [PMID: 25882834 DOI: 10.1016/j.ympev.2015.04.008] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2014] [Revised: 03/29/2015] [Accepted: 04/06/2015] [Indexed: 11/18/2022]
Abstract
There has been a growing interest in alignment-free methods for whole genome comparison and phylogenomic studies. In this study, we propose an alignment-free method for phylogenetic tree construction using whole-proteome sequences. Based on the inter-amino-acid distances, we first convert the whole-proteome sequences into inter-amino-acid distance vectors, which are called observed inter-amino-acid distance profiles. Then, we propose to use conditional geometric distribution profiles (the distributions of sequences where the amino acids are placed randomly and independently) as the reference distribution profiles. Last the relative deviation between the observed and reference distribution profiles is used to define a simple metric that reflects the phylogenetic relationships between whole-proteome sequences of different organisms. We name our method inter-amino-acid distances and conditional geometric distribution profiles (IAGDP). We evaluate our method on two data sets: the benchmark dataset including 29 genomes used in previous published papers, and another one including 67 mammal genomes. Our results demonstrate that the new method is useful and efficient.
Collapse
Affiliation(s)
- Xian-Hua Xie
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematics and Computer Science, Gannan Normal University, Jiangxi 341000, PR China.
| | - Zu-Guo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China; School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
| | - Guo-Sheng Han
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China.
| | - Wei-Feng Yang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, PR China.
| | - Vo Anh
- School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.
| |
Collapse
|
45
|
Morgenstern B, Zhu B, Horwege S, Leimeister CA. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms Mol Biol 2015; 10:5. [PMID: 25685176 PMCID: PMC4327811 DOI: 10.1186/s13015-015-0032-x] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2014] [Accepted: 01/06/2015] [Indexed: 01/06/2023] Open
Abstract
Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of 'match positions' and 'don't care positions'. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.
Collapse
|
46
|
Haubold B, Klötzl F, Pfaffelhuber P. andi: fast and accurate estimation of evolutionary distances between closely related genomes. ACTA ACUST UNITED AC 2014; 31:1169-75. [PMID: 25504847 DOI: 10.1093/bioinformatics/btu815] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2014] [Accepted: 12/07/2014] [Indexed: 11/13/2022]
Abstract
MOTIVATION A standard approach to classifying sets of genomes is to calculate their pairwise distances. This is difficult for large samples. We have therefore developed an algorithm for rapidly computing the evolutionary distances between closely related genomes. RESULTS Our distance measure is based on ungapped local alignments that we anchor through pairs of maximal unique matches of a minimum length. These exact matches can be looked up efficiently using enhanced suffix arrays and our implementation requires approximately only 1 s and 45 MB RAM/Mbase analysed. The pairing of matches distinguishes non-homologous from homologous regions leading to accurate distance estimation. We show this by analysing simulated data and genome samples ranging from 29 Escherichia coli/Shigella genomes to 3085 genomes of Streptococcus pneumoniae. AVAILABILITY AND IMPLEMENTATION We have implemented the computation of anchor distances in the multithreaded UNIX command-line program andi for ANchor DIstances. C sources and documentation are posted at http://github.com/evolbioinf/andi/ CONTACT haubold@evolbio.mpg.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Bernhard Haubold
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Germany, Institue for Neuro- and Bioinformatics, Lübeck University, 23562 Lübeck, Germany and Mathematical Stochastics, Mathematical Institute, Freiburg University, Germany
| | - Fabian Klötzl
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Germany, Institue for Neuro- and Bioinformatics, Lübeck University, 23562 Lübeck, Germany and Mathematical Stochastics, Mathematical Institute, Freiburg University, Germany Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Germany, Institue for Neuro- and Bioinformatics, Lübeck University, 23562 Lübeck, Germany and Mathematical Stochastics, Mathematical Institute, Freiburg University, Germany
| | - Peter Pfaffelhuber
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Germany, Institue for Neuro- and Bioinformatics, Lübeck University, 23562 Lübeck, Germany and Mathematical Stochastics, Mathematical Institute, Freiburg University, Germany
| |
Collapse
|
47
|
Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, Leimeister CA, Morgenstern B. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res 2014; 42:W7-11. [PMID: 24829447 PMCID: PMC4086093 DOI: 10.1093/nar/gku398] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing ‘don't care’ or ‘wildcard’ symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating for each position in the first sequence the length of the longest substring at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction. The two alignment-free programmes are accessible through a web interface at ‘Göttingen Bioinformatics Compute Server (GOBICS)’: http://spaced.gobics.dehttp://kmacs.gobics.de and the source codes can be downloaded.
Collapse
Affiliation(s)
- Sebastian Horwege
- University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany
| | - Sebastian Lindner
- University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany
| | - Marcus Boden
- University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany
| | - Klas Hatje
- Max-Planck-Institute for Biophysical Chemistry, Department of NMR-based Structural Biology, Group Systems Biology of Motor Proteins, Am Fassberg 11, 37077 Göttingen, Germany
| | - Martin Kollmar
- Max-Planck-Institute for Biophysical Chemistry, Department of NMR-based Structural Biology, Group Systems Biology of Motor Proteins, Am Fassberg 11, 37077 Göttingen, Germany
| | - Chris-André Leimeister
- University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany
| | - Burkhard Morgenstern
- University of Göttingen, Institute of Microbiology and Genetics, Department of Bioinformatics, Goldschmidtstraße 1, 37073 Göttingen, Germany Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 23 Boulevard de France, 91037 Évry, France
| |
Collapse
|
48
|
Leimeister CA, Morgenstern B. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. ACTA ACUST UNITED AC 2014; 30:2000-8. [PMID: 24828656 PMCID: PMC4080746 DOI: 10.1093/bioinformatics/btu331] [Citation(s) in RCA: 92] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. RESULTS To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially on protein sequences, our method seems to be superior. On simulated protein families, kmacs even outperformed a classical approach to phylogeny reconstruction using multiple alignment and maximum likelihood. AVAILABILITY AND IMPLEMENTATION kmacs is implemented in C++, and the source code is freely available at http://kmacs.gobics.de/.
Collapse
Affiliation(s)
- Chris-Andre Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37073 Göttingen, Germany and Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne, UMR CNRS 8071, USC INRA, 23 Boulevard de France, 91037 Évry, France
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37073 Göttingen, Germany and Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne, UMR CNRS 8071, USC INRA, 23 Boulevard de France, 91037 Évry, FranceDepartment of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37073 Göttingen, Germany and Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne, UMR CNRS 8071, USC INRA, 23 Boulevard de France, 91037 Évry, France
| |
Collapse
|
49
|
Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B. Fast alignment-free sequence comparison using spaced-word frequencies. ACTA ACUST UNITED AC 2014; 30:1991-9. [PMID: 24700317 PMCID: PMC4080745 DOI: 10.1093/bioinformatics/btu177] [Citation(s) in RCA: 105] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Motivation: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent. Results: To reduce the statistical dependency between adjacent word matches, we propose to use ‘spaced words’, defined by patterns of ‘match’ and ‘don’t care’ positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. Availability and implementation: Our program is freely available at http://spaced.gobics.de/. Contact:chris.leimeister@stud.uni-goettingen.de Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chris-Andre Leimeister
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France
| | - Marcus Boden
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France
| | - Sebastian Horwege
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France
| | - Sebastian Lindner
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France
| | - Burkhard Morgenstern
- Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, FranceDepartment of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and Université d'Évry Val d'Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France
| |
Collapse
|
50
|
Abstract
Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on comparative data, today usually DNA sequences. These have become so plentiful that alignment-free sequence comparison is of growing importance in the race between scientists and sequencing machines. In phylogenetics, efficient distance computation is the major contribution of alignment-free methods. A distance measure should reflect the number of substitutions per site, which underlies classical alignment-based phylogeny reconstruction. Alignment-free distance measures are either based on word counts or on match lengths, and I apply examples of both approaches to simulated and real data to assess their accuracy and efficiency. While phylogeny reconstruction is based on the number of substitutions, in population genetics, the distribution of mutations along a sequence is also considered. This distribution can be explored by match lengths, thus opening the prospect of alignment-free population genomics.
Collapse
|