1
Bussi Y, Kapon R, Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS One 2021; 16:e0258693. [PMID: 34648558 PMCID: PMC8516232 DOI: 10.1371/journal.pone.0258693] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 04/30/2021] [Accepted: 10/02/2021] [Indexed: 12/24/2022]
Abstract
Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
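The pairwise k-mer Jaccard similarity mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and the use of canonical k-mers (treating a k-mer and its reverse complement as the same word) is an assumption that the abstract does not confirm.

```python
from itertools import combinations

_BASES = frozenset("ACGT")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers in seq.

    Assumption: a k-mer and its reverse complement are collapsed to
    whichever of the two sorts first. K-mers containing characters
    other than A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if not set(kmer) <= _BASES:
            continue
        reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, reverse_complement))
    return kmers


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0  # undefined for two empty sets; 0.0 is a convention chosen here
    return len(a & b) / len(union)


def pairwise_jaccard(genomes, k):
    """Map each unordered pair of genome names to its k-mer Jaccard similarity.

    genomes is a dict of name -> sequence string (hypothetical input format).
    """
    sets = {name: canonical_kmers(seq, k) for name, seq in genomes.items()}
    return {
        (a, b): jaccard(sets[a], sets[b])
        for a, b in combinations(sorted(sets), 2)
    }
```

A similarity matrix built this way could then be passed to a hierarchical clustering routine; for genome-scale inputs and k of 21 or 31, a sketching approach rather than full Python sets would likely be needed.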
Affiliation(s)
- Yuval Bussi
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
  - Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
  - Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
- Ruti Kapon
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
- Ziv Reich
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
2
Mahato NK, Gupta V, Singh P, Kumari R, Verma H, Tripathi C, Rani P, Sharma A, Singhvi N, Sood U, Hira P, Kohli P, Nayyar N, Puri A, Bajaj A, Kumar R, Negi V, Talwar C, Khurana H, Nagar S, Sharma M, Mishra H, Singh AK, Dhingra G, Negi RK, Shakarad M, Singh Y, Lal R. Microbial taxonomy in the era of OMICS: application of DNA sequences, computational tools and techniques. Antonie van Leeuwenhoek 2017; 110:1357-1371. [PMID: 28831610 DOI: 10.1007/s10482-017-0928-1] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Received: 01/17/2017] [Accepted: 08/10/2017] [Indexed: 02/06/2023]
Abstract
The current prokaryotic taxonomy classifies phenotypically and genotypically diverse microorganisms using a polyphasic approach. With advances in the next-generation sequencing technologies and computational tools for analysis of genomes, the traditional polyphasic method is complemented with genomic data to delineate and classify bacterial genera and species as an alternative to cumbersome and error-prone laboratory tests. This review discusses the applications of sequence-based tools and techniques for bacterial classification and provides a scheme for more robust and reproducible bacterial classification based on genomic data. The present review highlights promising tools and techniques such as ortho-Average Nucleotide Identity, Genome to Genome Distance Calculator and Multi Locus Sequence Analysis, which can be validly employed for characterizing novel microorganisms and assessing phylogenetic relationships. In addition, the review discusses the possibility of employing metagenomic data to assess the phylogenetic associations of uncultured microorganisms. Through this article, we present a review of genomic approaches that can be included in the scheme of taxonomy of bacteria and archaea based on computational and in silico advances to boost the credibility of taxonomic classification in this genomic era.
Affiliation(s)
- Vipin Gupta
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Priya Singh
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Rashmi Kumari
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Charu Tripathi
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Pooja Rani
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Anukriti Sharma
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Nirjara Singhvi
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Utkarsh Sood
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Princy Hira
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Puneet Kohli
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Namita Nayyar
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Akshita Puri
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Abhay Bajaj
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Roshan Kumar
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Vivek Negi
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Chandni Talwar
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Himani Khurana
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Shekhar Nagar
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Monika Sharma
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Harshita Mishra
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Amit Kumar Singh
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Gauri Dhingra
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Ram Krishan Negi
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Yogendra Singh
  - Department of Zoology, University of Delhi, Delhi, 110007, India
- Rup Lal
  - Department of Zoology, University of Delhi, Delhi, 110007, India
3
Sievers A, Bosiek K, Bisch M, Dreessen C, Riedel J, Froß P, Hausmann M, Hildenbrand G. K-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features. Genes (Basel) 2017; 8:E122. [PMID: 28422050 PMCID: PMC5406869 DOI: 10.3390/genes8040122] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 02/10/2017] [Revised: 03/24/2017] [Accepted: 04/04/2017] [Indexed: 12/26/2022]
Abstract
In genome analysis, k-mer-based comparison methods have become standard tools. However, even though they are able to deliver reliable results, other algorithms seem to work better in some cases. To improve k-mer-based DNA sequence analysis and comparison, we successfully checked whether adding positional resolution is beneficial for finding and/or comparing interesting organizational structures. A simple but efficient algorithm for extracting and saving local k-mer spectra (frequency distribution of k-mers) was developed and used. The results were analyzed by including positional information based on visualizations as genomic maps and by applying basic vector correlation methods. This analysis was concentrated on small word lengths (1 ≤ k ≤ 4) on relatively small viral genomes of Papillomaviridae and Herpesviridae, while also checking its usability for larger sequences, namely human chromosome 2 and the homologous chromosomes (2A, 2B) of a chimpanzee. Using this alignment-free analysis, several regions with specific characteristics in Papillomaviridae and Herpesviridae formerly identified by independent, mostly alignment-based methods, were confirmed. Correlations between the k-mer content and several genes in these genomes have been found, showing similarities between classified and unclassified viruses, which may be potentially useful for further taxonomic research. Furthermore, unknown k-mer correlations in the genomes of Human Herpesviruses (HHVs), which are probably of major biological function, are found and described. Using the chromosomes of a chimpanzee and human that are currently known, identities between the species on every analyzed chromosome were reproduced. This demonstrates the feasibility of our approach for large data sets of complex genomes. 
Based on these results, we suggest k-mer analysis with positional resolution as a method for closing a gap between the effectiveness of alignment-based methods (like NCBI BLAST) and the high pace of standard k-mer analysis.
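The idea of local k-mer spectra compared by basic vector correlation can be illustrated with a short sketch. This is an assumption-laden illustration rather than the authors' algorithm: the fixed sliding window, the `window` and `step` parameters, the function names, and the choice of Pearson correlation are all hypothetical choices made here.

```python
from collections import Counter
from itertools import product
from math import sqrt


def local_spectra(seq, k, window, step):
    """Return a list of (start, spectrum) pairs for sliding windows over seq.

    Each spectrum is a list of relative k-mer frequencies in a fixed
    lexicographic order over the alphabet ACGT, so spectra from different
    windows are directly comparable. Requires window >= k.
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    seq = seq.upper()
    spectra = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        counts = Counter(chunk[i:i + k] for i in range(window - k + 1))
        # Only k-mers made of A, C, G, T contribute to the total.
        total = sum(counts[w] for w in words)
        spectrum = [counts[w] / total if total else 0.0 for w in words]
        spectra.append((start, spectrum))
    return spectra


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)
```

Correlating the spectrum of every window against every other window gives a position-by-position map; for the small word lengths the abstract mentions (1 ≤ k ≤ 4) each spectrum has at most 256 entries, so this stays cheap.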
Affiliation(s)
- Aaron Sievers
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Katharina Bosiek
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Marc Bisch
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Chris Dreessen
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Jascha Riedel
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Patrick Froß
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Michael Hausmann
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Georg Hildenbrand
  - Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
  - Department of Radiation Oncology, Universitätsmedizin Mannheim, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68167 Mannheim, Germany.
4
Dubinkina VB, Ischenko DS, Ulyantsev VI, Tyakht AV, Alexeev DG. Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis. BMC Bioinformatics 2016; 17:38. [PMID: 26774270 PMCID: PMC4715287 DOI: 10.1186/s12859-015-0875-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 09/07/2015] [Accepted: 12/14/2015] [Indexed: 12/11/2022]
Abstract
BACKGROUND: A rapidly increasing flow of genomic data requires the development of efficient methods for obtaining its compact representation. Feature extraction facilitates classification, clustering and model analysis for testing and refining biological hypotheses. "Shotgun" metagenome is an analytically challenging type of genomic data - containing sequences of all genes from the totality of a complex microbial community. Recently, researchers started to analyze metagenomes using reference-free methods based on the analysis of oligonucleotides (k-mers) frequency spectrum previously applied to isolated genomes. However, little is known about their correlation with the existing approaches for metagenomic feature extraction, as well as the limits of applicability. Here we evaluated a metagenomic pairwise dissimilarity measure based on short k-mer spectrum using the example of human gut microbiota, a biomedically significant object of study.
RESULTS: We developed a method for calculating pairwise dissimilarity (beta-diversity) of "shotgun" metagenomes based on short k-mer spectra (5 ≤ k ≤ 11). The method was validated on simulated metagenomes and further applied to a large collection of human gut metagenomes from the populations of the world (n=281). The k-mer spectrum-based measure was found to behave similarly to one based on mapping to a reference gene catalog, but different from one using a genome catalog. This difference turned out to be associated with a significant presence of viral reads in a number of metagenomes. Simulations showed limited impact of bacterial genetic variability as well as sequencing errors on k-mer spectra. Specific differences between the datasets from individual populations were identified.
CONCLUSIONS: Our approach allows rapid estimation of pairwise dissimilarity between metagenomes. Though we applied this technique to gut microbiota, it should be useful for arbitrary metagenomes, even metagenomes with novel microbiota. Dissimilarity measure based on k-mer spectrum provides a wider perspective in comparison with the ones based on the alignment against reference sequence sets. It helps not to miss possible outstanding features of metagenomic composition, particularly related to the presence of an unknown bacteria, virus or eukaryote, as well as to technical artifacts (sample contamination, reads of non-biological origin, etc.) at the early stages of bioinformatic analysis. Our method is complementary to reference-based approaches and can be easily integrated into metagenomic analysis pipelines.
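A reference-free pairwise dissimilarity between two read sets can be sketched from their normalized k-mer spectra. The abstract does not name the dissimilarity measure, so Bray-Curtis is used below purely as an assumption (it is a common beta-diversity measure); the function names and the list-of-strings read format are also hypothetical.

```python
from collections import Counter

_BASES = frozenset("ACGT")


def kmer_spectrum(reads, k):
    """Return a dict mapping each observed k-mer to its relative frequency.

    reads is an iterable of read sequences (hypothetical input format).
    K-mers containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _BASES:
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two spectra (assumed measure).

    Returns 0.0 for identical spectra and 1.0 for spectra with no
    k-mers in common.
    """
    keys = set(p) | set(q)
    numerator = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)
    denominator = sum(p.get(x, 0.0) + q.get(x, 0.0) for x in keys)
    return numerator / denominator if denominator else 0.0
```

Because the spectra are normalized to relative frequencies, samples with different sequencing depths can be compared directly; for the short word lengths in the abstract (5 ≤ k ≤ 11) a dense vector of length 4^k would also be a workable representation.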
Affiliation(s)
- Veronika B Dubinkina
  - Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435, Russia.
  - Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700, Russia.
- Dmitry S Ischenko
  - Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435, Russia.
  - Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700, Russia.
- Alexander V Tyakht
  - Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435, Russia.
  - Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700, Russia.
- Dmitry G Alexeev
  - Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435, Russia.
  - Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700, Russia.