1
Ndiaye M, Prieto-Baños S, Fitzgerald LM, Yazdizadeh Kharrazi A, Oreshkov S, Dessimoz C, Sedlazeck FJ, Glover N, Majidian S. When less is more: sketching with minimizers in genomics. Genome Biol 2024; 25:270. [PMID: 39402664 PMCID: PMC11472564 DOI: 10.1186/s13059-024-03414-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Accepted: 10/01/2024] [Indexed: 10/19/2024]
Abstract
The exponential increase in sequencing data calls for conceptual and computational advances to extract useful biological insights. One such advance, minimizers, allows for reducing the quantity of data handled while maintaining some of its key properties. We provide a basic introduction to minimizers, cover recent methodological developments, and review the diverse applications of minimizers to analyze genomic data, including de novo genome assembly, metagenomics, read alignment, read correction, and pangenomes. We also touch on alternative data sketching techniques including universal hitting sets, syncmers, or strobemers. Minimizers and their alternatives have rapidly become indispensable tools for handling vast amounts of data.
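As an illustration of the minimizer idea summarized in this abstract, here is a minimal sketch that picks the lexicographically smallest k-mer in every window of `w` consecutive k-mers. The function name `minimizers` and the parameters `k` and `w` are assumptions made for this example; practical tools typically order k-mers by a hash value and use canonical k-mers, which this sketch does not do.

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs for the window minimizers of seq.

    Assumption for illustration: k-mers are compared lexicographically and
    ties are broken by taking the leftmost k-mer in the window.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: list[tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first smallest element, i.e. the leftmost one
        offset = min(range(w), key=lambda j: window[j])
        pos = start + offset
        # consecutive windows often share a minimizer; record it only once
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked


sketch = minimizers("ACGTACGTGA", k=3, w=3)
print(sketch)  # 4 minimizers stand in for the 8 k-mers of the sequence
```

The data reduction comes from the deduplication step: neighbouring windows overlap heavily, so the same k-mer is usually selected several times in a row and stored once.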
Affiliation(s)
- Malick Ndiaye
  - Department of Fundamental Microbiology, UNIL, Lausanne, Switzerland
- Silvia Prieto-Baños
  - Department of Computational Biology, UNIL, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sergey Oreshkov
  - Department of Endocrinology, Diabetology, Metabolism, CHUV, Lausanne, Switzerland
- Christophe Dessimoz
  - Department of Computational Biology, UNIL, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Natasha Glover
  - Department of Computational Biology, UNIL, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sina Majidian
  - Department of Computational Biology, UNIL, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
2
Rossignolo E, Comin M. Enhanced Compression of k-Mer Sets with Counters via de Bruijn Graphs. J Comput Biol 2024; 31:524-538. [PMID: 38820168 DOI: 10.1089/cmb.2024.0530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2024]
Abstract
An essential task in computational genomics involves transforming input sequences into their constituent k-mers. The quest for an efficient representation of k-mer sets is crucial for enhancing the scalability of bioinformatic analyses. One widely used method involves converting the k-mer set into a de Bruijn graph (dBG), followed by seeking a compact graph representation via the smallest path cover. This study introduces USTAR* (Unitig STitch Advanced constRuction), a tool designed to compress both a set of k-mers and their associated counts. USTAR leverages the connectivity and density of dBGs, enabling a more efficient path selection for constructing the path cover. The efficacy of USTAR is demonstrated through its application in compressing real read data sets. USTAR improves the compression achieved by UST (Unitig STitch), the best algorithm, by percentages ranging from 2.3% to 26.4%, depending on the k-mer size, and it is up to 7× faster.
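The pipeline this abstract describes (k-mers, then a de Bruijn graph, then a path cover) can be sketched as follows. This is a naive greedy cover written for illustration only; it is not the path-selection heuristic of USTAR or UST, and the names `count_kmers` and `greedy_path_cover` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_path_cover(kmers, alphabet="ACGT"):
    """Cover each k-mer exactly once with greedily extended dBG paths.

    Nodes are k-mers; an edge joins two k-mers that overlap by k-1 bases.
    Returns the string spelled by each path (assumes k >= 2).
    """
    unused = set(kmers)
    spelled = []
    for seed in sorted(unused):
        if seed not in unused:
            continue
        unused.remove(seed)
        k = len(seed)
        path = seed
        grown = True
        while grown:  # extend to the right while an unused successor exists
            grown = False
            for base in alphabet:
                successor = path[-(k - 1):] + base
                if successor in unused:
                    unused.remove(successor)
                    path += base
                    grown = True
                    break
        grown = True
        while grown:  # then extend to the left in the same way
            grown = False
            for base in alphabet:
                predecessor = base + path[:k - 1]
                if predecessor in unused:
                    unused.remove(predecessor)
                    path = base + path
                    grown = True
                    break
        spelled.append(path)
    return spelled


reads = ["ACGTT", "CGTTA"]
counts = count_kmers(reads, k=3)
paths = greedy_path_cover(counts)
# counts listed in path order, one list per path
path_counts = [[counts[p[i:i + 3]] for i in range(len(p) - 2)] for p in paths]
print(paths, path_counts)
```

In this toy input, four 3-mers (12 characters) are spelled by one 6-character string, which is the kind of saving a path cover aims for; fewer and longer paths mean fewer repeated overlaps.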
Affiliation(s)
- Enrico Rossignolo
  - Department of Information Engineering, University of Padua, Padua, Italy
- Matteo Comin
  - Department of Information Engineering, University of Padua, Padua, Italy
3
Podda M, Bonechi S, Palladino A, Scaramuzzino M, Brozzi A, Roma G, Muzzi A, Priami C, Sîrbu A, Bodini M. Classification of Neisseria meningitidis genomes with a bag-of-words approach and machine learning. iScience 2024; 27:109257. [PMID: 38439962 PMCID: PMC10910294 DOI: 10.1016/j.isci.2024.109257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Revised: 12/13/2023] [Accepted: 02/13/2024] [Indexed: 03/06/2024]
Abstract
Whole genome sequencing of bacteria is important to enable strain classification. Using entire genomes as an input to machine learning (ML) models would allow rapid classification of strains while using information from multiple genetic elements. We developed a "bag-of-words" approach to encode, using SentencePiece or k-mer tokenization, entire bacterial genomes and analyze these with ML. Initial model selection identified SentencePiece with 8,000 and 32,000 words as the best approach for genome tokenization. We then classified in Neisseria meningitidis genomes the capsule B group genotype with 99.6% accuracy and the multifactor invasive phenotype with 90.2% accuracy, in an independent test set. Subsequently, in silico knockouts of 2,808 genes confirmed that the ML model predictions aligned with our current understanding of the underlying biology. To our knowledge, this is the first ML method using entire bacterial genomes to classify strains and identify genes considered relevant by the classifier.
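To make the "bag-of-words" encoding concrete, the sketch below covers the k-mer tokenization variant mentioned in this abstract: a genome becomes a fixed-length vector of k-mer counts that an ML classifier could consume. The function name `kmer_bag_of_words` and the toy sequence are assumptions for illustration; the SentencePiece variant that the authors found to work best is not shown.

```python
from collections import Counter
from itertools import product


def kmer_bag_of_words(genome: str, k: int, alphabet: str = "ACGT"):
    """Encode a genome as counts over the full k-mer vocabulary.

    Returns (vocabulary, count_vector); position i of the vector is the
    number of occurrences of vocabulary[i]. Word order in the genome is
    discarded, which is what makes this a bag of words.
    """
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return vocabulary, [counts[word] for word in vocabulary]


vocab, vector = kmer_bag_of_words("ACGTAC", k=2)
print(dict(zip(vocab, vector)))
```

The vocabulary grows as 4^k, so the vector length is fixed by `k` alone and is independent of genome length.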
Affiliation(s)
- Marco Podda
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Simone Bonechi
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
  - Department of Computer Science, University of Pisa, 56127 Pisa, Italy
- Andrea Palladino
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Alessandro Brozzi
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Guglielmo Roma
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Alessandro Muzzi
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Corrado Priami
  - Department of Computer Science, University of Pisa, 56127 Pisa, Italy
- Alina Sîrbu
  - Department of Computer Science, University of Pisa, 56127 Pisa, Italy
- Margherita Bodini
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
4
Gemler BT, Mukherjee C, Howland C, Fullerton PA, Spurbeck RR, Catlin LA, Smith A, Minard-Smith AT, Bartling C. UltraSEQ, a Universal Bioinformatic Platform for Information-Based Clinical Metagenomics and Beyond. Microbiol Spectr 2023; 11:e0416022. [PMID: 37039637 PMCID: PMC10269449 DOI: 10.1128/spectrum.04160-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 03/12/2023] [Indexed: 04/12/2023]
Abstract
Applied metagenomics is a powerful emerging capability enabling the untargeted detection of pathogens, and its application in clinical diagnostics promises to alleviate the limitations of current targeted assays. While metagenomics offers a hypothesis-free approach to identify any pathogen, including unculturable and potentially novel pathogens, its application in clinical diagnostics has so far been limited by workflow-specific requirements, computational constraints, and lengthy expert review requirements. To address these challenges, we developed UltraSEQ, a first-of-its-kind accurate and scalable metagenomic bioinformatic tool for potential clinical diagnostics and biosurveillance utility. Here, we present the results of the evaluation of our novel UltraSEQ pipeline using an in silico-synthesized metagenome, mock microbial community data sets, and publicly available clinical data sets from samples of different infection types, including both short-read and long-read sequencing data. Our results show that UltraSEQ successfully detected all expected species across the tree of life in the in silico sample and detected all 10 bacterial and fungal species in the mock microbial community data set. For clinical data sets, even without requiring data set-specific configuration setting changes, background sample subtraction, or prior sample information, UltraSEQ achieved an overall accuracy of 91%. Furthermore, as an initial demonstration with a limited patient sample set, we show UltraSEQ's ability to provide antibiotic resistance and virulence factor genotypes that are consistent with phenotypic results. 
Taken together, the above-described results demonstrate that the UltraSEQ platform offers a transformative approach for microbial and metagenomic sample characterization, employing a biologically informed detection logic, deep metadata, and a flexible system architecture for the classification and characterization of taxonomic origin, gene function, and user-defined functions, including disease-causing infections.
IMPORTANCE: Traditional clinical microbiology-based diagnostic tests rely on targeted methods that can detect only one to a few preselected organisms or on slow, culture-based methods. Although widely used today, these methods have several limitations, leaving the etiology of infection unknown in >50% of cases for several disease types. Massive developments in sequencing technologies have made it possible to apply metagenomic methods to clinical diagnostics, but current offerings are limited to a specific disease type or sequencer workflow and/or require laboratory-specific controls. These limitations arise because the backend bioinformatic pipelines are optimized for the specific parameters described above, resulting in an excess of unmaintained, redundant, and niche tools that lack standardization and explainable outputs. In this paper, we demonstrate that UltraSEQ uses a novel, information-based approach that enables accurate, evidence-based predictions for diagnosis as well as the functional characterization of a sample.
5
Cavattoni M, Comin M. ClassGraph: Improving Metagenomic Read Classification with Overlap Graphs. J Comput Biol 2023. [PMID: 37023405 DOI: 10.1089/cmb.2022.0208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/08/2023]
Abstract
Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems when analyzing a microbial sample is to taxonomically annotate its reads to identify the species it contains. Most methods that are currently available focus on the classification of reads using a set of reference genomes and their k-mers. While in terms of precision these methods have reached percentages of correctness close to perfection, in terms of sensitivity (the actual number of classified reads), the performance is often poor. One reason is that the reads in a sample can be very different from the corresponding reference genomes; for example, viral genomes are usually highly mutated. To address this issue, in this article, we propose ClassGraph, a new taxonomic classification method that makes use of the read overlap graph and applies a label propagation algorithm to refine the results of existing tools. We evaluated its performance on simulated and real datasets with several taxonomic classification tools, and the results showed an improved sensitivity and F-measure, while maintaining high precision. ClassGraph is capable of improving the classification accuracy, especially in difficult cases such as virus and real datasets, where traditional tools can classify <40% of reads.
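The refinement step described in this abstract, label propagation over a read overlap graph, can be illustrated with a minimal sketch. It assumes an unweighted majority vote among already-labelled neighbours and keeps the labels assigned by the upstream classifier fixed; both are simplifications chosen for this example, not ClassGraph's actual update rule, and `propagate_labels` is a hypothetical name.

```python
from collections import Counter


def propagate_labels(overlaps, labels, max_rounds=10):
    """Spread taxonomic labels from classified reads to unclassified ones.

    overlaps: dict mapping each read id to a list of overlapping read ids.
    labels:   dict mapping already-classified read ids to a taxon.
    In each round, every unlabelled read takes the most common label among
    its labelled neighbours; rounds stop when nothing changes.
    """
    labels = dict(labels)  # work on a copy, leave the input untouched
    for _ in range(max_rounds):
        updates = {}
        for read, neighbours in overlaps.items():
            if read in labels:
                continue
            votes = Counter(labels[n] for n in neighbours if n in labels)
            if votes:
                updates[read] = votes.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels


# A chain of overlapping reads in which only r1 was classified upstream.
overlaps = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2", "r4"], "r4": ["r3"]}
refined = propagate_labels(overlaps, {"r1": "virusA"})
print(refined)
```

Each round pushes the label one overlap further along the chain, which is how reads too divergent from any reference can still inherit a classification and raise sensitivity.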
Affiliation(s)
- Matteo Comin
  - Department of Information Engineering, University of Padova, Padova, Italy
6
Ali S, Bello B, Chourasia P, Punathil RT, Zhou Y, Patterson M. PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. Biology 2022; 11:418. [PMID: 35336792 PMCID: PMC8945605 DOI: 10.3390/biology11030418] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/04/2022] [Revised: 02/24/2022] [Accepted: 03/07/2022] [Indexed: 01/14/2023]
Abstract
The study of host specificity has important connections to the question about the origin of SARS-CoV-2 in humans which led to the COVID-19 pandemic, an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real-world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime; in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus.
Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
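For readers unfamiliar with position weight matrices, the sketch below shows one generic way to turn a protein sequence into a numeric vector with a PWM: build the matrix from the sequence's own k-mers, then score every k-mer against it. This is an assumption-laden illustration of the general PWM technique, not the exact PWM2Vec procedure; the function name `pwm_scores`, the uniform background, and the pseudocount are choices made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pwm_scores(seq: str, k: int, alphabet: str = AMINO_ACIDS,
               pseudocount: float = 1.0) -> list[float]:
    """Score each k-mer of seq against a PWM built from all of its k-mers.

    The PWM holds log2 odds of the (pseudocount-smoothed) frequency of each
    residue at each k-mer position versus a uniform background. The result
    has len(seq) - k + 1 entries, so sequences of equal length give vectors
    of equal length. Residues outside `alphabet` raise KeyError.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    background = 1.0 / len(alphabet)
    total = len(kmers) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(k):
        column = Counter(kmer[pos] for kmer in kmers)
        pwm.append({
            residue: math.log2(((column[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    # a k-mer's score is the sum of the weights of its residues, by position
    return [sum(pwm[pos][res] for pos, res in enumerate(kmer)) for kmer in kmers]


features = pwm_scores("MKVLAAGMKV", k=3)
print(len(features), features[0] == features[-1])
```

The toy sequence starts and ends with the same 3-mer (`MKV`), so its first and last scores coincide; identical k-mers always map to identical feature values.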
Affiliation(s)
- Murray Patterson
  - Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA; (S.A.); (B.B.); (P.C.); (R.T.P.); (Y.Z.)