1
|
Campanelli A, Pibiri GE, Fan J, Patro R. Where the patterns are: repetition-aware compression for colored de Bruijn graphs ⋆. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.09.602727. [PMID: 39026859 PMCID: PMC11257547 DOI: 10.1101/2024.07.09.602727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/20/2024]
Abstract
We describe lossless compressed data structures for the colored de Bruijn graph (or, c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k -mers to their color sets . The color set of a k -mer is the set of all identifiers, or colors , of the references that contain the k -mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage on the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once, instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes. Software The implementation of the indexes used for all experiments in this work is written in C++17 and is available at https://github.com/jermp/fulgor .
Collapse
|
2
|
Mustafa H, Karasikov M, Mansouri Ghiasi N, Rätsch G, Kahles A. Label-guided seed-chain-extend alignment on annotated De Bruijn graphs. Bioinformatics 2024; 40:i337-i346. [PMID: 38940164 PMCID: PMC11211850 DOI: 10.1093/bioinformatics/btae226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. AVAILABILITY AND IMPLEMENTATION The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
Collapse
Affiliation(s)
- Harun Mustafa
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
| | - Mikhail Karasikov
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
| | - Nika Mansouri Ghiasi
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, 8092, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- ETH AI Center, Zurich, 8092, Switzerland
- Department of Biology, ETH Zurich, Zurich, 8093, Switzerland
- The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
| |
Collapse
|
3
|
Pebblescout is an easy-to-use tool for fast sequence search in petabase-scale nucleotide resources. Nat Methods 2024; 21:938-939. [PMID: 38755323 DOI: 10.1038/s41592-024-02281-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/18/2024]
|
4
|
Shiryev SA, Agarwala R. Indexing and searching petabase-scale nucleotide resources. Nat Methods 2024; 21:994-1002. [PMID: 38755321 PMCID: PMC11166510 DOI: 10.1038/s41592-024-02280-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2023] [Accepted: 04/08/2024] [Indexed: 05/18/2024]
Abstract
Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees and ranks them using informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov . We show that for a wide range of query lengths, Pebblescout provides a data-driven way for finding relevant subsets of large nucleotide resources, reducing the effort for downstream analysis substantially. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash.
Collapse
Affiliation(s)
- Sergey A Shiryev
- Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Richa Agarwala
- Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
| |
Collapse
|
5
|
Rahman A, Dufresne Y, Medvedev P. Compression algorithm for colored de Bruijn graphs. Algorithms Mol Biol 2024; 19:20. [PMID: 38797858 PMCID: PMC11129398 DOI: 10.1186/s13015-024-00254-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 01/24/2024] [Indexed: 05/29/2024] Open
Abstract
A colored de Bruijn graph (also called a set of k-mer sets), is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. There have been numerous indexing data structures proposed that allow to store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead. The software is available at http://github.com/medvedevgroup/ESSColor .
Collapse
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA.
| | - Yoann Dufresne
- Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
- Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, 75015, France
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
| |
Collapse
|
6
|
Fan J, Khan J, Singh NP, Pibiri GE, Patro R. Fulgor: a fast and compact k-mer index for large-scale matching and color queries. Algorithms Mol Biol 2024; 19:3. [PMID: 38254124 PMCID: PMC10810250 DOI: 10.1186/s13015-024-00251-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Accepted: 01/03/2024] [Indexed: 01/24/2024] Open
Abstract
The problem of sequence identification or matching-determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence-is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto-the strongest competitor in terms of index space vs. query time trade-off-Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6[Formula: see text] faster to construct.
Collapse
Affiliation(s)
- Jason Fan
- Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
| | - Jamshed Khan
- Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
| | - Noor Pratap Singh
- Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
| | | | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, MD, 20742, USA.
| |
Collapse
|
7
|
Pibiri GE, Fan J, Patro R. Meta-colored compacted de Bruijn graphs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.07.21.550101. [PMID: 37546988 PMCID: PMC10401949 DOI: 10.1101/2023.07.21.550101] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/08/2023]
Abstract
MOTIVATION The colored compacted de Bruijn graph (c-dBG) has become a fundamental tool used across several areas of genomics and pangenomics. For example, it has been widely adopted by methods that perform read mapping or alignment, abundance estimation, and subsequent downstream analyses. These applications essentially regard the c-dBG as a map from k-mers to the set of references in which they appear. The c-dBG data structure should retrieve this set -- the color of the k-mer -- efficiently for any given k-mer, while using little memory. To aid retrieval, the colors are stored explicitly in the data structure and take considerable space for large reference collections, even when compressed. Reducing the space of the colors is therefore of utmost importance for large-scale sequence indexing. RESULTS We describe the meta-colored compacted de Bruijn graph (Mac-dBG) -- a new colored de Bruijn graph data structure where colors are represented holistically, i.e., taking into account their redundancy across the whole collection being indexed, rather than individually as atomic integer lists. This allows the factorization and compression of common sub-patterns across colors. While optimizing the space of our data structure is NP-hard, we propose a simple heuristic algorithm that yields practically good solutions. Results show that the Mac-dBG data structure improves substantially over the best previous space/time trade-off, by providing remarkably better compression effectiveness for the same (or better) query efficiency. This improved space/time trade-off is robust across different datasets and query workloads. Code availability: A C++17 implementation of the Mac-dBG is publicly available on GitHub at: https://github.com/jermp/fulgor.
Collapse
|
8
|
Rahman A, Dufresne Y, Medvedev P. Compression Algorithm for Colored de Bruijn Graphs. LIPICS : LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS 2023; 273:17. [PMID: 38712341 PMCID: PMC11071130 DOI: 10.4230/lipics.wabi.2023.17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Figures] [Subscribe] [Scholar Register] [Indexed: 05/08/2024]
Abstract
A colored de Bruijn graph (also called a set of k-mer sets), is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. There have been numerous indexing data structures proposed that allow to store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead.
Collapse
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
| | - Yoann Dufresne
- Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
- Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| |
Collapse
|
9
|
Joudaki A, Meterez A, Mustafa H, Groot Koerkamp R, Kahles A, Rätsch G. Aligning distant sequences to graphs using long seed sketches. Genome Res 2023; 33:1208-1217. [PMID: 37072187 PMCID: PMC10538362 DOI: 10.1101/gr.277659.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2023] [Accepted: 04/16/2023] [Indexed: 04/20/2023]
Abstract
Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a [Formula: see text] mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of [Formula: see text] For such queries, longer sketch-based seeds yield a [Formula: see text] increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.
Collapse
Affiliation(s)
- Amir Joudaki
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- University Hospital Zurich, Biomedical Informatics Research, Zurich 8091, Switzerland
| | - Alexandru Meterez
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
| | - Harun Mustafa
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- University Hospital Zurich, Biomedical Informatics Research, Zurich 8091, Switzerland
- Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
| | | | - André Kahles
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland;
- University Hospital Zurich, Biomedical Informatics Research, Zurich 8091, Switzerland
- Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- University Hospital Zurich, Biomedical Informatics Research, Zurich 8091, Switzerland
- Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- ETH AI Center, 8092 Zurich, Switzerland
| |
Collapse
|