1
|
Karami M, Soltani Mohammadi A, Martin M, Ekim B, Shen W, Guo L, Xu M, Pibiri GE, Patro R, Sahlin K. Designing efficient randstrobes for sequence similarity analyses. Bioinformatics 2024; 40:btae187. [PMID: 38579261 PMCID: PMC11034988 DOI: 10.1093/bioinformatics/btae187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Revised: 02/23/2024] [Accepted: 04/04/2024] [Indexed: 04/07/2024] Open
Abstract
MOTIVATION Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080-94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy. RESULTS In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes. AVAILABILITY AND IMPLEMENTATION All methods and evaluation benchmarks are available in a public Github repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.
Collapse
Affiliation(s)
- Moein Karami
- Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm 106 91, Sweden
| | - Aryan Soltani Mohammadi
- Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm 106 91, Sweden
| | - Marcel Martin
- Department of Biochemistry and Biophysics, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Stockholm University, Solna SE-17121, Sweden
| | - Barış Ekim
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, United States
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
| | - Wei Shen
- Department of Infectious Diseases, Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, The Second Affiliated Hospital of Chongqing Medical University, Chongqing 400010, China
| | | | | | - Giulio Ermanno Pibiri
- Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Venice 30172, Italy
- ISTI-CNR, Pisa 56124, Italy
| | - Rob Patro
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, United States
| | - Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm 106 91, Sweden
| |
Collapse
|
2
|
Ekim B, Berger B, Chikhi R. Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Syst 2021; 12:958-968.e6. [PMID: 34525345 PMCID: PMC8562525 DOI: 10.1016/j.cels.2021.08.009] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2021] [Revised: 08/01/2021] [Accepted: 08/19/2021] [Indexed: 10/20/2022]
Abstract
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. Here, we define an algorithmic approach, mdBG, that makes use of minimizer-space de Bruijn graphs to enable long-read genome assembly. mdBG achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without compromising accuracy. A human genome is assembled in under 10 min using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 min using 1 GB RAM. In addition, we constructed a minimizer-space de Bruijn graph-based representation of 661,405 bacterial genomes, comprising 16 million nodes and 45 million edges, and successfully search it for anti-microbial resistance (AMR) genes in 12 min. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics, and pangenomics. Code for constructing mdBGs is freely available for download at https://github.com/ekimb/rust-mdbg/.
Collapse
Affiliation(s)
- Barış Ekim
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA; Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA; Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA.
| | - Rayan Chikhi
- Department of Computational Biology, Institut Pasteur, Paris 75015, France.
| |
Collapse
|