1
|
Alser M, Eudine J, Mutlu O. Taming large-scale genomic analyses via sparsified genomics. Nat Commun 2025; 16:876. [PMID: 39837860 PMCID: PMC11751491 DOI: 10.1038/s41467-024-55762-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Accepted: 12/20/2024] [Indexed: 01/23/2025] Open
Abstract
Searching for similar genomic sequences is an essential and fundamental step in biomedical research. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where we systematically exclude a large number of bases from genomic sequences and enable faster and memory-efficient processing of the sparsified, shorter genomic sequences, while providing comparable accuracy to processing non-sparsified sequences. Sparsified genomics provides benefits to many genomic analyses and has broad applicability. Sparsifying genomic sequences accelerates the state-of-the-art read mapper (minimap2) by 2.57-5.38x, 1.13-2.78x, and 3.52-6.28x using real Illumina, HiFi, and ONT reads, respectively, while providing comparable memory footprint, 2x smaller index size, and more correctly detected variations compared to minimap2. Sparsifying genomic sequences makes containment search through very large genomes and large databases 72.7-75.88x (1.62-1.9x when indexing is preprocessed) faster and 723.3x more storage-efficient than searching through non-sparsified genomic sequences (with CMash and KMC3). Sparsifying genomic sequences enables robust microbiome discovery by providing 54.15-61.88x (1.58-1.71x when indexing is preprocessed) faster and 720x more storage-efficient taxonomic profiling of metagenomic samples over the state-of-the-art tool (Metalign).
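The core idea in the abstract above, systematically excluding bases before processing, can be illustrated with a minimal sketch. The repeating binary keep/drop pattern and the function name `sparsify` are assumptions made for illustration; the abstract does not specify how bases are selected.

```python
def sparsify(seq: str, pattern: str = "10") -> str:
    """Keep base i only where the repeating pattern holds '1' (hypothetical scheme)."""
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 0s and 1s")
    # The pattern is tiled along the sequence; '0' positions are excluded.
    return "".join(
        base for i, base in enumerate(seq) if pattern[i % len(pattern)] == "1"
    )


read = "ACGTACGTAC"
print(sparsify(read))         # pattern "10": every second base is excluded
print(sparsify(read, "110"))  # pattern "110": every third base is excluded
```

Applying the same pattern to both the reads and the reference keeps the two comparable, and the shorter sequences are what would make indexing and matching cheaper in a scheme of this kind.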
Affiliation(s)
- Mohammed Alser
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
  - Department of Clinical Pharmacy, University of Southern California, LA, CA, USA
- Julien Eudine
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Onur Mutlu
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
2
|
Lu Y, Li M, Gao Z, Ma H, Chong Y, Hong J, Wu J, Wu D, Xi D, Deng W. Advances in Whole Genome Sequencing: Methods, Tools, and Applications in Population Genomics. Int J Mol Sci 2025; 26:372. [PMID: 39796227 PMCID: PMC11719799 DOI: 10.3390/ijms26010372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2024] [Revised: 12/26/2024] [Accepted: 01/02/2025] [Indexed: 01/13/2025] Open
Abstract
With the rapid advancement of high-throughput sequencing technologies, whole genome sequencing (WGS) has emerged as a crucial tool for studying genetic variation and population structure. Utilizing population genomics tools to analyze resequencing data allows for the effective integration of selection signals with population history, precise estimation of effective population size, historical population trends, and structural insights, along with the identification of specific genetic loci and variations. This paper reviews current whole genome sequencing technologies, detailing primary research methods, relevant software, and their advantages and limitations within population genomics. The goal is to examine the application and progress of resequencing technologies in this field and to consider future developments, including deep learning models and machine learning algorithms, which promise to enhance analytical methodologies and drive further advancements in population genomics.
Affiliation(s)
- Ying Lu
  - Yunnan Provincial Key Laboratory of Animal Nutrition and Feed, Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Mengfei Li
  - Yunnan Provincial Key Laboratory of Animal Nutrition and Feed, Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Zhendong Gao
  - Yunnan Provincial Key Laboratory of Animal Nutrition and Feed, Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Hongming Ma
  - Yunnan Provincial Key Laboratory of Animal Nutrition and Feed, Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Yuqing Chong
  - Yunnan Provincial Key Laboratory of Animal Nutrition and Feed, Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Jieyun Hong
  - Yunnan Provincial Key Laboratory of Animal Nutrition and Feed, Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Jiao Wu
  - Yunnan Provincial Key Laboratory of Animal Nutrition and Feed, Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Dongwang Wu
  - Yunnan Provincial Key Laboratory of Animal Nutrition and Feed, Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Dongmei Xi
  - Yunnan Provincial Key Laboratory of Animal Nutrition and Feed, Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
- Weidong Deng
  - Yunnan Provincial Key Laboratory of Animal Nutrition and Feed, Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming 650201, China
  - State Key Laboratory for Conservation and Utilization of Bio-Resource in Yunnan, Kunming 650201, China
3
|
Alonso-Marín A, Fernandez I, Aguado-Puig Q, Gómez-Luna J, Marco-Sola S, Mutlu O, Moreto M. BIMSA: accelerating long sequence alignment using processing-in-memory. Bioinformatics 2024; 40:btae631. [PMID: 39432682 DOI: 10.1093/bioinformatics/btae631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2024] [Revised: 10/02/2024] [Accepted: 10/17/2024] [Indexed: 10/23/2024]
Abstract
MOTIVATION Recent advances in sequencing technologies have stressed the critical role of sequence analysis algorithms and tools in genomics and healthcare research. In particular, sequence alignment is a fundamental building block in many sequence analysis pipelines and is frequently a performance bottleneck both in terms of execution time and memory usage. Classical sequence alignment algorithms are based on dynamic programming and often require quadratic time and memory with respect to the sequence length. As a result, classical sequence alignment algorithms fail to scale with increasing sequence lengths and quickly become memory-bound due to data-movement penalties. RESULTS Processing-In-Memory (PIM) is an emerging architectural paradigm that seeks to accelerate memory-bound algorithms by bringing computation closer to the data to mitigate data-movement penalties. This work presents BIMSA (Bidirectional In-Memory Sequence Alignment), a PIM design and implementation for the state-of-the-art sequence alignment algorithm BiWFA (Bidirectional Wavefront Alignment), incorporating new hardware-aware optimizations for a production-ready PIM architecture (UPMEM). BIMSA supports aligning sequences up to 100K bases, exceeding the limitations of state-of-the-art PIM implementations. First, BIMSA achieves speedups up to 22.24× (11.95× on average) compared to state-of-the-art PIM-enabled implementations of sequence alignment algorithms. Second, BIMSA achieves speedups up to 5.84× (2.83× on average) compared to the highest-performance multicore CPU implementation of BiWFA. Third, BIMSA exhibits linear scalability with the number of compute units in memory, enabling further performance improvements with upcoming PIM architectures equipped with more compute units and achieving speedups up to 9.56× (4.7× on average). AVAILABILITY AND IMPLEMENTATION Code and documentation are publicly available at https://github.com/AlejandroAMarin/BIMSA.
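To make the quadratic cost mentioned in the abstract concrete, the following is a minimal sketch of a classical dynamic-programming alignment: a unit-cost edit distance that fills an (n+1) × (m+1) table. It illustrates the baseline that scales poorly with sequence length; it is not BiWFA or BIMSA, and the function name is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Classical DP edit distance; time and memory are both O(len(a) * len(b))."""
    n, m = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all i characters of a[:i]
    for j in range(m + 1):
        dp[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[n][m]


print(edit_distance("GATTACA", "GATCACA"))
```

For two sequences of 100K bases, a table like this would hold roughly 10^10 cells, which is why memory traffic rather than arithmetic tends to dominate.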
Affiliation(s)
- Alejandro Alonso-Marín
  - Department of Computer Sciences, Barcelona Supercomputing Center, Barcelona 08034, Spain
  - Department of Computer Science, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
  - Department of Electronic Engineering, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
- Ivan Fernandez
  - Department of Computer Sciences, Barcelona Supercomputing Center, Barcelona 08034, Spain
  - Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
- Quim Aguado-Puig
  - Department of Computer Sciences, Barcelona Supercomputing Center, Barcelona 08034, Spain
  - Department of Electronic Engineering, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
  - Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
- Juan Gómez-Luna
  - NVIDIA Switzerland, NVIDIA Research, Zurich 8004, Switzerland
- Santiago Marco-Sola
  - Department of Computer Sciences, Barcelona Supercomputing Center, Barcelona 08034, Spain
  - Department of Computer Science, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
- Onur Mutlu
  - Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
- Miquel Moreto
  - Department of Computer Sciences, Barcelona Supercomputing Center, Barcelona 08034, Spain
  - Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
4
|
Cavlak MB, Singh G, Alser M, Firtina C, Lindegger J, Sadrosadati M, Mansouri Ghiasi N, Alkan C, Mutlu O. TargetCall: eliminating the wasted computation in basecalling via pre-basecalling filtering. Front Genet 2024; 15:1429306. [PMID: 39529848 PMCID: PMC11551021 DOI: 10.3389/fgene.2024.1429306] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Accepted: 09/30/2024] [Indexed: 11/16/2024] Open
Abstract
Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, that is, reads. State-of-the-art basecallers use complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, most reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads, and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall (1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31× while maintaining high (98.88%) recall in keeping on-target reads, (2) maintains high accuracy in downstream analysis, and (3) achieves better runtime performance, throughput, recall, precision, and generality than prior works. TargetCall is available at https://github.com/CMU-SAFARI/TargetCall.
Affiliation(s)
- Meryem Banu Cavlak
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Gagandeep Singh
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Mohammed Alser
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Can Firtina
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Joël Lindegger
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Mohammad Sadrosadati
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Nika Mansouri Ghiasi
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Can Alkan
  - Department of Computer Engineering, Bilkent University, Ankara, Türkiye
- Onur Mutlu
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
5
|
Shivakumar VS, Ahmed OY, Kovaka S, Zakeri M, Langmead B. Sigmoni: classification of nanopore signal with a compressed pangenome index. Bioinformatics 2024; 40:i287-i296. [PMID: 38940135 PMCID: PMC11211819 DOI: 10.1093/bioinformatics/btae213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
SUMMARY Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling. But past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of Gbps. Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics, all in linear query time without the need for seed-chain-extend. Sigmoni is 10-100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes. Sigmoni is the first signal-based tool to scale to a complete human genome and pangenome while remaining fast enough for adaptive sampling applications. AVAILABILITY AND IMPLEMENTATION Sigmoni is implemented in Python, and is available open-source at https://github.com/vshiv18/sigmoni.
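The quantization step described above, mapping raw current into a discrete alphabet of picoamp ranges, can be sketched as follows. The equal-width bins, the 60-120 pA range, the six-letter alphabet, and the function name are all assumptions for illustration; the abstract does not state how Sigmoni chooses its ranges.

```python
from bisect import bisect_right


def quantize(signal_pa, lo=60.0, hi=120.0, n_bins=6):
    """Map each current sample (in pA) to a letter naming one of n_bins ranges.

    Bins are equal-width over [lo, hi); samples outside the range fall into
    the first or last bin. All parameters are hypothetical.
    """
    width = (hi - lo) / n_bins
    edges = [lo + width * k for k in range(1, n_bins)]  # interior bin edges
    return "".join(chr(ord("a") + bisect_right(edges, x)) for x in signal_pa)


samples = [65.0, 70.0, 95.0, 130.0, 50.0]
print(quantize(samples))
```

Once the signal is a string over a small alphabet, it can be queried against a text index in the same way a nucleotide sequence would be, which is what allows matching without basecalling.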
Affiliation(s)
- Vikram S Shivakumar
  - Department of Computer Science, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States
- Omar Y Ahmed
  - Department of Computer Science, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States
- Sam Kovaka
  - Department of Computer Science, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States
- Mohsen Zakeri
  - Department of Computer Science, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States
6
|
Singh G, Alser M, Denolf K, Firtina C, Khodamoradi A, Cavlak MB, Corporaal H, Mutlu O. RUBICON: a framework for designing efficient deep learning-based genomic basecallers. Genome Biol 2024; 25:49. [PMID: 38365730 PMCID: PMC10870431 DOI: 10.1186/s13059-024-03181-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Accepted: 02/02/2024] [Indexed: 02/18/2024] Open
Abstract
Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The performance of basecalling has critical implications for all later steps in genome analysis. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. We present RUBICON, a framework to develop efficient hardware-optimized basecallers. We demonstrate the effectiveness of RUBICON by developing RUBICALL, the first hardware-optimized mixed-precision basecaller that performs efficient basecalling, outperforming the state-of-the-art basecallers. We believe RUBICON offers a promising path to develop future hardware-optimized basecallers.
Affiliation(s)
- Gagandeep Singh
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland
  - Research and Advanced Development, AMD, Longmont, USA
- Mohammed Alser
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland
- Can Firtina
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland
- Meryem Banu Cavlak
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland
- Henk Corporaal
  - Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
- Onur Mutlu
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland
7
|
Rádai Z, Váradi A, Takács P, Nagy NA, Schmitt N, Prépost E, Kardos G, Laczkó L. An overlooked phenomenon: complex interactions of potential error sources on the quality of bacterial de novo genome assemblies. BMC Genomics 2024; 25:45. [PMID: 38195441 PMCID: PMC10777565 DOI: 10.1186/s12864-023-09910-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 12/15/2023] [Indexed: 01/11/2024] Open
Abstract
BACKGROUND Parameters adversely affecting the contiguity and accuracy of the assemblies from Illumina next-generation sequencing (NGS) are well described. However, past studies generally focused on their additive effects, overlooking their potential interactions possibly exacerbating one another's effects in a multiplicative manner. To investigate whether or not they act interactively on de novo genome assembly quality, we simulated sequencing data for 13 bacterial reference genomes, with varying levels of error rate, sequencing depth, PCR and optical duplicate ratios. RESULTS We assessed the quality of assemblies from the simulated sequencing data with a number of contiguity and accuracy metrics, which we used to quantify both additive and multiplicative effects of the four parameters. We found that the tested parameters are engaged in complex interactions, exerting multiplicative, rather than additive, effects on assembly quality. Also, the ratio of non-repeated regions and GC% of the original genomes can shape how the four parameters affect assembly quality. CONCLUSIONS We provide a framework for consideration in future studies using de novo genome assembly of bacterial genomes, e.g. in choosing the optimal sequencing depth, balancing between its positive effect on contiguity and negative effect on accuracy due to its interaction with error rate. Furthermore, the properties of the genomes to be sequenced also should be taken into account, as they might influence the effects of error sources themselves.
Affiliation(s)
- Zoltán Rádai
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Dermatology, University Hospital Düsseldorf, Heinrich-Heine-University, Düsseldorf, Germany
- Alex Váradi
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Laboratory Medicine, Medical School, University of Pécs, Pécs, Hungary
- Péter Takács
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Health Informatics, Institute of Health Sciences, Faculty of Health, University of Debrecen, Debrecen, Hungary
- Nikoletta Andrea Nagy
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Evolutionary Zoology, ELKH-DE Behavioural Ecology Research Group, University of Debrecen, Debrecen, Hungary
  - Department of Evolutionary Zoology and Human Biology, University of Debrecen, Debrecen, Hungary
- Nicholas Schmitt
  - Department of Dermatology, University Hospital Düsseldorf, Heinrich-Heine-University, Düsseldorf, Germany
- Eszter Prépost
  - Department of Health Industry, University of Debrecen, Debrecen, Hungary
- Gábor Kardos
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - Department of Gerontology, Faculty of Health Sciences, University of Debrecen, Debrecen, Hungary
- Levente Laczkó
  - Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
  - ELKH-DE Conservation Biology Research Group, Debrecen, Hungary
8
|
Canessa E. Physics-Based Signal Analysis of Genome Sequences: An Overview of GenomeBits. Microorganisms 2023; 11:2733. [PMID: 38004745 PMCID: PMC10673239 DOI: 10.3390/microorganisms11112733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 10/28/2023] [Accepted: 11/07/2023] [Indexed: 11/26/2023] Open
Abstract
A comprehensive overview of the recent physics-inspired genome analysis tool, GenomeBits, is presented. This is based on traditional signal processing methods such as discrete Fourier transform (DFT). GenomeBits can be used to extract underlying genomics features from the distribution of nucleotides, and can be further used to analyze the mutation patterns in viral genomes. Examples of the main GenomeBits findings outlining the intrinsic signal organization of genomics sequences for different SARS-CoV-2 variants along the pandemic years 2020-2022 and Monkeypox cases in 2021 are presented to show the usefulness of GenomeBits. GenomeBits results for DFT of SARS-CoV-2 genomes in different geographical regions are discussed, together with the GenomeBits analysis of complete genome sequences for the first coronavirus variants reported: Alpha, Beta, Gamma, Epsilon and Eta. Interesting features of the Delta and Omicron variants in the form of a unique 'order-disorder' transition are uncovered from these samples, as well as from their cumulative distribution function and scatter plots. This class of transitions might reveal the cumulative outcome of mutations on the spike protein. A salient feature of GenomeBits is the mapping of the nucleotide bases (A,T,C,G) into an alternating spin-like numerical sequence via a series having binary (0,1) indicators for each A,T,C,G. This leads to the derivation of a set of statistical distribution curves. Furthermore, the quantum-based extension of the GenomeBits model to an analogous probability measure is shown to identify properties of genome sequences as wavefunctions via a superposition of states. An association of the integral of the GenomeBits coding and a binding-like energy can, in principle, also be established. The relevance of these different results in bioinformatics is analyzed.
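The mapping described above, from nucleotides to an alternating spin-like sequence built from binary (0,1) indicators, can be sketched as a cumulative alternating sum. The exact form used here, a running sum of (-1)^(j-1) · X_j with X_j = 1 when position j holds the chosen base, is an assumption for illustration; the abstract does not give the formula.

```python
from itertools import accumulate


def genomebits_series(seq: str, base: str) -> list[int]:
    """Running sum of (-1)**(j-1) * X_j over 1-indexed positions j.

    X_j is 1 when seq[j-1] == base and 0 otherwise. This form is assumed
    here, not quoted from the source.
    """
    terms = (
        (1 if i % 2 == 0 else -1) * (1 if b == base else 0)  # i is 0-indexed
        for i, b in enumerate(seq)
    )
    return list(accumulate(terms))


seq = "AATGA"
for base in "ATCG":
    print(base, genomebits_series(seq, base))
```

Each of the four bases yields its own integer series, and those series are the kind of numerical signal that tools such as a discrete Fourier transform can then be applied to.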
Affiliation(s)
- Enrique Canessa
  - The Abdus Salam International Centre for Theoretical Physics (ICTP), 34151 Trieste, Italy
9
|
Shivakumar VS, Ahmed OY, Kovaka S, Zakeri M, Langmead B. Sigmoni: classification of nanopore signal with a compressed pangenome index. bioRxiv 2023:2023.08.15.553308. [PMID: 37645873 PMCID: PMC10462034 DOI: 10.1101/2023.08.15.553308] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 08/31/2023]
Abstract
Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling. But past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of Gbps. Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics. Sigmoni is 10-100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes.
Affiliation(s)
- Omar Y. Ahmed
  - Department of Computer Science, Johns Hopkins University
- Sam Kovaka
  - Department of Computer Science, Johns Hopkins University
- Mohsen Zakeri
  - Department of Computer Science, Johns Hopkins University
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University
10
|
Diab S, Nassereldine A, Alser M, Gómez Luna J, Mutlu O, El Hajj I. A framework for high-throughput sequence alignment using real processing-in-memory systems. Bioinformatics 2023; 39:btad155. [PMID: 36971586 PMCID: PMC10159653 DOI: 10.1093/bioinformatics/btad155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2022] [Revised: 02/24/2023] [Accepted: 03/25/2023] [Indexed: 03/29/2023] Open
Abstract
MOTIVATION Sequence alignment is a memory bound computation whose performance in modern systems is limited by the memory bandwidth bottleneck. Processing-in-memory (PIM) architectures alleviate this bottleneck by providing the memory with computing competencies. We propose Alignment-in-Memory (AIM), a framework for high-throughput sequence alignment using PIM, and evaluate it on UPMEM, the first publicly available general-purpose programmable PIM system. RESULTS Our evaluation shows that a real PIM system can substantially outperform server-grade multi-threaded CPU systems running at full-scale when performing sequence alignment for a variety of algorithms, read lengths, and edit distance thresholds. We hope that our findings inspire more work on creating and accelerating bioinformatics algorithms for such real PIM systems. AVAILABILITY AND IMPLEMENTATION Our code is available at https://github.com/safaad/aim.
Affiliation(s)
- Safaa Diab
  - Department of Computer Science, American University of Beirut, Riad El-Solh, Beirut 1107 2020, Lebanon
- Amir Nassereldine
  - Department of Computer Science, American University of Beirut, Riad El-Solh, Beirut 1107 2020, Lebanon
- Mohammed Alser
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Gloriastrasse 35, Zürich 8092, Switzerland
- Juan Gómez Luna
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Gloriastrasse 35, Zürich 8092, Switzerland
- Onur Mutlu
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Gloriastrasse 35, Zürich 8092, Switzerland
- Izzat El Hajj
  - Department of Computer Science, American University of Beirut, Riad El-Solh, Beirut 1107 2020, Lebanon
11
|
Olson ND, Wagner J, Dwarshuis N, Miga KH, Sedlazeck FJ, Salit M, Zook JM. Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet 2023:10.1038/s41576-023-00590-0. [PMID: 37059810 DOI: 10.1038/s41576-023-00590-0] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Accepted: 02/22/2023] [Indexed: 04/16/2023]
Abstract
Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations. Finally, we explore the possible future of more complete characterization of human genome variation in light of the recent completion of a telomere-to-telomere human genome reference assembly and human pangenomes, and we consider the innovations needed to benchmark their newly accessible repetitive regions and complex variants.
Affiliation(s)
- Nathan D Olson
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Justin Wagner
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Nathan Dwarshuis
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Karen H Miga
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Fritz J Sedlazeck
  - Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, USA
- Justin M Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
12
|
Firtina C, Park J, Alser M, Kim JS, Cali D, Shahroodi T, Ghiasi N, Singh G, Kanellopoulos K, Alkan C, Mutlu O. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023; 5:lqad004. [PMID: 36685727 PMCID: PMC9853099 DOI: 10.1093/nargab/lqad004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 12/16/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023] Open
Abstract
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds, as conventional hashing methods assign distinct hash values to different seeds, including highly similar seeds. Finding only exact-matching seeds either (i) increases the use of costly sequence alignment or (ii) limits sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds, called fuzzy seed matches, with a single lookup of their hash values. BLEND (i) utilizes a technique called SimHash, which can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher-quality overlaps, leading to accurate de novo assemblies, than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
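The SimHash idea mentioned in the abstract, treating a seed as a set of items and producing a hash in which similar sets tend to agree, can be sketched as below. This is a generic SimHash over the k-mers of a seed, not BLEND's implementation. The helper names, the 32-bit width, the choice of k, and the use of BLAKE2b as the per-item hash are all assumptions for illustration.

```python
import hashlib


def kmers(seed: str, k: int) -> set[str]:
    """Treat a seed as the set of its overlapping k-mers."""
    return {seed[i:i + k] for i in range(len(seed) - k + 1)}


def simhash(items: set[str], bits: int = 32) -> int:
    """Generic SimHash: each item votes +1/-1 per bit; the sign gives the bit."""
    counts = [0] * bits
    for item in items:
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


ref = simhash(kmers("ACGTTGCATGCCGATAGCTA", 5))
near = simhash(kmers("ACGTTGCATGACGATAGCTA", 5))  # one substitution
far = simhash(kmers("TTTTGGGGCCCCAAAATTTT", 5))
print(hamming(ref, near), hamming(ref, far))
```

Because every item contributes to every bit by majority vote, changing a few items flips only the bits whose vote was close, so similar sets tend to produce hash values that differ in few bits or in none.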
Affiliation(s)
- Jisung Park
  - ETH Zurich, Zurich 8092, Switzerland
  - POSTECH, Pohang 37673, Republic of Korea
- Can Alkan
  - Bilkent University, Ankara 06800, Turkey