1
|
Firtina C, Soysal M, Lindegger J, Mutlu O. RawHash2: mapping raw nanopore signals using hash-based seeding and adaptive quantization. Bioinformatics 2024; 40:btae478. [PMID: 39078113 PMCID: PMC11333567 DOI: 10.1093/bioinformatics/btae478] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 07/04/2024] [Accepted: 07/29/2024] [Indexed: 07/31/2024] Open
Abstract
SUMMARY Raw nanopore signals can be analyzed while they are being generated, a process known as real-time analysis. Real-time analysis of raw signals is essential to utilize the unique features that nanopore sequencing provides, enabling the early stopping of the sequencing of a read or the entire sequencing run based on the analysis. The state-of-the-art mechanism, RawHash, offers the first hash-based efficient and accurate similarity identification between raw signals and a reference genome by quickly matching their hash values. In this work, we introduce RawHash2, which provides major improvements over RawHash, including more sensitive quantization and chaining algorithms, weighted mapping decisions, frequency filters to reduce ambiguous seed hits, minimizers for hash-based sketching, and support for the R10.4 flow cell version and POD5 and SLOW5 file formats. Compared to RawHash, RawHash2 provides better F1 accuracy (on average by 10.57% and up to 20.25%) and better throughput (on average by 4.0× and up to 9.9×). AVAILABILITY AND IMPLEMENTATION RawHash2 is available at https://github.com/CMU-SAFARI/RawHash. We also provide the scripts to fully reproduce our results on our GitHub page.
Affiliation(s)
- Can Firtina
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8092, Switzerland
| | - Melina Soysal
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8092, Switzerland
| | - Joël Lindegger
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8092, Switzerland
| | - Onur Mutlu
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8092, Switzerland
| |
|
2
|
Gamaarachchi H, Ferguson JM, Samarakoon H, Liyanage K, Deveson IW. Simulation of nanopore sequencing signal data with tunable parameters. Genome Res 2024; 34:778-783. [PMID: 38692839 PMCID: PMC11216307 DOI: 10.1101/gr.278730.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 04/24/2024] [Indexed: 05/03/2024]
Abstract
In silico simulation of high-throughput sequencing data is a technique used widely in the genomics field. However, there is currently a lack of effective tools for creating simulated data from nanopore sequencing devices, which measure DNA or RNA molecules in the form of time-series current signal data. Here, we introduce Squigulator, a fast and simple tool for simulation of realistic nanopore signal data. Squigulator takes a reference genome, a transcriptome, or read sequences, and generates corresponding raw nanopore signal data. This is compatible with basecalling software from Oxford Nanopore Technologies (ONT) and other third-party tools, thereby providing a useful substrate for development, testing, debugging, validation, and optimization at every stage of a nanopore analysis workflow. The user may generate data with preset parameters emulating specific ONT protocols or noise-free "ideal" data, or they may deterministically modify a range of experimental variables and/or noise parameters to shape the data to their needs. We present a brief example of Squigulator's use, creating simulated data to model the degree to which different parameters impact the accuracy of ONT basecalling and downstream variant detection. This analysis reveals new insights into the nature of ONT data and basecalling algorithms. We provide Squigulator as an open-source tool for the nanopore community.
Affiliation(s)
- Hasindu Gamaarachchi
- School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
| | - James M Ferguson
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
| | - Hiruna Samarakoon
- School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
| | - Kisaru Liyanage
- School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
| | - Ira W Deveson
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
- St Vincent's Clinical School, Faculty of Medicine, University of New South Wales, Sydney, New South Wales 2052, Australia
| |
|
3
|
Wong B, Ferguson JM, Do JY, Gamaarachchi H, Deveson IW. Streamlining remote nanopore data access with slow5curl. Gigascience 2024; 13:giae016. [PMID: 38608279 PMCID: PMC11010652 DOI: 10.1093/gigascience/giae016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2023] [Revised: 03/03/2024] [Accepted: 03/18/2024] [Indexed: 04/14/2024] Open
Abstract
BACKGROUND As adoption of nanopore sequencing technology continues to advance, the need to maintain large volumes of raw current signal data for reanalysis with updated algorithms is a growing challenge. Here we introduce slow5curl, a software package designed to streamline nanopore data sharing, accessibility, and reanalysis. RESULTS Slow5curl allows a user to fetch a specified read or group of reads from a raw nanopore dataset stored on a remote server, such as a public data repository, without downloading the entire file. Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelized data access requests to maximize download speeds. Using all public nanopore data from the Human Pangenome Reference Consortium (>22 TB), we demonstrate how slow5curl can be used to quickly fetch and reanalyze raw signal reads corresponding to a set of target genes from each individual in a large cohort dataset (n = 91), minimizing the time, egress costs, and local storage requirements for their reanalysis. CONCLUSIONS We provide slow5curl as a free, open-source package that will reduce frictions in data sharing for the nanopore community: https://github.com/BonsonW/slow5curl.
Affiliation(s)
- Bonson Wong
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children’s Research Institute, Sydney, NSW 2010, Australia
- School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
| | - James M Ferguson
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children’s Research Institute, Sydney, NSW 2010, Australia
| | - Jessica Y Do
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children’s Research Institute, Sydney, NSW 2010, Australia
- School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
| | - Hasindu Gamaarachchi
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children’s Research Institute, Sydney, NSW 2010, Australia
- School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
| | - Ira W Deveson
- Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children’s Research Institute, Sydney, NSW 2010, Australia
- St Vincent’s Clinical School, Faculty of Medicine, University of New South Wales, Sydney, NSW 2052, Australia
| |
|
4
|
Lin Y, Zhang Y, Sun H, Jiang H, Zhao X, Teng X, Lin J, Shu B, Sun H, Liao Y, Zhou J. NanoDeep: a deep learning framework for nanopore adaptive sampling on microbial sequencing. Brief Bioinform 2023; 25:bbad499. [PMID: 38189540 PMCID: PMC10772945 DOI: 10.1093/bib/bbad499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2023] [Revised: 11/21/2023] [Accepted: 12/11/2023] [Indexed: 01/09/2024] Open
Abstract
Nanopore sequencers can enrich or deplete the targeted DNA molecules in a library by reversing the voltage across individual nanopores. However, it requires substantial computational resources to achieve rapid operations in parallel at read-time sequencing. We present a deep learning framework, NanoDeep, to overcome these limitations by incorporating a convolutional neural network and squeeze-and-excitation. We first showed that the raw squiggle derived from native DNA sequences determines the origin of microbial and human genomes. Then, we demonstrated that NanoDeep successfully classified bacterial reads from a pooled library containing human sequence and showed enrichment for bacterial sequence compared with the routine nanopore sequencing setting. Further, we showed that NanoDeep improves the sequencing efficiency and preserves the fidelity of bacterial genomes in the mock sample. In addition, NanoDeep performs well in the enrichment of metagenome sequences of gut samples, showing its potential applications in the enrichment of unknown microbiota. Our toolkit is available at https://github.com/lysovosyl/NanoDeep.
Affiliation(s)
- Yusen Lin
- Dermatology Hospital, Southern Medical University, Guangzhou, China
| | - Yongjun Zhang
- Dermatology Hospital, Southern Medical University, Guangzhou, China
| | - Hang Sun
- Dermatology Hospital, Southern Medical University, Guangzhou, China
| | - Hang Jiang
- Dermatology Hospital, Southern Medical University, Guangzhou, China
| | - Xing Zhao
- Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, New Territories, Hong Kong SAR, China
- Department of Chemical Pathology, The Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, New Territories, Hong Kong SAR, China
| | - Xiaojuan Teng
- Dermatology Hospital, Southern Medical University, Guangzhou, China
| | - Jingxia Lin
- Dermatology Hospital, Southern Medical University, Guangzhou, China
| | - Bowen Shu
- Dermatology Hospital, Southern Medical University, Guangzhou, China
| | - Hao Sun
- Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, New Territories, Hong Kong SAR, China
- Department of Chemical Pathology, The Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, New Territories, Hong Kong SAR, China
| | - Yuhui Liao
- Dermatology Hospital, Southern Medical University, Guangzhou, China
| | - Jiajian Zhou
- Dermatology Hospital, Southern Medical University, Guangzhou, China
| |
|
5
|
Liyanage K, Samarakoon H, Parameswaran S, Gamaarachchi H. Efficient end-to-end long-read sequence mapping using minimap2-fpga integrated with hardware accelerated chaining. Sci Rep 2023; 13:20174. [PMID: 37978244 PMCID: PMC10656460 DOI: 10.1038/s41598-023-47354-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Accepted: 11/12/2023] [Indexed: 11/19/2023] Open
Abstract
minimap2 is the gold-standard software for reference-based sequence mapping in third-generation long-read sequencing. While minimap2 is relatively fast, further speedup is desirable, especially when processing a multitude of large datasets. In this work, we present minimap2-fpga, a hardware-accelerated version of minimap2 that speeds up the mapping process by integrating an FPGA kernel optimised for chaining. Integrating the FPGA kernel into minimap2 posed significant challenges that we solved by accurately predicting the processing time on hardware while considering data transfer overheads, mitigating hardware scheduling overheads in a multi-threaded environment, and optimizing memory management for processing large realistic datasets. We demonstrate speed-ups in end-to-end run-time for data from both Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio). minimap2-fpga is up to 79% and 53% faster than minimap2 for [Formula: see text] ONT and [Formula: see text] PacBio datasets respectively, when mapping without base-level alignment. When mapping with base-level alignment, minimap2-fpga is up to 62% and 10% faster than minimap2 for [Formula: see text] ONT and [Formula: see text] PacBio datasets respectively. The accuracy is near-identical to that of original minimap2 for both ONT and PacBio data, when mapping both with and without base-level alignment. minimap2-fpga is supported on Intel FPGA-based systems (evaluations performed on an on-premise system) and Xilinx FPGA-based systems (evaluations performed on a cloud system). We also provide a well-documented library for the FPGA-accelerated chaining kernel to be used by future researchers developing sequence alignment software with limited hardware background.
Affiliation(s)
- Kisaru Liyanage
- School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia
- Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Melbourne, Australia
| | - Hiruna Samarakoon
- School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia
- Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Melbourne, Australia
| | - Sri Parameswaran
- School of Electrical and Information Engineering, University of Sydney, Sydney, NSW, Australia
| | - Hasindu Gamaarachchi
- School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia
- Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Melbourne, Australia
| |
|
6
|
Samarakoon H, Ferguson JM, Gamaarachchi H, Deveson IW. Accelerated nanopore basecalling with SLOW5 data format. Bioinformatics 2023; 39:btad352. [PMID: 37252813 PMCID: PMC10261880 DOI: 10.1093/bioinformatics/btad352] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/28/2023] [Revised: 04/12/2023] [Accepted: 05/29/2023] [Indexed: 06/01/2023] Open
Abstract
MOTIVATION Nanopore sequencing is emerging as a key pillar in the genomic technology landscape but computational constraints limiting its scalability remain to be overcome. The translation of raw current signal data into DNA or RNA sequence reads, known as 'basecalling', is a major friction in any nanopore sequencing workflow. Here, we exploit the advantages of the recently developed signal data format 'SLOW5' to streamline and accelerate nanopore basecalling on high-performance computing (HPC) and cloud environments. RESULTS SLOW5 permits highly efficient sequential data access, eliminating a potential analysis bottleneck. To take advantage of this, we introduce Buttery-eel, an open-source wrapper for Oxford Nanopore's Guppy basecaller that enables SLOW5 data access, resulting in performance improvements that are essential for scalable, affordable basecalling. AVAILABILITY AND IMPLEMENTATION Buttery-eel is available at https://github.com/Psy-Fer/buttery-eel.
Affiliation(s)
- Hiruna Samarakoon
- Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children’s Research Institute, Australia
- School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
| | - James M Ferguson
- Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children’s Research Institute, Australia
| | - Hasindu Gamaarachchi
- Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children’s Research Institute, Australia
- School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
| | - Ira W Deveson
- Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children’s Research Institute, Australia
- Faculty of Medicine, University of New South Wales, Sydney, NSW 2052, Australia
| |
|
7
|
Shih PJ, Saadat H, Parameswaran S, Gamaarachchi H. Efficient real-time selective genome sequencing on resource-constrained devices. Gigascience 2023; 12:giad046. [PMID: 37395631 PMCID: PMC10316692 DOI: 10.1093/gigascience/giad046] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/19/2022] [Revised: 04/11/2023] [Accepted: 06/02/2023] [Indexed: 07/04/2023] Open
Abstract
BACKGROUND Third-generation nanopore sequencers offer selective sequencing or "Read Until" that allows genomic reads to be analyzed in real time and abandoned halfway if not belonging to a genomic region of "interest." This selective sequencing opens the door to important applications such as rapid and low-cost genetic tests. The latency in analyzing should be as low as possible for selective sequencing to be effective so that unnecessary reads can be rejected as early as possible. However, existing methods that employ a subsequence dynamic time warping (sDTW) algorithm for this problem are so computationally intensive that a massive workstation with dozens of CPU cores still struggles to keep up with the data rate of a mobile phone-sized MinION sequencer. RESULTS In this article, we present Hardware Accelerated Read Until (HARU), a resource-efficient hardware-software codesign-based method that exploits a low-cost and portable heterogeneous multiprocessor system-on-chip platform with on-chip field-programmable gate arrays (FPGA) to accelerate the sDTW-based Read Until algorithm. Experimental results show that HARU on a Xilinx FPGA embedded with a 4-core ARM processor is around 2.5× faster than a highly optimized multithreaded software version (around 85× faster than the existing unoptimized multithreaded software) running on a sophisticated server with a 36-core Intel Xeon processor for a SARS-CoV-2 dataset. The energy consumption of HARU is 2 orders of magnitude lower than the same application executing on the 36-core server. CONCLUSIONS HARU demonstrates that nanopore selective sequencing is possible on resource-constrained devices through rigorous hardware-software optimizations. The source code for the HARU sDTW module is available as open source at https://github.com/beebdev/HARU, and an example application that uses HARU is at https://github.com/beebdev/sigfish-haru.
Affiliation(s)
- Po Jui Shih
- School of Computer Science and Engineering, UNSW Sydney, Sydney, NSW 2052, Australia
| | - Hassaan Saadat
- School of Electrical Engineering and Telecommunications, UNSW Sydney, Sydney, NSW 2052, Australia
| | - Sri Parameswaran
- School of Electrical and Information Engineering, University of Sydney, Sydney, NSW 2006, Australia
| | - Hasindu Gamaarachchi
- School of Computer Science and Engineering, UNSW Sydney, Sydney, NSW 2052, Australia
- Genomics Pillar, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
- Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children’s Research Institute, Sydney 2010, Australia
| |
|