1
|
Yan L, Yin Z, Zhang H, Zhao Z, Wang M, Müller A, Kallenborn F, Wichmann A, Wei Y, Niu B, Schmidt B, Liu W. RabbitQCPlus 2.0: More efficient and versatile quality control for sequencing data. Methods 2023; 216:39-50. [PMID: 37330158 DOI: 10.1016/j.ymeth.2023.06.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2023] [Revised: 05/26/2023] [Accepted: 06/12/2023] [Indexed: 06/19/2023] Open
Abstract
Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster when performing basic quality control operations compared to state-of-the-art applications yet requires fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files and 1.3 times faster with the error correction module turned on. Furthermore, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes on a 48-core server when enabling the per-read over-representation analysis. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.
Collapse
Affiliation(s)
- Lifeng Yan
- School of Software, Shandong University, Jinan, China
| | - Zekun Yin
- School of Software, Shandong University, Jinan, China.
| | - Hao Zhang
- School of Software, Shandong University, Jinan, China
| | - Zhan Zhao
- School of Software, Shandong University, Jinan, China
| | - Mingkai Wang
- School of Software, Shandong University, Jinan, China
| | - André Müller
- Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
| | - Felix Kallenborn
- Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
| | - Alexander Wichmann
- Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
| | - Yanjie Wei
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Beifang Niu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Bertil Schmidt
- Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
| | - Weiguo Liu
- School of Software, Shandong University, Jinan, China
| |
Collapse
|
2
|
Diab S, Nassereldine A, Alser M, Gómez Luna J, Mutlu O, El Hajj I. A framework for high-throughput sequence alignment using real processing-in-memory systems. Bioinformatics 2023; 39:btad155. [PMID: 36971586 PMCID: PMC10159653 DOI: 10.1093/bioinformatics/btad155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Revised: 02/24/2023] [Accepted: 03/25/2023] [Indexed: 03/29/2023] Open
Abstract
MOTIVATION Sequence alignment is a memory bound computation whose performance in modern systems is limited by the memory bandwidth bottleneck. Processing-in-memory (PIM) architectures alleviate this bottleneck by providing the memory with computing competencies. We propose Alignment-in-Memory (AIM), a framework for high-throughput sequence alignment using PIM, and evaluate it on UPMEM, the first publicly available general-purpose programmable PIM system. RESULTS Our evaluation shows that a real PIM system can substantially outperform server-grade multi-threaded CPU systems running at full-scale when performing sequence alignment for a variety of algorithms, read lengths, and edit distance thresholds. We hope that our findings inspire more work on creating and accelerating bioinformatics algorithms for such real PIM systems. AVAILABILITY AND IMPLEMENTATION Our code is available at https://github.com/safaad/aim.
Collapse
Affiliation(s)
- Safaa Diab
- Department of Computer Science, American University of Beirut, Riad El-Solh, Beirut 1107 2020, Lebanon
| | - Amir Nassereldine
- Department of Computer Science, American University of Beirut, Riad El-Solh, Beirut 1107 2020, Lebanon
| | - Mohammed Alser
- Department of Information Technology and Electrical Engineering, ETH Zürich, Gloriastrasse 35, Zürich 8092, Switzerland
| | - Juan Gómez Luna
- Department of Information Technology and Electrical Engineering, ETH Zürich, Gloriastrasse 35, Zürich 8092, Switzerland
| | - Onur Mutlu
- Department of Information Technology and Electrical Engineering, ETH Zürich, Gloriastrasse 35, Zürich 8092, Switzerland
| | - Izzat El Hajj
- Department of Computer Science, American University of Beirut, Riad El-Solh, Beirut 1107 2020, Lebanon
| |
Collapse
|
3
|
Firtina C, Park J, Alser M, Kim JS, Cali D, Shahroodi T, Ghiasi N, Singh G, Kanellopoulos K, Alkan C, Mutlu O. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023; 5:lqad004. [PMID: 36685727 PMCID: PMC9853099 DOI: 10.1093/nargab/lqad004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2022] [Revised: 12/16/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023] Open
Abstract
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, that can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher quality overlaps leading to accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
Collapse
Affiliation(s)
| | - Jisung Park
- ETH Zurich, Zurich 8092, Switzerland
- POSTECH, Pohang 37673, Republic of Korea
| | | | | | | | | | | | | | | | - Can Alkan
- Bilkent University, Ankara 06800, Turkey
| | | |
Collapse
|
4
|
Khatamifard SK, Chowdhury Z, Pande N, Razaviyayn M, Kim C, Karpuzcu UR. GeNVoM: Read Mapping Near Non-Volatile Memory. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3482-3496. [PMID: 34613917 DOI: 10.1109/tcbb.2021.3118018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
DNA sequencing is the physical/biochemical process of identifying the location of the four bases (Adenine, Guanine, Cytosine, Thymine) in a DNA strand. As semiconductor technology revolutionized computing, modern DNA sequencing technology (termed Next Generation Sequencing, NGS) revolutionized genomic research. As a result, modern NGS platforms can sequence hundreds of millions of short DNA fragments in parallel. The sequenced DNA fragments, representing the output of NGS platforms, are termed reads. Besides genomic variations, NGS imperfections induce noise in reads. Mapping each read to (the most similar portion of) a reference genome of the same species, i.e., read mapping, is a common critical first step in a diverse set of emerging bioinformatics applications. Mapping represents a search-heavy memory-intensive similarity matching problem, therefore, can greatly benefit from near-memory processing. Intuition suggests using fast associative search enabled by Ternary Content Addressable Memory (TCAM) by construction. However, the excessive energy consumption and lack of support for similarity matching (under NGS and genomic variation induced noise) renders direct application of TCAM infeasible, irrespective of volatility, where only non-volatile TCAM can accommodate the large memory footprint in an area-efficient way. This paper introduces GeNVoM, a scalable, energy-efficient and high-throughput solution. Instead of optimizing an algorithm developed for general-purpose computers or GPUs, GeNVoM rethinks the algorithm and non-volatile TCAM-based accelerator design together from the ground up. Thereby GeNVoM can improve the throughput by up to 3.67×; the energy consumption, by up to 1.36×, when compared to an ASIC baseline, which represents one of the highest-throughput implementations known.
Collapse
|
5
|
Yu C, Zhao Y, Zhao C, Ma H, Wang G. DiagAF: A More Accurate and Efficient Pre-Alignment Filter for Sequence Alignment. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3404-3415. [PMID: 34780330 DOI: 10.1109/tcbb.2021.3127879] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Sequence alignment is an essential step in computational genomics. More accurate and efficient sequence pre-alignment methods that run before conducting expensive computation for final verification are still urgently needed. In this article, we propose a more accurate and efficient pre-alignment algorithm for sequence alignment, called DiagAF. Firstly, DiagAF uses a new lower bound of edit distance based on shift hamming masks. The new lower bound makes use of fewer shift hamming masks comparing with state-of-the-art algorithms such as SHD and MAGNET. Moreover, it takes account the information of edit distance path exchanging on shift hamming masks. Secondly, DiagAF can deal with alignments of sequence pairs with not equal length, rather than state-of-the-art methods just for equal length. Thirdly, DiagAF can align sequences with early termination for true alignments. In the experiment, we compared DiagAF with state-of-the-art methods. DiagAF can achieve a much smaller error rate than them, meanwhile use less time than them. We believe that DiagAF algorithm can further improve the performance of state-of-the-art sequence alignment softwares. The source codes of DiagAF can be downloaded from web site https://github.com/BioLab-cz/DiagAF.
Collapse
|
6
|
Gudur VY, Maheshwari S, Acharyya A, Shafik R. An FPGA Based Energy-Efficient Read Mapper With Parallel Filtering and In-Situ Verification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2697-2711. [PMID: 34415836 DOI: 10.1109/tcbb.2021.3106311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
In the assembly pipeline of Whole Genome Sequencing (WGS), read mapping is a widely used method to re-assemble the genome. It employs approximate string matching and dynamic programming-based algorithms on a large volume of data and associated structures, making it a computationally intensive process. Currently, the state-of-the-art data centers for genome sequencing incur substantial setup and energy costs for maintaining hardware, data storage and cooling systems. To enable low-cost genomics, we propose an energy-efficient architectural methodology for read mapping using a single system-on-chip (SoC) platform. The proposed methodology is based on the q-gram lemma and designed using a novel architecture for filtering and verification. The filtering algorithm is designed using a parallel sorted q-gram lemma based method for the first time, and it is complemented by an in-situ verification routine using parallel Myers bit-vector algorithm. We have implemented our design on the Zynq Ultrascale+ XCZU9EG MPSoC platform. It is then extensively validated using real genomic data to demonstrate up to 7.8× energy reduction and up to 13.3× less resource utilization when compared with the state-of-the-art software and hardware approaches.
Collapse
|
7
|
Alser M, Lindegger J, Firtina C, Almadhoun N, Mao H, Singh G, Gomez-Luna J, Mutlu O. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Comput Struct Biotechnol J 2022; 20:4579-4599. [PMID: 36090814 PMCID: PMC9436709 DOI: 10.1016/j.csbj.2022.08.019] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2022] [Revised: 08/08/2022] [Accepted: 08/08/2022] [Indexed: 02/01/2023] Open
Abstract
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.
Collapse
Affiliation(s)
| | | | - Can Firtina
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
| | | | - Haiyu Mao
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
| | | | | | - Onur Mutlu
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
| |
Collapse
|
8
|
Kallenborn F, Cascitti J, Schmidt B. CARE 2.0: reducing false-positive sequencing error corrections using machine learning. BMC Bioinformatics 2022; 23:227. [PMID: 35698033 PMCID: PMC9195321 DOI: 10.1186/s12859-022-04754-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2022] [Accepted: 05/30/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. RESULTS We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. CONCLUSION False-positive corrections can negatively influence down-stream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced which improve k-mer analysis and de-novo assembly in real-world datasets which demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE .
Collapse
Affiliation(s)
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany.
| | - Julian Cascitti
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
| |
Collapse
|
9
|
Zuo Z, Tang C, Xu Y, Wang Y, Wu Y, Qi J, Shi X. Gene Position Index Mutation Detection Algorithm Based on Feedback Fast Learning Neural Network. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2021; 2021:1716182. [PMID: 34306047 PMCID: PMC8279879 DOI: 10.1155/2021/1716182] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/09/2021] [Accepted: 06/17/2021] [Indexed: 11/18/2022]
Abstract
In the detection of genome variation, the research on the internal correlation of reference genome is deepening; the detection of variation in genome sequence has become the focus of research, and it has also become an effective path to find new genes and new functional proteins. The targeted sequencing sequence is used to sequence the exon region of a specific gene in cancer gene detection, and the sequencing depth is relatively large. Traditional alignment algorithms will lose some sequences, which will lead to inaccurate mutation detection. This paper proposes a mutation detection algorithm based on feedback fast learning neural network position index. By establishing a position index relationship for ACGT in the DNA sequence, the subsequence is decomposed into the position relationship of different subsequences corresponding to the main sequence. The positional relationship of the subsequence in the main sequence is determined by the positional relationship. Analyzing SNP and InDel mutations, even structural mutations, through the position correlation of sequences has the advantages of high precision and easy implementation by personal computers. The feedback fast learning neural network is used to verify whether there is a linear relationship between two or more positions. Experimental results show that the mutation points detected by position index are more than those detected by Bcftools, Freebye, Vanscan2, and Gatk.
Collapse
Affiliation(s)
- Zhike Zuo
- Chongqing Key Laboratory of Spatial Data Mining and Big Data Integration for Ecology and Environment, Chongqing Finance and Economics College, Chongqing 401320, China
| | - Chao Tang
- Radiation & Cancer Biology Laboratory, Radiation Oncology Center, Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital & Chongqing Cancer Institute & Chongqing Cancer Hospital, Chongqing 400030, China
| | - Yu Xu
- Radiation & Cancer Biology Laboratory, Radiation Oncology Center, Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital & Chongqing Cancer Institute & Chongqing Cancer Hospital, Chongqing 400030, China
- College of Bioengineering, Chongqing University, Chongqing, China
| | - Ying Wang
- Radiation & Cancer Biology Laboratory, Radiation Oncology Center, Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital & Chongqing Cancer Institute & Chongqing Cancer Hospital, Chongqing 400030, China
| | - Yongzhong Wu
- Radiation & Cancer Biology Laboratory, Radiation Oncology Center, Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital & Chongqing Cancer Institute & Chongqing Cancer Hospital, Chongqing 400030, China
| | - Jun Qi
- Radiation & Cancer Biology Laboratory, Radiation Oncology Center, Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital & Chongqing Cancer Institute & Chongqing Cancer Hospital, Chongqing 400030, China
| | - Xiaolong Shi
- Radiation & Cancer Biology Laboratory, Radiation Oncology Center, Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital & Chongqing Cancer Institute & Chongqing Cancer Hospital, Chongqing 400030, China
| |
Collapse
|
10
|
Kallenborn F, Hildebrandt A, Schmidt B. CARE: context-aware sequencing read error correction. Bioinformatics 2021; 37:889-895. [PMID: 32818262 DOI: 10.1093/bioinformatics/btaa738] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2020] [Revised: 07/14/2020] [Accepted: 08/14/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS We present CARE-an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITYAND IMPLEMENTATION CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| | - Andreas Hildebrandt
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| | - Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| |
Collapse
|
11
|
Yan Y, Chaturvedi N, Appuswamy R. Accel-Align: a fast sequence mapper and aligner based on the seed-embed-extend method. BMC Bioinformatics 2021; 22:257. [PMID: 34016035 PMCID: PMC8139006 DOI: 10.1186/s12859-021-04162-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2020] [Accepted: 05/04/2021] [Indexed: 12/30/2022] Open
Abstract
Background Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequencing. All modern aligners use seed–filter–extend methodology and rely on filtration heuristics to reduce the overhead of edit distance computation. However, filtering has inherent performance–accuracy trade-offs that limits its effectiveness. Results Motivated by algorithmic advances in randomized low-distortion embedding, we introduce SEE, a new methodology for developing sequence mappers and aligners. While SFE focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates. To do so, SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates. To show that SEE performs well in practice, we present Accel-Align an SEE-based short-read sequence mapper and aligner that is 3–12\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\times$$\end{document}× faster than state-of-the-art aligners on commodity CPUs, without any special-purpose hardware, while providing comparable accuracy. Conclusions As sequencing technologies continue to increase read length while improving throughput and accuracy, we believe that randomized embeddings open up new avenues for optimization that cannot be achieved by using edit distance. Thus, the techniques presented in this paper have a much broader scope as they can be used for other applications like graph alignment, multiple sequence alignment, and sequence assembly. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04162-z.
Collapse
Affiliation(s)
- Yiqing Yan
- Data Science Department, EURECOM, 06410, Biot, France
| | | | | |
Collapse
|
12
|
Zou Y, Zhu Y, Li Y, Wu FX, Wang J. Parallel computing for genome sequence processing. Brief Bioinform 2021; 22:6210355. [PMID: 33822883 DOI: 10.1093/bib/bbab070] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2020] [Revised: 01/26/2021] [Accepted: 02/10/2021] [Indexed: 01/08/2023] Open
Abstract
The rapid increase of genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data and complex computing requirements, researchers have proposed many methods and tools which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures, and each of which is classified into two or three types and is further analyzed with their features. Then, the parallel computing for genome sequence processing is discussed with four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each kind of application, its background is firstly introduced, and then a list of tools or algorithms are summarized in the aspects of principle, hardware platform and computing efficiency. The programming model of each hardware and application provides a reference for researchers to choose high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.
Collapse
Affiliation(s)
- You Zou
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yuejie Zhu
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yaohang Li
- computer science at Old Dominion University, USA
| | - Fang-Xiang Wu
- College of Engineering and the Department of Computer Science at the University of Saskatchewan, Saskatoon, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering at Central South University, Changsha, Hunan, China
| |
Collapse
|
13
|
Alser M, Shahroodi T, Gómez-Luna J, Alkan C, Mutlu O. SneakySnake: a fast and accurate universal genome pre-alignment filter for CPUs, GPUs and FPGAs. Bioinformatics 2020; 36:5282-5290. [PMID: 33315064 DOI: 10.1093/bioinformatics/btaa1015] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2020] [Revised: 09/30/2020] [Accepted: 11/24/2020] [Indexed: 11/14/2022] Open
Abstract
Abstract
Motivation
We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem into SNR also makes SneakySnake efficient to implement on CPUs, GPUs and FPGAs.
Results
SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers’s bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7× and 43.9× (>12× on average), respectively, with its CPU implementation, and by up to 413× and 689× (>400× on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979× (276.9× on average) and 91.7× (31.7× on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g. configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities.
Availabilityand implementation
https://github.com/CMU-SAFARI/SneakySnake.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mohammed Alser
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
| | - Taha Shahroodi
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
| | - Juan Gómez-Luna
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
| | - Onur Mutlu
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| |
Collapse
|
14
|
Graves CE, Li C, Sheng X, Miller D, Ignowski J, Kiyama L, Strachan JP. In-Memory Computing with Memristor Content Addressable Memories for Pattern Matching. ADVANCED MATERIALS (DEERFIELD BEACH, FLA.) 2020; 32:e2003437. [PMID: 32761709 DOI: 10.1002/adma.202003437] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/19/2020] [Revised: 06/24/2020] [Indexed: 05/04/2023]
Abstract
The dramatic rise of data-intensive workloads has revived application-specific computational hardware for continuing speed and power improvements, frequently achieved by limiting data movement and implementing "in-memory computation". However, conventional complementary metal oxide semiconductor (CMOS) circuit designs can still suffer low power efficiency, motivating designs leveraging nonvolatile resistive random access memory (ReRAM), and with many studies focusing on crossbar circuit architectures. Another circuit primitive-content addressable memory (CAM)-shows great promise for mapping a diverse range of computational models for in-memory computation, with recent ReRAM-CAM designs proposed but few experimentally demonstrated. Here, programming and control of memristors across an 86 × 12 memristor ternary CAM (TCAM) array integrated with CMOS are demonstrated, and parameter tradeoffs for optimizing speed and search margin are evaluated. In addition to smaller area, this memristor TCAM results in significantly lower power due to very low programmable conductance states, motivating CAM use in a wider range of computational applications than conventional TCAMs are confined to today. Finally, the first experimental demonstration of two computational models in memristor TCAM arrays is reported: regular expression matching in a finite state machine for network security intrusion detection and definable inexact pattern matching in a Levenshtein automata for genomic sequencing.
Collapse
Affiliation(s)
- Catherine E Graves
- Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, 94304, USA
| | - Can Li
- Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, 94304, USA
| | - Xia Sheng
- Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, 94304, USA
| | - Darrin Miller
- Silicon Design Lab, Hewlett Packard Enterprise, Fort Collins, CO, 80528, USA
| | - Jim Ignowski
- Silicon Design Lab, Hewlett Packard Enterprise, Fort Collins, CO, 80528, USA
| | - Lennie Kiyama
- Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, 94304, USA
| | - John Paul Strachan
- Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, 94304, USA
| |
Collapse
|
15
|
Alser M, Hassan H, Kumar A, Mutlu O, Alkan C. Shouji: a fast and efficient pre-alignment filter for sequence alignment. Bioinformatics 2020; 35:4255-4263. [PMID: 30923804 DOI: 10.1093/bioinformatics/btz234] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2018] [Revised: 02/27/2019] [Accepted: 03/27/2019] [Indexed: 01/07/2023] Open
Abstract
MOTIVATION The ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm. RESULTS Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step. AVAILABILITY AND IMPLEMENTATION https://github.com/CMU-SAFARI/Shouji. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mohammed Alser
- Computer Science Department, ETH Zürich, Zürich, Switzerland.,Chair for Processor Design, Center For Advancing Electronics Dresden, Institute of Computer Engineering, Technische Universität Dresden, Dresden, Germany.,Computer Engineering Department, Bilkent University, Ankara, Turkey
| | - Hasan Hassan
- Computer Science Department, ETH Zürich, Zürich, Switzerland
| | - Akash Kumar
- Chair for Processor Design, Center For Advancing Electronics Dresden, Institute of Computer Engineering, Technische Universität Dresden, Dresden, Germany
| | - Onur Mutlu
- Computer Science Department, ETH Zürich, Zürich, Switzerland.,Computer Engineering Department, Bilkent University, Ankara, Turkey
| | - Can Alkan
- Computer Engineering Department, Bilkent University, Ankara, Turkey
| |
Collapse
|
16
|
Senol Cali D, Kim JS, Ghose S, Alkan C, Mutlu O. Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions. Brief Bioinform 2019; 20:1542-1559. [PMID: 29617724 PMCID: PMC6781587 DOI: 10.1093/bib/bby017] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2017] [Revised: 02/06/2018] [Indexed: 02/06/2023] Open
Abstract
Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.
Collapse
Affiliation(s)
- Damla Senol Cali
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Jeremie S Kim
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Department of Computer Science, Systems Group, ETH Zürich, Zürich, Switzerland
| | - Saugata Ghose
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
| | - Onur Mutlu
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Department of Computer Science, Systems Group, ETH Zürich, Zürich, Switzerland
| |
Collapse
|
17
|
Bayat A, Gaëta B, Ignjatovic A, Parameswaran S. Pairwise alignment of nucleotide sequences using maximal exact matches. BMC Bioinformatics 2019; 20:261. [PMID: 31113356 PMCID: PMC6528274 DOI: 10.1186/s12859-019-2827-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2018] [Accepted: 04/17/2019] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Pairwise alignment of short DNA sequences with affine-gap scoring is a common processing step performed in a range of bioinformatics analyses. Dynamic programming (i.e. Smith-Waterman algorithm) is widely used for this purpose. Despite using data level parallelisation, pairwise alignment consumes much time. There are faster alignment algorithms but they suffer from the lack of accuracy. RESULTS In this paper, we present MEM-Align, a fast semi-global alignment algorithm for short DNA sequences that allows for affine-gap scoring and exploit sequence similarity. In contrast to traditional alignment method (such as Smith-Waterman) where individual symbols are aligned, MEM-Align extracts Maximal Exact Matches (MEMs) using a bit-level parallel method and then looks for a subset of MEMs that forms the alignment using a novel dynamic programming method. MEM-Align tries to mimic alignment produced by Smith-Waterman. As a result, for 99.9% of input sequence pair, the computed alignment score is identical to the alignment score computed by Smith-Waterman. Yet MEM-Align is up to 14.5 times faster than the Smith-Waterman algorithm. Fast run-time is achieved by: (a) using a bit-level parallel method to extract MEMs; (b) processing MEMs rather than individual symbols; and, (c) applying heuristics. CONCLUSIONS MEM-Align is a potential candidate to replace other pairwise alignment algorithms used in processes such as DNA read-mapping and Variant-Calling.
Collapse
Affiliation(s)
- Arash Bayat
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
- Health and Biosecurity, CSIRO, 53/11 Julius Ave, North Ryde, Sydney, 2113 Australia
| | - Bruno Gaëta
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
| | - Aleksandar Ignjatovic
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
| | - Sri Parameswaran
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
| |
Collapse
|
18
|
Alser M, Hassan H, Xin H, Ergin O, Mutlu O, Alkan C. GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping. Bioinformatics 2018; 33:3355-3363. [PMID: 28575161 DOI: 10.1093/bioinformatics/btx342] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2016] [Accepted: 05/29/2017] [Indexed: 01/06/2023] Open
Abstract
Motivation High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments -called short reads- that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and 'candidate' locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms. Results We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10. Availability and implementation https://github.com/BilkentCompGen/GateKeeper. Contact mohammedalser@bilkent.edu.tr or onur.mutlu@inf.ethz.ch or calkan@cs.bilkent.edu.tr. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mohammed Alser
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey
| | - Hasan Hassan
- TOBB University of Economics & Technology, Sogutozu, Ankara, Turkey
- Department of Computer Science, ETH Zürich, 8092 Zürich, Switzerland
| | - Hongyi Xin
- Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Oguz Ergin
- TOBB University of Economics & Technology, Sogutozu, Ankara, Turkey
| | - Onur Mutlu
- Department of Computer Science, ETH Zürich, 8092 Zürich, Switzerland
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey
| |
Collapse
|
19
|
Kim JS, Senol Cali D, Xin H, Lee D, Ghose S, Alser M, Hassan H, Ergin O, Alkan C, Mutlu O. GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies. BMC Genomics 2018; 19:89. [PMID: 29764378 PMCID: PMC5954284 DOI: 10.1186/s12864-018-4460-0] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Abstract
Background Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments. Results We propose a novel seed location filtering algorithm, GRIM-Filter, optimized to exploit 3D-stacked memory systems that integrate computation within a logic layer stacked under memory layers, to perform processing-in-memory (PIM). GRIM-Filter quickly filters seed locations by 1) introducing a new representation of coarse-grained segments of the reference genome, and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for a sequence alignment error tolerance of 0.05, GRIM-Filter 1) reduces the false negative rate of filtering by 5.59x–6.41x, and 2) provides an end-to-end read mapper speedup of 1.81x–3.65x, compared to a state-of-the-art read mapper employing the best previous seed location filtering algorithm. Conclusion GRIM-Filter exploits 3D-stacked memory, which enables the efficient use of processing-in-memory, to overcome the memory bandwidth bottleneck in seed location filtering. We show that GRIM-Filter significantly improves the performance of a state-of-the-art read mapper. GRIM-Filter is a universal seed location filter that can be applied to any read mapper. We hope that our results provide inspiration for new works to design other bioinformatics algorithms that take advantage of emerging technologies and new processing paradigms, such as processing-in-memory using 3D-stacked memory devices.
Collapse
Affiliation(s)
- Jeremie S Kim
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. .,Department of Computer Science, ETH Zürich, Zürich, CH, Switzerland.
| | - Damla Senol Cali
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Hongyi Xin
- Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | | | - Saugata Ghose
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Mohammed Alser
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
| | - Hasan Hassan
- Department of Computer Science, ETH Zürich, Zürich, CH, Switzerland
| | - Oguz Ergin
- Department of Computer Engineering, TOBB University of Economics and Technology, Sogutozu, Ankara, Turkey
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
| | - Onur Mutlu
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. .,Department of Computer Science, ETH Zürich, Zürich, CH, Switzerland.
| |
Collapse
|
20
|
Schmidt B, Hildebrandt A. Next-generation sequencing: big data meets high performance computing. Drug Discov Today 2017; 22:712-717. [DOI: 10.1016/j.drudis.2017.01.014] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2016] [Revised: 12/16/2016] [Accepted: 01/25/2017] [Indexed: 12/17/2022]
|
21
|
Xin H, Nahar S, Zhu R, Emmons J, Pekhimenko G, Kingsford C, Alkan C, Mutlu O. Optimal seed solver: optimizing seed selection in read mapping. ACTA ACUST UNITED AC 2015; 32:1632-42. [PMID: 26568624 DOI: 10.1093/bioinformatics/btv670] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2015] [Accepted: 11/09/2015] [Indexed: 11/12/2022]
Abstract
MOTIVATION Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the ability of the mapper in selecting less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds. RESULTS We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently-occurring set of x seeds in an L-base-pair read in [Formula: see text] operations on average and in [Formula: see text] operations in the worst case, while generating a maximum of [Formula: see text] seed frequency database lookups. We compare OSS against four state-of-the-art seed selection schemes and observe that OSS provides a 3-fold reduction in average seed frequency over the best previous seed selection optimizations. AVAILABILITY AND IMPLEMENTATION We provide an implementation of the Optimal Seed Solver in C++ at: https://github.com/CMU-SAFARI/Optimal-Seed-Solver CONTACT hxin@cmu.edu, calkan@cs.bilkent.edu.tr or onur@cmu.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | | | - John Emmons
- Department of Computer Science and Engineering, Washington University, St. Louis, MO 63130, USA
| | | | - Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey and
| | - Onur Mutlu
- Computer Science Department, Department of Electrical and Computer Engineering
| |
Collapse
|