1
|
Zhou J, Chen S, Wu Y, Li H, Zhang B, Zhou L, Hu Y, Xiang Z, Li Z, Chen N, Han W, Xu C, Wang D, Gao X. PPML-Omics: A privacy-preserving federated machine learning method protects patients' privacy in omic data. SCIENCE ADVANCES 2024; 10:eadh8601. [PMID: 38295178 PMCID: PMC10830108 DOI: 10.1126/sciadv.adh8601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/18/2023] [Accepted: 12/29/2023] [Indexed: 02/02/2024]
Abstract
Modern machine learning models toward various tasks with omic data analysis give rise to threats of privacy leakage of patients involved in those datasets. Here, we proposed a secure and privacy-preserving machine learning method (PPML-Omics) by designing a decentralized differential private federated learning algorithm. We applied PPML-Omics to analyze data from three sequencing technologies and addressed the privacy concern in three major tasks of omic data under three representative deep learning models. We examined privacy breaches in depth through privacy attack experiments and demonstrated that PPML-Omics could protect patients' privacy. In each of these applications, PPML-Omics was able to outperform methods of comparison under the same level of privacy guarantee, demonstrating the versatility of the method in simultaneously balancing the privacy-preserving capability and utility in omic data analysis. Furthermore, we gave the theoretical proof of the privacy-preserving capability of PPML-Omics, suggesting the first mathematically guaranteed method with robust and generalizable empirical performance in protecting patients' privacy in omic data.
Collapse
Affiliation(s)
- Juexiao Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Siyuan Chen
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Yulian Wu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Haoyang Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Bin Zhang
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Longxi Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Yan Hu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Zihang Xiang
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Zhongxiao Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Ningning Chen
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Wenkai Han
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Chencheng Xu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Di Wang
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| |
Collapse
|
2
|
Aguado-Puig Q, Doblas M, Matzoros C, Espinosa A, Moure JC, Marco-Sola S, Moreto M. WFA-GPU: gap-affine pairwise read-alignment using GPUs. Bioinformatics 2023; 39:btad701. [PMID: 37975878 PMCID: PMC10697739 DOI: 10.1093/bioinformatics/btad701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2023] [Revised: 11/09/2023] [Accepted: 11/16/2023] [Indexed: 11/19/2023] Open
Abstract
MOTIVATION Advances in genomics and sequencing technologies demand faster and more scalable analysis methods that can process longer sequences with higher accuracy. However, classical pairwise alignment methods, based on dynamic programming (DP), impose impractical computational requirements to align long and noisy sequences like those produced by PacBio and Nanopore technologies. The recently proposed wavefront alignment (WFA) algorithm paves the way for more efficient alignment tools, improving time and memory complexity over previous methods. However, high-performance computing (HPC) platforms require efficient parallel algorithms and tools to exploit the computing resources available on modern accelerator-based architectures. RESULTS This paper presents WFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute exact gap-affine alignments based on the WFA algorithm. We present the algorithmic adaptations and performance optimizations that allow exploiting the massively parallel capabilities of modern GPU devices to accelerate the alignment computations. In particular, we propose a CPU-GPU co-design capable of performing inter-sequence and intra-sequence parallel sequence alignment, combining a succinct WFA-data representation with an efficient GPU implementation. As a result, we demonstrate that our implementation outperforms the original multi-threaded WFA implementation by up to 4.3× and up to 18.2× when using heuristic methods on long and noisy sequences. Compared to other state-of-the-art tools and libraries, the WFA-GPU is up to 29× faster than other GPU implementations and up to four orders of magnitude faster than other CPU implementations. Furthermore, WFA-GPU is the only GPU solution capable of correctly aligning long reads using a commodity GPU. AVAILABILITY AND IMPLEMENTATION WFA-GPU code and documentation are publicly available at https://github.com/quim0/WFA-GPU.
Collapse
Affiliation(s)
- Quim Aguado-Puig
- Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
| | - Max Doblas
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
| | - Christos Matzoros
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
| | - Antonio Espinosa
- Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
| | - Juan Carlos Moure
- Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
| | - Santiago Marco-Sola
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
| | - Miquel Moreto
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
| |
Collapse
|
3
|
Langarita R, Armejach A, Ibanez P, Alastruey-Benede J, Moreto M. Porting and Optimizing BWA-MEM2 Using the Fujitsu A64FX Processor. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3139-3153. [PMID: 37018085 DOI: 10.1109/tcbb.2023.3264514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
Sequence alignment pipelines for human genomes are an emerging workload that will dominate in the precision medicine field. BWA-MEM2 is a tool widely used in the scientific community to perform read mapping studies. In this paper, we port BWA-MEM2 to the AArch64 architecture using the ARMv8-A specification, and we compare the resulting version against an Intel Skylake system both in performance and in energy-to-solution. The porting effort entails numerous code modifications, since BWA-MEM2 implements certain kernels using x86_64 specific intrinsics, e.g., AVX-512. To adapt this code we use the recently introduced Arm's Scalable Vector Extensions (SVE). More specifically, we use Fujitsu's A64FX processor, the first to implement SVE. The A64FX powers the Fugaku Supercomputer that led the Top500 ranking from June 2020 to November 2021. After porting BWA-MEM2 we define and implement a number of optimizations to improve performance in the A64FX target architecture. We show that while the A64FX performance is lower than that of the Skylake system, A64FX delivers 11.6% better energy-to-solution on average. All the code used for this article is available at https://gitlab.bsc.es/rlangari/bwa-a64fx.
Collapse
|
4
|
Diab S, Nassereldine A, Alser M, Gómez Luna J, Mutlu O, El Hajj I. A framework for high-throughput sequence alignment using real processing-in-memory systems. Bioinformatics 2023; 39:btad155. [PMID: 36971586 PMCID: PMC10159653 DOI: 10.1093/bioinformatics/btad155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Revised: 02/24/2023] [Accepted: 03/25/2023] [Indexed: 03/29/2023] Open
Abstract
MOTIVATION Sequence alignment is a memory bound computation whose performance in modern systems is limited by the memory bandwidth bottleneck. Processing-in-memory (PIM) architectures alleviate this bottleneck by providing the memory with computing competencies. We propose Alignment-in-Memory (AIM), a framework for high-throughput sequence alignment using PIM, and evaluate it on UPMEM, the first publicly available general-purpose programmable PIM system. RESULTS Our evaluation shows that a real PIM system can substantially outperform server-grade multi-threaded CPU systems running at full-scale when performing sequence alignment for a variety of algorithms, read lengths, and edit distance thresholds. We hope that our findings inspire more work on creating and accelerating bioinformatics algorithms for such real PIM systems. AVAILABILITY AND IMPLEMENTATION Our code is available at https://github.com/safaad/aim.
Collapse
Affiliation(s)
- Safaa Diab
- Department of Computer Science, American University of Beirut, Riad El-Solh, Beirut 1107 2020, Lebanon
| | - Amir Nassereldine
- Department of Computer Science, American University of Beirut, Riad El-Solh, Beirut 1107 2020, Lebanon
| | - Mohammed Alser
- Department of Information Technology and Electrical Engineering, ETH Zürich, Gloriastrasse 35, Zürich 8092, Switzerland
| | - Juan Gómez Luna
- Department of Information Technology and Electrical Engineering, ETH Zürich, Gloriastrasse 35, Zürich 8092, Switzerland
| | - Onur Mutlu
- Department of Information Technology and Electrical Engineering, ETH Zürich, Gloriastrasse 35, Zürich 8092, Switzerland
| | - Izzat El Hajj
- Department of Computer Science, American University of Beirut, Riad El-Solh, Beirut 1107 2020, Lebanon
| |
Collapse
|
5
|
Khatamifard SK, Chowdhury Z, Pande N, Razaviyayn M, Kim C, Karpuzcu UR. GeNVoM: Read Mapping Near Non-Volatile Memory. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3482-3496. [PMID: 34613917 DOI: 10.1109/tcbb.2021.3118018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
DNA sequencing is the physical/biochemical process of identifying the location of the four bases (Adenine, Guanine, Cytosine, Thymine) in a DNA strand. As semiconductor technology revolutionized computing, modern DNA sequencing technology (termed Next Generation Sequencing, NGS) revolutionized genomic research. As a result, modern NGS platforms can sequence hundreds of millions of short DNA fragments in parallel. The sequenced DNA fragments, representing the output of NGS platforms, are termed reads. Besides genomic variations, NGS imperfections induce noise in reads. Mapping each read to (the most similar portion of) a reference genome of the same species, i.e., read mapping, is a common critical first step in a diverse set of emerging bioinformatics applications. Mapping represents a search-heavy memory-intensive similarity matching problem, therefore, can greatly benefit from near-memory processing. Intuition suggests using fast associative search enabled by Ternary Content Addressable Memory (TCAM) by construction. However, the excessive energy consumption and lack of support for similarity matching (under NGS and genomic variation induced noise) renders direct application of TCAM infeasible, irrespective of volatility, where only non-volatile TCAM can accommodate the large memory footprint in an area-efficient way. This paper introduces GeNVoM, a scalable, energy-efficient and high-throughput solution. Instead of optimizing an algorithm developed for general-purpose computers or GPUs, GeNVoM rethinks the algorithm and non-volatile TCAM-based accelerator design together from the ground up. Thereby GeNVoM can improve the throughput by up to 3.67×; the energy consumption, by up to 1.36×, when compared to an ASIC baseline, which represents one of the highest-throughput implementations known.
Collapse
|
6
|
Yu C, Zhao Y, Zhao C, Ma H, Wang G. DiagAF: A More Accurate and Efficient Pre-Alignment Filter for Sequence Alignment. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3404-3415. [PMID: 34780330 DOI: 10.1109/tcbb.2021.3127879] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Sequence alignment is an essential step in computational genomics. More accurate and efficient sequence pre-alignment methods that run before conducting expensive computation for final verification are still urgently needed. In this article, we propose a more accurate and efficient pre-alignment algorithm for sequence alignment, called DiagAF. Firstly, DiagAF uses a new lower bound of edit distance based on shift hamming masks. The new lower bound makes use of fewer shift hamming masks comparing with state-of-the-art algorithms such as SHD and MAGNET. Moreover, it takes account the information of edit distance path exchanging on shift hamming masks. Secondly, DiagAF can deal with alignments of sequence pairs with not equal length, rather than state-of-the-art methods just for equal length. Thirdly, DiagAF can align sequences with early termination for true alignments. In the experiment, we compared DiagAF with state-of-the-art methods. DiagAF can achieve a much smaller error rate than them, meanwhile use less time than them. We believe that DiagAF algorithm can further improve the performance of state-of-the-art sequence alignment softwares. The source codes of DiagAF can be downloaded from web site https://github.com/BioLab-cz/DiagAF.
Collapse
|
7
|
Gudur VY, Maheshwari S, Acharyya A, Shafik R. An FPGA Based Energy-Efficient Read Mapper With Parallel Filtering and In-Situ Verification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2697-2711. [PMID: 34415836 DOI: 10.1109/tcbb.2021.3106311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
In the assembly pipeline of Whole Genome Sequencing (WGS), read mapping is a widely used method to re-assemble the genome. It employs approximate string matching and dynamic programming-based algorithms on a large volume of data and associated structures, making it a computationally intensive process. Currently, the state-of-the-art data centers for genome sequencing incur substantial setup and energy costs for maintaining hardware, data storage and cooling systems. To enable low-cost genomics, we propose an energy-efficient architectural methodology for read mapping using a single system-on-chip (SoC) platform. The proposed methodology is based on the q-gram lemma and designed using a novel architecture for filtering and verification. The filtering algorithm is designed using a parallel sorted q-gram lemma based method for the first time, and it is complemented by an in-situ verification routine using parallel Myers bit-vector algorithm. We have implemented our design on the Zynq Ultrascale+ XCZU9EG MPSoC platform. It is then extensively validated using real genomic data to demonstrate up to 7.8× energy reduction and up to 13.3× less resource utilization when compared with the state-of-the-art software and hardware approaches.
Collapse
|
8
|
Alser M, Lindegger J, Firtina C, Almadhoun N, Mao H, Singh G, Gomez-Luna J, Mutlu O. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Comput Struct Biotechnol J 2022; 20:4579-4599. [PMID: 36090814 PMCID: PMC9436709 DOI: 10.1016/j.csbj.2022.08.019] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2022] [Revised: 08/08/2022] [Accepted: 08/08/2022] [Indexed: 02/01/2023] Open
Abstract
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.
Collapse
Affiliation(s)
| | | | - Can Firtina
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
| | | | - Haiyu Mao
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
| | | | | | - Onur Mutlu
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
| |
Collapse
|
9
|
Langarita R, Armejach A, Setoain J, Ibanez-Marin P, Alastruey-Benede J, Moreto M. Compressed Sparse FM-Index: Fast Sequence Alignment Using Large K-Steps. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:355-368. [PMID: 32750858 DOI: 10.1109/tcbb.2020.3000253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
The FM-index is a data structure used in genomics for exact search of input sequences over large reference genomes. Algorithms based on the FM-index show an irregular memory access pattern, resulting in a memory bound problem. We analyze a recent implementation of the FM-index and highlight existing throughput-memory trade-offs, showing that memory requirements limit implementation of large k-steps. We propose COFI, a COmpressed FM-Index for large K-steps. COFI enables a 15-step FM-index using less than 16 GB for a human genome reference of 3 giga base pairs. An algorithm based on this new layout is evaluated on both a Knights Landing (KNL) and an Skylake-based system (SKX). We achieve average speed-ups of 1.46× and 1.39×, respectively, with respect to an state-of-the-art FM-index implementation that is already well optimized.
Collapse
|
10
|
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. [PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023] Open
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Collapse
Affiliation(s)
- Mohammed Alser
- Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
| | - Jeremy Rotman
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Dhrithi Deshpande
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
| | - Kodi Taraszka
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Huwenbo Shi
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Pelin Icer Baykal
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
| | - Harry Taegyun Yang
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Victor Xue
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Sergey Knyazev
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
| | - Benjamin D Singer
- Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
- Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
| | - Brunilda Balliu
- Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - David Koslicki
- Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
- Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
- The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
| | - Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
| | - Can Alkan
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
| | - Onur Mutlu
- Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
| | - Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA.
| |
Collapse
|
11
|
Maheshwari S, Gudur VY, Shafik R, Wilson I, Yakovlev A, Acharyya A. CORAL: Verification-Aware OpenCL Based Read Mapper for Heterogeneous Systems. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1426-1438. [PMID: 31562102 DOI: 10.1109/tcbb.2019.2943856] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Genomics has the potential to transform medicine from reactive to a personalized, predictive, preventive, and participatory (P4) form. Being a Big Data application with continuously increasing rate of data production, the computational costs of genomics have become a daunting challenge. Most modern computing systems are heterogeneous consisting of various combinations of computing resources, such as CPUs, GPUs, and FPGAs. They require platform-specific software and languages to program making their simultaneous operation challenging. Existing read mappers and analysis tools in the whole genome sequencing (WGS) pipeline do not scale for such heterogeneity. Additionally, the computational cost of mapping reads is high due to expensive dynamic programming based verification, where optimized implementations are already available. Thus, improvement in filtration techniques is needed to reduce verification overhead. To address the aforementioned limitations with regards to the mapping element of the WGS pipeline, we propose a Cross-platfOrm Read mApper using opencL (CORAL). CORAL is capable of executing on heterogeneous devices/platforms, simultaneously. It can reduce computational time by suitably distributing the workload without any additional programming effort. We showcase this on a quadcore Intel CPU along with two Nvidia GTX 590 GPUs, distributing the workload judiciously to achieve up to 2× speedup compared to when, only, the CPUs are used. To reduce the verification overhead, CORAL dynamically adapts k-mer length during filtration. We demonstrate competitive timings in comparison with other mappers using real and simulated reads. CORAL is available at: https://github.com/nclaes/CORAL.
Collapse
|
12
|
Robinson T, Harkin J, Shukla P. Hardware Acceleration of Genomics Data Analysis: Challenges and Opportunities. Bioinformatics 2021; 37:1785-1795. [PMID: 34037688 PMCID: PMC8317111 DOI: 10.1093/bioinformatics/btab017] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2020] [Revised: 11/03/2020] [Accepted: 05/24/2021] [Indexed: 12/11/2022] Open
Abstract
The significant decline in the cost of genome sequencing has dramatically changed the typical bioinformatics pipeline for analysing sequencing data. Where traditionally, the computational challenge of sequencing is now secondary to genomic data analysis. Short read alignment (SRA) is a ubiquitous process within every modern bioinformatics pipeline in the field of genomics and is often regarded as the principal computational bottleneck. Many hardware and software approaches have been provided to solve the challenge of acceleration. However, previous attempts to increase throughput using many-core processing strategies have enjoyed limited success, mainly due to a dependence on global memory for each computational block. The limited scalability and high energy costs of many-core SRA implementations pose a significant constraint in maintaining acceleration. The Networks-On-Chip (NoC) hardware interconnect mechanism has advanced the scalability of many-core computing systems and, more recently, has demonstrated potential in SRA implementations by integrating multiple computational blocks such as pre-alignment filtering and sequence alignment efficiently, while minimising memory latency and global memory access. This paper provides a state of the art review on current hardware acceleration strategies for genomic data analysis, and it establishes the challenges and opportunities of utilising NoCs as a critical building block in next-generation sequencing (NGS) technologies for advancing the speed of analysis.
Collapse
Affiliation(s)
- Tony Robinson
- School of Computing, Engineering and Intelligent Systems, Ulster University, Magee Campus, Derry/Londonderry, BT48 7JL, UK
| | - Jim Harkin
- School of Computing, Engineering and Intelligent Systems, Ulster University, Magee Campus, Derry/Londonderry, BT48 7JL, UK
| | - Priyank Shukla
- Northern Ireland Centre for Stratified Medicine, Biomedical Sciences Research Institute, Ulster University, C-TRIC Building, Altnagelvin Area Hospital, Derry/Londonderry, BT47 6SB, UK
| |
Collapse
|
13
|
Zou Y, Zhu Y, Li Y, Wu FX, Wang J. Parallel computing for genome sequence processing. Brief Bioinform 2021; 22:6210355. [PMID: 33822883 DOI: 10.1093/bib/bbab070] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2020] [Revised: 01/26/2021] [Accepted: 02/10/2021] [Indexed: 01/08/2023] Open
Abstract
The rapid increase of genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data and complex computing requirements, researchers have proposed many methods and tools which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures, and each of which is classified into two or three types and is further analyzed with their features. Then, the parallel computing for genome sequence processing is discussed with four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each kind of application, its background is firstly introduced, and then a list of tools or algorithms are summarized in the aspects of principle, hardware platform and computing efficiency. The programming model of each hardware and application provides a reference for researchers to choose high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.
Collapse
Affiliation(s)
- You Zou
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yuejie Zhu
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yaohang Li
- computer science at Old Dominion University, USA
| | - Fang-Xiang Wu
- College of Engineering and the Department of Computer Science at the University of Saskatchewan, Saskatoon, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering at Central South University, Changsha, Hunan, China
| |
Collapse
|
14
|
Alser M, Shahroodi T, Gómez-Luna J, Alkan C, Mutlu O. SneakySnake: a fast and accurate universal genome pre-alignment filter for CPUs, GPUs and FPGAs. Bioinformatics 2020; 36:5282-5290. [PMID: 33315064 DOI: 10.1093/bioinformatics/btaa1015] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2020] [Revised: 09/30/2020] [Accepted: 11/24/2020] [Indexed: 11/14/2022] Open
Abstract
Abstract
Motivation
We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem into SNR also makes SneakySnake efficient to implement on CPUs, GPUs and FPGAs.
Results
SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers’s bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7× and 43.9× (>12× on average), respectively, with its CPU implementation, and by up to 413× and 689× (>400× on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979× (276.9× on average) and 91.7× (31.7× on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g. configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities.
Availabilityand implementation
https://github.com/CMU-SAFARI/SneakySnake.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mohammed Alser
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
| | - Taha Shahroodi
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
| | - Juan Gómez-Luna
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
| | - Onur Mutlu
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| |
Collapse
|
15
|
Firtina C, Kim JS, Alser M, Senol Cali D, Cicek AE, Alkan C, Mutlu O. Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm. Bioinformatics 2020; 36:3669-3679. [PMID: 32167530 DOI: 10.1093/bioinformatics/btaa179] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2019] [Revised: 12/16/2019] [Accepted: 03/11/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. RESULTS We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. AVAILABILITY AND IMPLEMENTATION Source code is available at https://github.com/CMU-SAFARI/Apollo. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Can Firtina
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
| | - Jeremie S Kim
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland.,Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Mohammed Alser
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
| | - Damla Senol Cali
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - A Ercument Cicek
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
| | - Onur Mutlu
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland.,Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.,Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
| |
Collapse
|
16
|
Graves CE, Li C, Sheng X, Miller D, Ignowski J, Kiyama L, Strachan JP. In-Memory Computing with Memristor Content Addressable Memories for Pattern Matching. ADVANCED MATERIALS (DEERFIELD BEACH, FLA.) 2020; 32:e2003437. [PMID: 32761709 DOI: 10.1002/adma.202003437] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/19/2020] [Revised: 06/24/2020] [Indexed: 05/04/2023]
Abstract
The dramatic rise of data-intensive workloads has revived application-specific computational hardware for continuing speed and power improvements, frequently achieved by limiting data movement and implementing "in-memory computation". However, conventional complementary metal oxide semiconductor (CMOS) circuit designs can still suffer low power efficiency, motivating designs leveraging nonvolatile resistive random access memory (ReRAM), and with many studies focusing on crossbar circuit architectures. Another circuit primitive-content addressable memory (CAM)-shows great promise for mapping a diverse range of computational models for in-memory computation, with recent ReRAM-CAM designs proposed but few experimentally demonstrated. Here, programming and control of memristors across an 86 × 12 memristor ternary CAM (TCAM) array integrated with CMOS are demonstrated, and parameter tradeoffs for optimizing speed and search margin are evaluated. In addition to smaller area, this memristor TCAM results in significantly lower power due to very low programmable conductance states, motivating CAM use in a wider range of computational applications than conventional TCAMs are confined to today. Finally, the first experimental demonstration of two computational models in memristor TCAM arrays is reported: regular expression matching in a finite state machine for network security intrusion detection and definable inexact pattern matching in a Levenshtein automata for genomic sequencing.
Collapse
Affiliation(s)
- Catherine E Graves
- Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, 94304, USA
| | - Can Li
- Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, 94304, USA
| | - Xia Sheng
- Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, 94304, USA
| | - Darrin Miller
- Silicon Design Lab, Hewlett Packard Enterprise, Fort Collins, CO, 80528, USA
| | - Jim Ignowski
- Silicon Design Lab, Hewlett Packard Enterprise, Fort Collins, CO, 80528, USA
| | - Lennie Kiyama
- Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, 94304, USA
| | - John Paul Strachan
- Hewlett Packard Labs, Hewlett Packard Enterprise, Palo Alto, CA, 94304, USA
| |
Collapse
|
17
|
Alser M, Hassan H, Kumar A, Mutlu O, Alkan C. Shouji: a fast and efficient pre-alignment filter for sequence alignment. Bioinformatics 2020; 35:4255-4263. [PMID: 30923804 DOI: 10.1093/bioinformatics/btz234] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2018] [Revised: 02/27/2019] [Accepted: 03/27/2019] [Indexed: 01/07/2023] Open
Abstract
MOTIVATION The ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm. RESULTS Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step. AVAILABILITY AND IMPLEMENTATION https://github.com/CMU-SAFARI/Shouji. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mohammed Alser
- Computer Science Department, ETH Zürich, Zürich, Switzerland.,Chair for Processor Design, Center For Advancing Electronics Dresden, Institute of Computer Engineering, Technische Universität Dresden, Dresden, Germany.,Computer Engineering Department, Bilkent University, Ankara, Turkey
| | - Hasan Hassan
- Computer Science Department, ETH Zürich, Zürich, Switzerland
| | - Akash Kumar
- Chair for Processor Design, Center For Advancing Electronics Dresden, Institute of Computer Engineering, Technische Universität Dresden, Dresden, Germany
| | - Onur Mutlu
- Computer Science Department, ETH Zürich, Zürich, Switzerland.,Computer Engineering Department, Bilkent University, Ankara, Turkey
| | - Can Alkan
- Computer Engineering Department, Bilkent University, Ankara, Turkey
| |
Collapse
|
18
|
Namburete EI, Dippenaar A, Conceição EC, Feliciano C, Nascimento MMPD, Peronni KC, Silva WA, Ferro JJ, Harrison LH, Warren RM, Bollela VR. Phylogenomic assessment of drug-resistant Mycobacterium tuberculosis strains from Beira, Mozambique. Tuberculosis (Edinb) 2020; 121:101905. [PMID: 32063558 PMCID: PMC9300053 DOI: 10.1016/j.tube.2020.101905] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2019] [Revised: 12/21/2019] [Accepted: 01/26/2020] [Indexed: 11/24/2022]
Abstract
BACKGROUND Mozambique is a high-burden tuberculosis (TB) country where TB/HIV co-infection and drug resistant TB (DR-TB) incidence is increasing. Whole genome sequencing (WGS) comprehensively describes the molecular epidemiology of TB, allows prediction of DR-TB phenotypes, lineages strains identification and better understanding of transmission chains. OBJECTIVE To describe genetic diversity of DR-TB Mycobacterium tuberculosis isolated in Beira, Mozambique. METHODS Descriptive cross-sectional study with 35 M. tuberculosis isolates, resistant to at least one first-line drug on molecular drug-susceptibility tests (DST). Variant identification, DR prediction and phylogenetic analysis provided by WGS, drug-susceptibility pattern compared to line-probe assay (LPA): Genotype MTBDRTMplus and MTBDRTMsl. FINDINGS Lineage 4 (L4) was the most prevalent: 25 (71.4%) isolates; 5 (14.3%) L1 and 5 (14.3%) L2. WGS showed 33/35 (94.3%) isolates resistant to at least one drug, two pan-susceptible isolates that were previously diagnosed as DR-TB with genotype MTBDRplus. Concordance between WGS and LPA: 88.6% for isoniazid (INH), 85.7% to rifampicin (RPM), 91.4% for quinolones and 100% to second line injectable drugs. There were three possible TB transmission chains, 10 strains showing recent transmission. CONCLUSION WGS provided reliable information about the most frequent lineages related to DR-TB in Beira, Mozambique: L4.3 (LAM), L2 (Beijing) and L1 (EAI) and possible recent transmission chain.
Collapse
Affiliation(s)
- Evangelina Inacio Namburete
- Mycobacteria Research Lab at Clinics Hospital from Ribeirão Preto Medical School, University of São Paulo (FMRP-USP), Brazil; Faculty of Health Science, Catholic University of Mozambique, Beira, Mozambique.
| | - Anzaan Dippenaar
- DST-NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, South Africa.
| | - Emilyn Costa Conceição
- Instituto de Microbiologia Paulo de Góes, Universidade Federal do Rio de Janeiro, Av. Carlos Chagas Filho, 373 - bloco i - Cidade Universitária, Rio de Janeiro, RJ, 21941-970, Brazil.
| | - Cinara Feliciano
- Mycobacteria Research Lab at Clinics Hospital from Ribeirão Preto Medical School, University of São Paulo (FMRP-USP), Brazil.
| | | | - Kamila Chagas Peronni
- Center for Medical Genomics, Clinics Hospital at Ribeirão Preto Medical School, University of São Paulo (FMRP-USP), Brazil.
| | - Wilson Araújo Silva
- Center for Medical Genomics, Clinics Hospital at Ribeirão Preto Medical School, University of São Paulo (FMRP-USP), Brazil; Department of Genetics at Ribeirão Preto Medical School, University of São Paulo (FMRP-USP), Brazil.
| | - Josefo João Ferro
- Faculty of Health Science, Catholic University of Mozambique, Beira, Mozambique.
| | - Lee H Harrison
- Infectious Diseases Epidemiology Research Unit, University of Pittsburgh, Pittsburgh, USA.
| | - Robin Mark Warren
- DST-NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, South Africa.
| | - Valdes Roberto Bollela
- Mycobacteria Research Lab at Clinics Hospital from Ribeirão Preto Medical School, University of São Paulo (FMRP-USP), Brazil.
| |
Collapse
|
19
|
Becker M, Worlikar U, Agrawal S, Schultze H, Ulas T, Singhal S, Schultze JL. Scaling Genomics Data Processing with Memory-Driven Computing to Accelerate Computational Biology. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7295347 DOI: 10.1007/978-3-030-50743-5_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Research is increasingly becoming data-driven, and natural sciences are not an exception. In both biology and medicine, we are observing an exponential growth of structured data collections from experiments and population studies, enabling us to gain novel insights that would otherwise not be possible. However, these growing data sets pose a challenge for existing compute infrastructures since data is outgrowing limits within compute. In this work, we present the application of a novel approach, Memory-Driven Computing (MDC), in the life sciences. MDC proposes a data-centric approach that has been designed for growing data sizes and provides a composable infrastructure for changing workloads. In particular, we show how a typical pipeline for genomics data processing can be accelerated, and application modifications required to exploit this novel architecture. Furthermore, we demonstrate how the isolated evaluation of individual tasks misses significant overheads of typical pipelines in genomics data processing.
Collapse
|
20
|
Senol Cali D, Kim JS, Ghose S, Alkan C, Mutlu O. Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions. Brief Bioinform 2019; 20:1542-1559. [PMID: 29617724 PMCID: PMC6781587 DOI: 10.1093/bib/bby017] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2017] [Revised: 02/06/2018] [Indexed: 02/06/2023] Open
Abstract
Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.
Collapse
Affiliation(s)
- Damla Senol Cali
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Jeremie S Kim
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Department of Computer Science, Systems Group, ETH Zürich, Zürich, Switzerland
| | - Saugata Ghose
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
| | - Onur Mutlu
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Department of Computer Science, Systems Group, ETH Zürich, Zürich, Switzerland
| |
Collapse
|
21
|
Bayat A, Gaëta B, Ignjatovic A, Parameswaran S. Pairwise alignment of nucleotide sequences using maximal exact matches. BMC Bioinformatics 2019; 20:261. [PMID: 31113356 PMCID: PMC6528274 DOI: 10.1186/s12859-019-2827-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2018] [Accepted: 04/17/2019] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Pairwise alignment of short DNA sequences with affine-gap scoring is a common processing step performed in a range of bioinformatics analyses. Dynamic programming (i.e. Smith-Waterman algorithm) is widely used for this purpose. Despite using data level parallelisation, pairwise alignment consumes much time. There are faster alignment algorithms but they suffer from the lack of accuracy. RESULTS In this paper, we present MEM-Align, a fast semi-global alignment algorithm for short DNA sequences that allows for affine-gap scoring and exploit sequence similarity. In contrast to traditional alignment method (such as Smith-Waterman) where individual symbols are aligned, MEM-Align extracts Maximal Exact Matches (MEMs) using a bit-level parallel method and then looks for a subset of MEMs that forms the alignment using a novel dynamic programming method. MEM-Align tries to mimic alignment produced by Smith-Waterman. As a result, for 99.9% of input sequence pair, the computed alignment score is identical to the alignment score computed by Smith-Waterman. Yet MEM-Align is up to 14.5 times faster than the Smith-Waterman algorithm. Fast run-time is achieved by: (a) using a bit-level parallel method to extract MEMs; (b) processing MEMs rather than individual symbols; and, (c) applying heuristics. CONCLUSIONS MEM-Align is a potential candidate to replace other pairwise alignment algorithms used in processes such as DNA read-mapping and Variant-Calling.
Collapse
Affiliation(s)
- Arash Bayat
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
- Health and Biosecurity, CSIRO, 53/11 Julius Ave, North Ryde, Sydney, 2113 Australia
| | - Bruno Gaëta
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
| | - Aleksandar Ignjatovic
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
| | - Sri Parameswaran
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
| |
Collapse
|
22
|
Mozafari F, Babashah H, Koohi S, Kavehvash Z. Speeding up DNA sequence alignment by optical correlator. OPTICS & LASER TECHNOLOGY 2018; 108:124-135. [DOI: 10.1016/j.optlastec.2018.06.027] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/19/2023]
|
23
|
Kim JS, Senol Cali D, Xin H, Lee D, Ghose S, Alser M, Hassan H, Ergin O, Alkan C, Mutlu O. GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies. BMC Genomics 2018; 19:89. [PMID: 29764378 PMCID: PMC5954284 DOI: 10.1186/s12864-018-4460-0] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Abstract
Background Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments. Results We propose a novel seed location filtering algorithm, GRIM-Filter, optimized to exploit 3D-stacked memory systems that integrate computation within a logic layer stacked under memory layers, to perform processing-in-memory (PIM). GRIM-Filter quickly filters seed locations by 1) introducing a new representation of coarse-grained segments of the reference genome, and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for a sequence alignment error tolerance of 0.05, GRIM-Filter 1) reduces the false negative rate of filtering by 5.59x–6.41x, and 2) provides an end-to-end read mapper speedup of 1.81x–3.65x, compared to a state-of-the-art read mapper employing the best previous seed location filtering algorithm. Conclusion GRIM-Filter exploits 3D-stacked memory, which enables the efficient use of processing-in-memory, to overcome the memory bandwidth bottleneck in seed location filtering. We show that GRIM-Filter significantly improves the performance of a state-of-the-art read mapper. GRIM-Filter is a universal seed location filter that can be applied to any read mapper. We hope that our results provide inspiration for new works to design other bioinformatics algorithms that take advantage of emerging technologies and new processing paradigms, such as processing-in-memory using 3D-stacked memory devices.
Collapse
Affiliation(s)
- Jeremie S Kim
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. .,Department of Computer Science, ETH Zürich, Zürich, CH, Switzerland.
| | - Damla Senol Cali
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Hongyi Xin
- Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | | | - Saugata Ghose
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Mohammed Alser
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
| | - Hasan Hassan
- Department of Computer Science, ETH Zürich, Zürich, CH, Switzerland
| | - Oguz Ergin
- Department of Computer Engineering, TOBB University of Economics and Technology, Sogutozu, Ankara, Turkey
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
| | - Onur Mutlu
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. .,Department of Computer Science, ETH Zürich, Zürich, CH, Switzerland.
| |
Collapse
|
24
|
Liu D, Bu D, Shi T, Quan J, Wang D, Shi Y, Bo XC, Han W. Biological data processing based on bio-processor unit (BPU), a new concept for next generation computational biology. SCIENCE CHINA-LIFE SCIENCES 2018. [PMID: 29541989 DOI: 10.1007/s11427-018-9278-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Affiliation(s)
- Di Liu
- Computational Virology Group, Wuhan Institute of Virology, Chinese Academy of Sciences, Wuhan, 430071, China.
| | - Dongbo Bu
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
| | - Tieliu Shi
- Center for Bioinformatics and Computational Biology, and the Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, Shanghai, 200214, China
| | - Jianxiao Quan
- State Key Laboratory of Mathematical Engineering and Advanced Computation, Wuxi, 214122, China.,Jiangsu Variable Supercomputer Tech Co, Ltd., Wuxi, 214125, China
| | | | - Yongyong Shi
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education), Shanghai Jiao Tong University, Shanghai, 200030, China
| | - Xiao-Chen Bo
- Beijing Institute of Radiation Medicine, Beijing, 100850, China
| | - Wenbao Han
- State Key Laboratory of Mathematical Engineering and Advanced Computation, Wuxi, 214122, China.
| |
Collapse
|