Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Xin H, Lee D, Hormozdiari F, Yedkar S, Mutlu O, Alkan C. Accelerating read mapping with FastHASH. BMC Genomics 2013;14 Suppl 1:S13. [PMID: 23369189 PMCID: PMC3549798 DOI: 10.1186/1471-2164-14-s1-s13] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Indexed: 11/10/2022] Open

Cited by Other Article(s)

Wei ZG, Bu PY, Zhang XD, Liu F, Qian Y, Wu FX. invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. Bioinformatics 2023;39:btad726. [PMID: 38058196 PMCID: PMC11320709 DOI: 10.1093/bioinformatics/btad726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 11/02/2023] [Accepted: 12/05/2023] [Indexed: 12/08/2023] Open

Langarita R, Armejach A, Ibanez P, Alastruey-Benede J, Moreto M. Porting and Optimizing BWA-MEM2 Using the Fujitsu A64FX Processor. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023;20:3139-3153. [PMID: 37018085 DOI: 10.1109/tcbb.2023.3264514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]

Diab S, Nassereldine A, Alser M, Gómez Luna J, Mutlu O, El Hajj I. A framework for high-throughput sequence alignment using real processing-in-memory systems. Bioinformatics 2023;39:btad155. [PMID: 36971586 PMCID: PMC10159653 DOI: 10.1093/bioinformatics/btad155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2022] [Revised: 02/24/2023] [Accepted: 03/25/2023] [Indexed: 03/29/2023] Open

Firtina C, Park J, Alser M, Kim JS, Cali D, Shahroodi T, Ghiasi N, Singh G, Kanellopoulos K, Alkan C, Mutlu O. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023;5:lqad004. [PMID: 36685727 PMCID: PMC9853099 DOI: 10.1093/nargab/lqad004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 12/16/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023] Open

Yu C, Zhao Y, Zhao C, Ma H, Wang G. DiagAF: A More Accurate and Efficient Pre-Alignment Filter for Sequence Alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022;19:3404-3415. [PMID: 34780330 DOI: 10.1109/tcbb.2021.3127879] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]

Lopez JO, Seguel J, Chamorro A, Ramos KS. Pattern matching for high precision detection of LINE-1s in human genomes. BMC Bioinformatics 2022;23:375. [PMID: 36100885 PMCID: PMC9472350 DOI: 10.1186/s12859-022-04907-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2021] [Accepted: 08/05/2022] [Indexed: 08/30/2023] Open

Abstract

Background

Long interspersed element 1 (LINE-1 or L1) retrotransposons are mobile elements that constitute 17–20% of the human genome. Strong correlations between abnormal L1 expression and several human diseases have been reported. This has motivated increasing interest in accurate quantification of the number of L1 copies present in any given biologic specimen. A main obstacle toward this aim is that L1s are relatively long DNA segments with regions of high variability, or largely present in the human genome as truncated fragments. These particularities render traditional alignment strategies, such as seed-and-extend inefficient, as the number of segments that are similar to L1s explodes exponentially. This study uses the pattern matching methodology for more accurate identification of L1s. We validate experimentally the superiority of pattern matching for L1 detection over alternative methods and discuss some of its potential applications.

Results

Pattern matching detected full-length L1 copies with high precision, reasonable computational time, and no prior input information. It also detected truncated and significantly altered copies of L1 with relatively high precision. The method was effectively used to annotate L1s in a target genome and to calculate copy number variation with respect to a reference genome. Crucial to the success of implementation was the selection of a small set of k-mer probes from a set of sequences presenting a stable pattern of distribution in the genome. As in seed-and-extend methods, the pattern matching algorithm sowed these k-mer probes, but instead of using heuristic extensions around the seeds, the analysis was based on distribution patterns within the genome. The desired level of precision could be adjusted, with some loss of recall.

Conclusion

Pattern matching is more efficient than seed-and-extend methods for the detection of L1 segments whose characterization depends on a finite set of sequences with common areas of low variability. We propose that pattern matching may help establish correlations between L1 copy number and disease states associated with L1 mobilization and evolution.

Alser M, Lindegger J, Firtina C, Almadhoun N, Mao H, Singh G, Gomez-Luna J, Mutlu O. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Comput Struct Biotechnol J 2022;20:4579-4599. [PMID: 36090814 PMCID: PMC9436709 DOI: 10.1016/j.csbj.2022.08.019] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/12/2022] [Revised: 08/08/2022] [Accepted: 08/08/2022] [Indexed: 02/01/2023] Open

Langarita R, Armejach A, Setoain J, Ibanez-Marin P, Alastruey-Benede J, Moreto M. Compressed Sparse FM-Index: Fast Sequence Alignment Using Large K-Steps. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022;19:355-368. [PMID: 32750858 DOI: 10.1109/tcbb.2020.3000253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]

Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021;22:249. [PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023] Open

Affiliation(s)

Mohammed Alser Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
Jeremy Rotman Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
Dhrithi Deshpande Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
Kodi Taraszka Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
Huwenbo Shi Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
Pelin Icer Baykal Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
Harry Taegyun Yang Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
Victor Xue Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
Sergey Knyazev Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
Benjamin D Singer Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
Brunilda Balliu Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
David Koslicki Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA Biology Department, Pennsylvania State University, University Park, PA, 16801, USA The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
Pavel Skums Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
Alex Zelikovsky Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
Can Alkan Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
Onur Mutlu Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
Serghei Mangul Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA.

Yan Y, Chaturvedi N, Appuswamy R. Accel-Align: a fast sequence mapper and aligner based on the seed-embed-extend method. BMC Bioinformatics 2021;22:257. [PMID: 34016035 PMCID: PMC8139006 DOI: 10.1186/s12859-021-04162-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/31/2020] [Accepted: 05/04/2021] [Indexed: 12/30/2022] Open

Abstract

Background

Improvements in sequencing technology continue to drive sequencing cost towards $100 per genome. However, mapping sequenced data to a reference genome remains a computationally-intensive task due to the dependence on edit distance for dealing with INDELs and mismatches introduced by sequencing. All modern aligners use seed–filter–extend methodology and rely on filtration heuristics to reduce the overhead of edit distance computation. However, filtering has inherent performance–accuracy trade-offs that limits its effectiveness.

Results

Motivated by algorithmic advances in randomized low-distortion embedding, we introduce SEE, a new methodology for developing sequence mappers and aligners. While SFE focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates. To do so, SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates. To show that SEE performs well in practice, we present Accel-Align an SEE-based short-read sequence mapper and aligner that is 3–12× faster than state-of-the-art aligners on commodity CPUs, without any special-purpose hardware, while providing comparable accuracy.
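
The abstract does not name the embedding it relies on. As a rough illustration of what moving strings "from edit distance regime to the Hamming regime" can look like, the sketch below uses a CGK-style random-walk embedding, which is an assumption chosen here for illustration and not necessarily what Accel-Align implements. The function names (`make_random_bits`, `embed`, `hamming`) and the pad symbol are hypothetical.

```python
import random

ALPHABET = "ACGT"
PAD = "N"  # hypothetical pad symbol emitted once the input is exhausted


def make_random_bits(n, seed=0):
    """Draw one random bit per (output position, base) for inputs up to length n.

    The same bits must be shared by every string that is to be compared.
    """
    rng = random.Random(seed)
    return [{base: rng.randint(0, 1) for base in ALPHABET} for _ in range(3 * n)]


def embed(s, bits):
    """CGK-style embedding: copy s[i] to the output, then advance i only if
    the random bit for (output position, s[i]) is 1. Output length is len(bits)."""
    out = []
    i = 0
    for step in bits:
        if i < len(s):
            out.append(s[i])
            i += step[s[i]]
        else:
            out.append(PAD)
    return "".join(out)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    bits = make_random_bits(8, seed=1)
    read = embed("ACGTACGT", bits)
    candidate = embed("ACGACGT", bits)  # one deletion relative to the read
    print(len(read), hamming(read, candidate))
```

Because the output is randomized, the printed distance depends on the seed; the only guarantees the sketch relies on are that identical inputs embed identically and that every embedded string has the same length, so a linear-time Hamming comparison is always defined.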

Conclusions

As sequencing technologies continue to increase read length while improving throughput and accuracy, we believe that randomized embeddings open up new avenues for optimization that cannot be achieved by using edit distance. Thus, the techniques presented in this paper have a much broader scope as they can be used for other applications like graph alignment, multiple sequence alignment, and sequence assembly.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-021-04162-z.

Alser M, Shahroodi T, Gómez-Luna J, Alkan C, Mutlu O. SneakySnake: a fast and accurate universal genome pre-alignment filter for CPUs, GPUs and FPGAs. Bioinformatics 2020;36:5282-5290. [PMID: 33315064 DOI: 10.1093/bioinformatics/btaa1015] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 03/10/2020] [Revised: 09/30/2020] [Accepted: 11/24/2020] [Indexed: 11/14/2022] Open

Abstract

Motivation

We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem into SNR also makes SneakySnake efficient to implement on CPUs, GPUs and FPGAs.

Results

SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers’s bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7× and 43.9× (>12× on average), respectively, with its CPU implementation, and by up to 413× and 689× (>400× on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979× (276.9× on average) and 91.7× (31.7× on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g. configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities.

Availability and implementation

https://github.com/CMU-SAFARI/SneakySnake.

Supplementary information

Supplementary data are available at Bioinformatics online.

Firtina C, Kim JS, Alser M, Senol Cali D, Cicek AE, Alkan C, Mutlu O. Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm. Bioinformatics 2020;36:3669-3679. [PMID: 32167530 DOI: 10.1093/bioinformatics/btaa179] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 02/13/2019] [Revised: 12/16/2019] [Accepted: 03/11/2020] [Indexed: 11/13/2022] Open

Abstract

MOTIVATION

Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively.

RESULTS

We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts.

AVAILABILITY AND IMPLEMENTATION

Source code is available at https://github.com/CMU-SAFARI/Apollo.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Wei ZG, Zhang SW, Liu F. smsMap: mapping single molecule sequencing reads by locating the alignment starting positions. BMC Bioinformatics 2020;21:341. [PMID: 32753028 PMCID: PMC7430848 DOI: 10.1186/s12859-020-03698-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/26/2020] [Accepted: 07/23/2020] [Indexed: 01/09/2023] Open

Alser M, Hassan H, Kumar A, Mutlu O, Alkan C. Shouji: a fast and efficient pre-alignment filter for sequence alignment. Bioinformatics 2020;35:4255-4263. [PMID: 30923804 DOI: 10.1093/bioinformatics/btz234] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/13/2018] [Revised: 02/27/2019] [Accepted: 03/27/2019] [Indexed: 01/07/2023] Open

Abstract

MOTIVATION

The ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm.

RESULTS

Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step.

AVAILABILITY AND IMPLEMENTATION

https://github.com/CMU-SAFARI/Shouji.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Xin H, Shao M, Kingsford C. Context-aware seeds for read mapping. Algorithms Mol Biol 2020;15:10. [PMID: 32489399 PMCID: PMC7245042 DOI: 10.1186/s13015-020-00172-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2019] [Accepted: 05/15/2020] [Indexed: 11/10/2022] Open

Shen F, Kidd JM. Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2. Genes (Basel) 2020;11:genes11020141. [PMID: 32013076 PMCID: PMC7073954 DOI: 10.3390/genes11020141] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/11/2020] [Revised: 01/21/2020] [Accepted: 01/24/2020] [Indexed: 12/22/2022] Open

Haghshenas E, Sahinalp SC, Hach F. lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data. Bioinformatics 2019;35:20-27. [PMID: 30561550 DOI: 10.1093/bioinformatics/bty544] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 08/16/2017] [Accepted: 06/28/2018] [Indexed: 02/01/2023] Open

Abstract

Motivation

Recent advances in genomics and precision medicine have been made possible through the application of high throughput sequencing (HTS) to large collections of human genomes. Although HTS technologies have proven their use in cataloging human genome variation, computational analysis of the data they generate is still far from being perfect. The main limitation of Illumina and other popular sequencing technologies is their short read length relative to the lengths of (common) genomic repeats. Newer (single molecule sequencing - SMS) technologies such as Pacific Biosciences and Oxford Nanopore are producing longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. Unfortunately, because of their high sequencing error rate, reads generated by these technologies are very difficult to work with and cannot be used in many of the standard downstream analysis pipelines. Note that it is not only difficult to find the correct mapping locations of such reads in a reference genome, but also to establish their correct alignment so as to differentiate sequencing errors from real genomic variants. Furthermore, especially since newer SMS instruments provide higher throughput, mapping and alignment need to be performed much faster than before, maintaining high sensitivity.

Results

We introduce lordFAST, a novel long-read mapper that is specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint.

Availability and implementation

lordFAST is implemented in C++ and supports multi-threading. The source code of lordFAST is available at https://github.com/vpc-ccg/lordfast.

Supplementary information

Supplementary data are available at Bioinformatics online.

Salavati M, Bush SJ, Palma-Vera S, McCulloch MEB, Hume DA, Clark EL. Elimination of Reference Mapping Bias Reveals Robust Immune Related Allele-Specific Expression in Crossbred Sheep. Front Genet 2019;10:863. [PMID: 31608110 PMCID: PMC6761296 DOI: 10.3389/fgene.2019.00863] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/25/2019] [Accepted: 08/19/2019] [Indexed: 12/13/2022] Open

Abstract

Pervasive allelic variation at both gene and single nucleotide level (SNV) between individuals is commonly associated with complex traits in humans and animals. Allele-specific expression (ASE) analysis, using RNA-Seq, can provide a detailed annotation of allelic imbalance and infer the existence of cis-acting transcriptional regulation. However, variant detection in RNA-Seq data is compromised by biased mapping of reads to the reference DNA sequence. In this manuscript, we describe an unbiased standardized computational pipeline for allele-specific expression analysis using RNA-Seq data, which we have adapted and developed using tools available under open license. The analysis pipeline we present is designed to minimize reference bias while providing accurate profiling of allele-specific expression across tissues and cell types. Using this methodology, we were able to profile pervasive allelic imbalance across tissues and cell types, at both the gene and SNV level, in Texel×Scottish Blackface sheep, using the sheep gene expression atlas data set. ASE profiles were pervasive in each sheep and across all tissue types investigated. However, ASE profiles shared across tissues were limited, and instead, they tended to be highly tissue-specific. These tissue-specific ASE profiles may underlie the expression of economically important traits and could be utilized as weighted SNVs, for example, to improve the accuracy of genomic selection in breeding programs for sheep. An additional benefit of the pipeline is that it does not require parental genotypes and can therefore be applied to other RNA-Seq data sets for livestock, including those available on the Functional Annotation of Animal Genomes (FAANG) data portal. This study is the first global characterization of moderate to extreme ASE in tissues and cell types from sheep. 
We have applied a robust methodology for ASE profiling to provide both a novel analysis of the multi-dimensional sheep gene expression atlas data set and a foundation for identifying the regulatory and expressed elements of the genome that are driving complex traits in livestock.

Senol Cali D, Kim JS, Ghose S, Alkan C, Mutlu O. Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions. Brief Bioinform 2019;20:1542-1559. [PMID: 29617724 PMCID: PMC6781587 DOI: 10.1093/bib/bby017] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Received: 11/20/2017] [Revised: 02/06/2018] [Indexed: 02/06/2023] Open

Abstract

Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. 
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.

Mozafari F, Babashah H, Koohi S, Kavehvash Z. Speeding up DNA sequence alignment by optical correlator. Optics & Laser Technology 2018;108:124-135. [DOI: 10.1016/j.optlastec.2018.06.027] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/19/2023]

Alser M, Hassan H, Xin H, Ergin O, Mutlu O, Alkan C. GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping. Bioinformatics 2018;33:3355-3363. [PMID: 28575161 DOI: 10.1093/bioinformatics/btx342] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 11/16/2016] [Accepted: 05/29/2017] [Indexed: 01/06/2023] Open

Abstract

Motivation

High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments -called short reads- that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and 'candidate' locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms.
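
To make the "quadratic-time dynamic programming" cost concrete, here is a minimal sketch of the classic Levenshtein edit-distance recurrence, which is one standard formulation of approximate string matching. It is not taken from GateKeeper or mrFAST; the names `edit_distance` and `aligns_within` and the `max_edits` threshold are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic O(len(a) * len(b)) table,
    kept to two rows to save memory."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def aligns_within(read: str, candidate: str, max_edits: int) -> bool:
    """True if the candidate reference segment is within max_edits of the read."""
    return edit_distance(read, candidate) <= max_edits


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))          # 3
    print(aligns_within("ACGTACGT", "ACGAACGT", 1))    # True
```

Every cell of the table is filled even when the candidate is badly mismatched, which is why a cheap filter that rejects such candidates before this step can save most of the work.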

Results

We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10.

Availability and implementation

https://github.com/BilkentCompGen/GateKeeper.

Contact

mohammedalser@bilkent.edu.tr or onur.mutlu@inf.ethz.ch or calkan@cs.bilkent.edu.tr.

Supplementary information

Supplementary data are available at Bioinformatics online.
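
The filter-then-align idea described in the GateKeeper abstract above can be illustrated with a minimal software sketch. This is not GateKeeper's FPGA design, nor the Adjacency Filter or SHD algorithm: the cheap filter here is a simple base-count lower bound on edit distance, swapped in for illustration, and the names `edit_distance`, `count_lower_bound`, `map_read`, and `max_edits` are hypothetical. Candidate windows are also assumed to have the same length as the read.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Quadratic-time dynamic programming (Levenshtein) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def count_lower_bound(a: str, b: str) -> int:
    """Cheap lower bound on edit distance from base counts alone.

    Each edit removes at most one surplus base and adds at most one
    missing base, so the true distance is never below this value.
    """
    ca, cb = Counter(a), Counter(b)
    surplus = sum((ca - cb).values())  # bases a has in excess of b
    deficit = sum((cb - ca).values())  # bases b has in excess of a
    return max(surplus, deficit)


def map_read(read, reference, candidates, max_edits):
    """Return (accepted positions, number of full alignments performed)."""
    hits, aligned = [], 0
    for pos in candidates:
        window = reference[pos:pos + len(read)]  # assumption: same length as read
        if count_lower_bound(read, window) > max_edits:
            continue  # filtered out before the costly alignment step
        aligned += 1
        if edit_distance(read, window) <= max_edits:
            hits.append(pos)
    return hits, aligned
```

Because the bound never exceeds the true edit distance, this particular filter cannot reject a location that alignment would accept; it only skips work. In the small example below, two of three candidate locations are discarded without running the dynamic programming step.

```python
reference = "ACGTACGT" + "AAAAAAAA" + "CCCCCCCC"
hits, aligned = map_read("ACGTACGA", reference, [0, 8, 16], max_edits=1)
```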


Kim JS, Senol Cali D, Xin H, Lee D, Ghose S, Alser M, Hassan H, Ergin O, Alkan C, Mutlu O. GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies. BMC Genomics 2018;19:89. [PMID: 29764378 PMCID: PMC5954284 DOI: 10.1186/s12864-018-4460-0] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open

Abstract

Background

Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments.

Results

We propose a novel seed location filtering algorithm, GRIM-Filter, optimized to exploit 3D-stacked memory systems that integrate computation within a logic layer stacked under memory layers, to perform processing-in-memory (PIM). GRIM-Filter quickly filters seed locations by 1) introducing a new representation of coarse-grained segments of the reference genome, and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for a sequence alignment error tolerance of 0.05, GRIM-Filter 1) reduces the false negative rate of filtering by 5.59x–6.41x, and 2) provides an end-to-end read mapper speedup of 1.81x–3.65x, compared to a state-of-the-art read mapper employing the best previous seed location filtering algorithm.

Conclusion

GRIM-Filter exploits 3D-stacked memory, which enables the efficient use of processing-in-memory, to overcome the memory bandwidth bottleneck in seed location filtering. We show that GRIM-Filter significantly improves the performance of a state-of-the-art read mapper. GRIM-Filter is a universal seed location filter that can be applied to any read mapper. We hope that our results provide inspiration for new works to design other bioinformatics algorithms that take advantage of emerging technologies and new processing paradigms, such as processing-in-memory using 3D-stacked memory devices.
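
As a rough software illustration of the two ideas named in the GRIM-Filter abstract above (a coarse-grained per-segment representation of the reference, and a presence check of read content against a segment), the following sketch stores one bitvector of k-mer presence per reference bin. It is a sketch under stated assumptions, not the published algorithm or its in-memory implementation: the names `token_id`, `build_bins`, `passes_filter`, and the parameters `bin_size`, `k`, and `min_tokens` are hypothetical, the read is assumed to be no longer than one bin, and the acceptance threshold is supplied by the caller rather than derived from an error tolerance.

```python
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def token_id(token: str) -> int:
    """Encode a k-mer over A/C/G/T as an integer (2 bits per base)."""
    value = 0
    for ch in token:
        value = value * 4 + ENC[ch]
    return value


def build_bins(reference: str, bin_size: int, k: int) -> list:
    """One presence bitvector (a Python int) per coarse-grained bin."""
    bitvectors = []
    for start in range(0, len(reference), bin_size):
        # Extend by k-1 bases so k-mers that begin in this bin are included.
        segment = reference[start:start + bin_size + k - 1]
        bits = 0
        for i in range(len(segment) - k + 1):
            bits |= 1 << token_id(segment[i:i + k])
        bitvectors.append(bits)
    return bitvectors


def passes_filter(read, location, bitvectors, bin_size, k, min_tokens):
    """Keep a seed location only if enough read k-mers occur near it."""
    b = location // bin_size
    bits = bitvectors[b]
    if b + 1 < len(bitvectors):
        bits |= bitvectors[b + 1]  # the read may spill into the next bin
    present = sum((bits >> token_id(read[i:i + k])) & 1
                  for i in range(len(read) - k + 1))
    return present >= min_tokens
```

A usage example on a toy reference of two 10-base bins: the read's k-mers are all present around location 0 and absent around location 10.

```python
bins = build_bins("ACGTACGTAC" + "GGGGGGGGGG", bin_size=10, k=3)
keep_0 = passes_filter("ACGTAC", 0, bins, bin_size=10, k=3, min_tokens=3)
keep_10 = passes_filter("ACGTAC", 10, bins, bin_size=10, k=3, min_tokens=3)
```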


Cheng H, Zhang Y, Xu Y. BitMapper2: a GPU-accelerated all-mapper based on the sparse q-gram index. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018;16:886-897. [PMID: 29993660 DOI: 10.1109/tcbb.2018.2822687] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]

Zhang H, Chan Y, Fan K, Schmidt B, Liu W. Fast and efficient short read mapping based on a succinct hash index. BMC Bioinformatics 2018. [PMID: 29523083 PMCID: PMC5845352 DOI: 10.1186/s12859-018-2094-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open

Yin Z, Lan H, Tan G, Lu M, Vasilakos AV, Liu W. Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges. Comput Struct Biotechnol J 2017;15:403-411. [PMID: 28883909 PMCID: PMC5581845 DOI: 10.1016/j.csbj.2017.07.004] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2017] [Revised: 06/30/2017] [Accepted: 07/28/2017] [Indexed: 12/25/2022] Open

Almutairy M, Torng E. The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome. PLoS One 2017;12:e0179046. [PMID: 28686614 PMCID: PMC5501444 DOI: 10.1371/journal.pone.0179046] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2016] [Accepted: 05/23/2017] [Indexed: 01/11/2023] Open

Canzar S, Salzberg SL. Short Read Mapping: An Algorithmic Tour. PROCEEDINGS OF THE IEEE. INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS 2017;105:436-458. [PMID: 28502990 PMCID: PMC5425171 DOI: 10.1109/jproc.2015.2455551] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]

Kockan C, Hach F, Sarrafi I, Bell RH, McConeghy B, Beja K, Haegert A, Wyatt AW, Volik SV, Chi KN, Collins CC, Sahinalp SC. SiNVICT: ultra-sensitive detection of single nucleotide variants and indels in circulating tumour DNA. Bioinformatics 2016;33:26-34. [PMID: 27531099 DOI: 10.1093/bioinformatics/btw536] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2015] [Revised: 08/09/2016] [Accepted: 08/11/2016] [Indexed: 01/05/2023] Open

Abstract

MOTIVATION

Successful development and application of precision oncology approaches require robust elucidation of the genomic landscape of a patient's cancer and, ideally, the ability to monitor therapy-induced genomic changes in the tumour in an inexpensive and minimally invasive manner. Thanks to recent advances in sequencing technologies, 'liquid biopsy', the sampling of a patient's bodily fluids such as blood and urine, is considered one of the most promising approaches to achieve this goal. In many cancer patients, and especially those with advanced metastatic disease, deep sequencing of circulating cell free DNA (cfDNA) obtained from a patient's blood yields a mixture of reads originating from the normal DNA and from multiple tumour subclones (called circulating tumour DNA or ctDNA). The ctDNA/cfDNA ratio as well as the proportion of ctDNA originating from specific tumour subclones depend on multiple factors, making comprehensive detection of mutations difficult, especially at early stages of cancer. Furthermore, sensitive and accurate detection of single nucleotide variants (SNVs) and indels from cfDNA is constrained by several factors such as the sequencing errors and PCR artifacts, and mapping errors related to repeat regions within the genome. In this article, we introduce SiNVICT, a computational method that increases the sensitivity and specificity of SNV and indel detection at very low variant allele frequencies. SiNVICT has the capability to handle multiple sequencing platforms with different error properties; it minimizes false positives resulting from mapping errors and other technology specific artifacts including strand bias and low base quality at read ends. 
SiNVICT also has the capability to perform time-series analysis, where samples from a patient sequenced at multiple time points are jointly examined to report locations of interest where there is a possibility that certain clones were wiped out by some treatment while some subclones gained selective advantage.

RESULTS

We tested SiNVICT on simulated data as well as prostate cancer cell lines and cfDNA obtained from castration-resistant prostate cancer patients. On both simulated and biological data, SiNVICT was able to detect SNVs and indels with variant allele percentages as low as 0.5%. The lowest amounts of total DNA used for the biological data where SNVs and indels could be detected with very high sensitivity were 2.5 ng on the Ion Torrent platform and 10 ng on Illumina. With increased sequencing and mapping accuracy, SiNVICT might be utilized in clinical settings, making it possible to track the progress of point mutations and indels that are associated with resistance to cancer therapies and provide patients personalized treatment. We also compared SiNVICT with other popular SNV callers such as MuTect, VarScan2 and Freebayes. Our results show that SiNVICT performs better than these tools in most cases and allows further data exploration such as time-series analysis on cfDNA sequencing data.

AVAILABILITY AND IMPLEMENTATION

SiNVICT is available at: https://sfu-compbio.github.io/sinvict

Supplementary information: Supplementary data are available at Bioinformatics online.

CONTACT

cenk@sfu.ca.


Oetjens MT, Shen F, Emery SB, Zou Z, Kidd JM. Y-Chromosome Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biol Evol 2016;8:2231-40. [PMID: 27358426 PMCID: PMC4987114 DOI: 10.1093/gbe/evw150] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open

Roy S, LaFramboise WA, Nikiforov YE, Nikiforova MN, Routbort MJ, Pfeifer J, Nagarajan R, Carter AB, Pantanowitz L. Next-Generation Sequencing Informatics: Challenges and Strategies for Implementation in a Clinical Environment. Arch Pathol Lab Med 2016;140:958-75. [PMID: 26901284 DOI: 10.5858/arpa.2015-0507-ra] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]

Xin H, Nahar S, Zhu R, Emmons J, Pekhimenko G, Kingsford C, Alkan C, Mutlu O. Optimal seed solver: optimizing seed selection in read mapping. Bioinformatics 2015;32:1632-42. [PMID: 26568624 DOI: 10.1093/bioinformatics/btv670] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2015] [Accepted: 11/09/2015] [Indexed: 11/12/2022]

Cheng H, Jiang H, Yang J, Xu Y, Shang Y. BitMapper: an efficient all-mapper based on bit-vector computing. BMC Bioinformatics 2015;16:192. [PMID: 26063651 PMCID: PMC4462005 DOI: 10.1186/s12859-015-0626-9] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2014] [Accepted: 05/22/2015] [Indexed: 11/10/2022] Open

Royer-Bertrand B, Rivolta C. Whole genome sequencing as a means to assess pathogenic mutations in medical genetics and cancer. Cell Mol Life Sci 2015;72:1463-71. [PMID: 25548800 PMCID: PMC11113357 DOI: 10.1007/s00018-014-1807-9] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2014] [Revised: 12/12/2014] [Accepted: 12/15/2014] [Indexed: 12/17/2022]

Xin H, Greth J, Emmons J, Pekhimenko G, Kingsford C, Alkan C, Mutlu O. Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping. Bioinformatics 2015;31:1553-60. [PMID: 25577434 DOI: 10.1093/bioinformatics/btu856] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2014] [Accepted: 12/23/2014] [Indexed: 11/13/2022]

Affiliation(s)

Hongyi Xin, John Greth, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, Onur Mutlu: Computer Science Department, Department of Electrical and Computer Engineering, Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA and Department of Computer Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey


Lee D, Hormozdiari F, Xin H, Hach F, Mutlu O, Alkan C. Fast and accurate mapping of Complete Genomics reads. Methods 2014;79-80:3-10. [PMID: 25461772 DOI: 10.1016/j.ymeth.2014.10.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2014] [Revised: 10/01/2014] [Accepted: 10/13/2014] [Indexed: 12/31/2022] Open

Sequence alignment tools: one parallel pattern to rule them all? BIOMED RESEARCH INTERNATIONAL 2014;2014:539410. [PMID: 25147803 PMCID: PMC4131566 DOI: 10.1155/2014/539410] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/07/2014] [Revised: 06/03/2014] [Accepted: 06/21/2014] [Indexed: 11/17/2022]

Hach F, Sarrafi I, Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC. mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic Acids Res 2014;42:W494-500. [PMID: 24810850 PMCID: PMC4086126 DOI: 10.1093/nar/gku370] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open

Abstract

High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the ‘best’ mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to reference genome while discounting the mismatches that occur at common SNP locations provided by db-SNP; this significantly increases the number of reads that can be mapped to the reference genome. Notice that all of the above features are implemented within the index structure and are not simple post-processing steps and thus are performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. 
In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net.


Gonzalez-Garay ML. The road from next-generation sequencing to personalized medicine. Per Med 2014;11:523-544. [PMID: 26000024 PMCID: PMC4437232 DOI: 10.2217/pme.14.34] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]