1
Yu C, Zhao Y, Zhao C, Jin J, Mao K, Wang G. MiniDBG: A Novel and Minimal De Bruijn Graph for Read Mapping. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:129-142. [PMID: 38060353 DOI: 10.1109/tcbb.2023.3340251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/07/2024]
Abstract
The De Bruijn graph (DBG) has been widely used in algorithms for indexing or organizing read and reference sequences in bioinformatics. However, a DBG model that can locate each node, edge and path on the sequence has not been proposed so far. Recently, the DBG has been used to represent reference sequences in read mapping tasks. In this setting, the paths of the DBG and the substrings of the reference sequence are not in one-to-one correspondence. This gives rise to false paths on the DBG, that is, paths that no substring of the reference produces. Moreover, if a candidate path of a read is true, we need to locate it and verify the candidate on the sequence. To solve these problems, we proposed a DBG model, called MiniDBG, which stores the position lists of a minimal set of edges. With the position lists, MiniDBG can locate any node, edge and path efficiently. We also proposed algorithms for generating MiniDBG from an original DBG and algorithms for locating edges or paths on the sequence. We designed and ran experiments on real datasets to compare them with BWT-based and position list-based methods. The experimental results show that MiniDBG can locate edges and paths efficiently with lower memory costs.
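The false-path problem mentioned in this abstract can be illustrated with a minimal sketch. It is not the MiniDBG algorithm: the function names `build_dbg` and `is_path` and the toy reference are assumptions made only for illustration, and verification is done here by a plain substring check rather than by position lists.

```python
from collections import defaultdict

def build_dbg(reference: str, k: int) -> dict:
    """Map each (k-1)-mer node to the set of successor nodes seen in the reference."""
    graph = defaultdict(set)
    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])  # edge: prefix node -> suffix node
    return graph

def is_path(graph: dict, query: str, k: int) -> bool:
    """True if every k-mer of the query is an edge of the graph."""
    return all(
        query[i + 1:i + k] in graph.get(query[i:i + k - 1], ())
        for i in range(len(query) - k + 1)
    )

reference = "TACGACT"   # hypothetical toy reference
graph = build_dbg(reference, k=3)

# "TACT" walks the edges TA->AC and AC->CT, so it is a path on the graph,
# yet it never occurs in the reference: a false path.
print(is_path(graph, "TACT", 3))   # True
print("TACT" in reference)         # False
```

The node `AC` occurs twice in the toy reference with different successors, which is what lets a walk switch between the two occurrences and spell a string the reference does not contain.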
2
Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023; 24:bbad320. [PMID: 37668049 DOI: 10.1093/bib/bbad320] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/16/2023] [Revised: 08/16/2023] [Accepted: 08/18/2023] [Indexed: 09/06/2023]
Abstract
The Sequence Alignment/Map (SAM) format file is the text file used to record alignment information. Alignment is the core of sequencing analysis, and downstream tasks accept mapping results for further processing. Given the rapid development of the sequencing industry today, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper is devoted to retrieving knowledge in the broad field of SAM. First, the format of SAM is introduced to understand the overall process of the sequencing analysis. Then, existing work is systematically classified in accordance with generation, compression and application, and the involved SAM tools are specifically mined. Lastly, a summary and some thoughts on future directions are provided.
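As a small illustration of the format this review covers, the sketch below splits one SAM alignment line into its 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The names `SamRecord`, `parse_sam_line` and `is_unmapped` are hypothetical helpers, not part of any tool discussed in the paper, and optional tags are ignored.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str

def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one alignment line (optional tags are dropped)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)

def is_unmapped(record: SamRecord) -> bool:
    """Bit 0x4 of FLAG marks a read that has no alignment."""
    return bool(record.flag & 0x4)

line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, is_unmapped(record))
```

Header lines, which start with `@`, would need to be skipped before calling the parser.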
Affiliation(s)
- Yuansheng Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangzhen Shen
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Yongshun Gong
  - School of Software, Shandong University, 250100, Jinan, China
- Yiping Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Bosheng Song
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangxiang Zeng
  - College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
3
Firtina C, Park J, Alser M, Kim JS, Cali D, Shahroodi T, Ghiasi N, Singh G, Kanellopoulos K, Alkan C, Mutlu O. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023; 5:lqad004. [PMID: 36685727 PMCID: PMC9853099 DOI: 10.1093/nargab/lqad004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 12/16/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023]
Abstract
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, that can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher quality overlaps leading to accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
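The SimHash idea named in this abstract can be sketched generically as follows. This is a textbook SimHash over the set of k-mers of a seed, not BLEND's implementation: the helper names, the 32-bit width, the use of BLAKE2b as the per-item hash and the choice of k-mers as set items are all assumptions for illustration.

```python
import hashlib

def item_hash(item: str, bits: int = 32) -> int:
    """Deterministic per-item hash, truncated to `bits` bits (bits <= 64)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)

def simhash(items, bits: int = 32) -> int:
    """Each item votes +1/-1 on every bit position; the sign of the tally sets the output bit."""
    votes = [0] * bits
    for item in items:
        h = item_hash(item, bits)
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def kmers(seq: str, k: int) -> set:
    """The set of k-mers of a sequence, used here as the items of a seed."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")

seed_a = "ACGTACGTTAGCCGATAGGCTA"
seed_b = "ACGTACGTTAGCCGTTAGGCTA"   # one substitution relative to seed_a
ha, hb = simhash(kmers(seed_a, 8)), simhash(kmers(seed_b, 8))
print(hamming(ha, hb))   # small when most items are shared, 0 when the hashes collide
```

Because every output bit is a majority vote over the items, two sets that share most of their items tend to agree on most bits, and a narrower hash makes an exact collision between similar seeds more likely.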
Affiliation(s)
- Jisung Park
  - ETH Zurich, Zurich 8092, Switzerland
  - POSTECH, Pohang 37673, Republic of Korea
- Can Alkan
  - Bilkent University, Ankara 06800, Turkey
4
Gudur VY, Maheshwari S, Acharyya A, Shafik R. An FPGA Based Energy-Efficient Read Mapper With Parallel Filtering and In-Situ Verification. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2697-2711. [PMID: 34415836 DOI: 10.1109/tcbb.2021.3106311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
In the assembly pipeline of Whole Genome Sequencing (WGS), read mapping is a widely used method to re-assemble the genome. It employs approximate string matching and dynamic programming-based algorithms on a large volume of data and associated structures, making it a computationally intensive process. Currently, the state-of-the-art data centers for genome sequencing incur substantial setup and energy costs for maintaining hardware, data storage and cooling systems. To enable low-cost genomics, we propose an energy-efficient architectural methodology for read mapping using a single system-on-chip (SoC) platform. The proposed methodology is based on the q-gram lemma and designed using a novel architecture for filtering and verification. The filtering algorithm is designed using a parallel sorted q-gram lemma based method for the first time, and it is complemented by an in-situ verification routine using parallel Myers bit-vector algorithm. We have implemented our design on the Zynq Ultrascale+ XCZU9EG MPSoC platform. It is then extensively validated using real genomic data to demonstrate up to 7.8× energy reduction and up to 13.3× less resource utilization when compared with the state-of-the-art software and hardware approaches.
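The q-gram lemma that this design builds on states that a pattern of length m and an occurrence with at most e edit errors share at least m - q + 1 - q·e q-grams. A plain software sketch of a filter based on that bound is given below; it is not the paper's parallel sorted hardware method, and the function names are hypothetical.

```python
from collections import Counter

def qgram_threshold(m: int, q: int, e: int) -> int:
    """Minimum number of shared q-grams implied by the q-gram lemma."""
    return m - q + 1 - q * e

def shared_qgrams(a: str, b: str, q: int) -> int:
    """Size of the multiset intersection of the q-grams of a and b."""
    ca = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    cb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum((ca & cb).values())

def passes_filter(read: str, window: str, q: int, e: int) -> bool:
    """Keep a candidate window only if it could match the read within e errors."""
    return shared_qgrams(read, window, q) >= qgram_threshold(len(read), q, e)

read = "ACGTACGTAC"
print(qgram_threshold(len(read), q=3, e=1))            # 5
print(passes_filter(read, "ACGTTCGTAC", q=3, e=1))     # True: one substitution
print(passes_filter(read, "TTTTTTTTTT", q=3, e=1))     # False: rejected before verification
```

Windows that pass the filter still need verification by an alignment routine, since the lemma gives a necessary condition only.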
5
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. [PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023]
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Affiliation(s)
- Mohammed Alser
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Jeremy Rotman
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Dhrithi Deshpande
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
- Kodi Taraszka
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Huwenbo Shi
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Pelin Icer Baykal
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Harry Taegyun Yang
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
  - Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Victor Xue
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Benjamin D Singer
  - Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
  - Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
  - Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Brunilda Balliu
  - Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
- David Koslicki
  - Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
  - Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
  - The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
  - The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
- Can Alkan
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
- Onur Mutlu
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
6
Abstract
For DNA sequence analysis, we are facing challenging tasks such as the identification of structural variants, sequencing repetitive regions, and phasing of alleles. These tasks suffer from the short length of sequencing reads, where each read may cover fewer than two single nucleotide polymorphisms (SNPs), or fewer than two occurrences of a repeated region. It is believed that long reads can help to solve these tasks. In this study, we have designed new algorithms for mapping long reads to reference genomes. We have also designed efficient and effective heuristic algorithms for local alignments of long reads against the corresponding segments of the reference genome. To design the new mapping algorithm, we formulate the problem as the longest common subsequence with distance constraints. The local alignment heuristic algorithm is based on the idea of recursive alignment of k-mers, where the size of k differs in each round. We have implemented all the algorithms in C++ and produced a software package named mapAlign. Experiments on real data sets showed that the newly proposed approach can generate better alignments in terms of both identity and alignment scores for both Nanopore and single molecule real time sequencing (SMRT) data sets. For human individuals of both Nanopore and SMRT data sets, the new method can successfully match/align 91.53% and 85.36% of letters from reads to identical letters on reference genomes, respectively. In comparison, the best known method can only align 88.44% and 79.08% of the letters of reads for Nanopore and SMRT data sets, respectively. Our method is also faster than the best known method.
Affiliation(s)
- Wen Yang
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
- Lusheng Wang
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
  - City University of Hong Kong Shenzhen Research Institution, Shenzhen, China
7
Comparison of High-Throughput Sequencing for Phage Display Peptide Screening on Two Commercially Available Platforms. Int J Pept Res Ther 2020. [DOI: 10.1007/s10989-019-09858-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]
8
Frints SGM, Ozanturk A, Rodríguez Criado G, Grasshoff U, de Hoon B, Field M, Manouvrier-Hanu S, Hickey SE, Kammoun M, Gripp KW, Bauer C, Schroeder C, Toutain A, Mihalic Mosher T, Kelly BJ, White P, Dufke A, Rentmeester E, Moon S, Koboldt DC, van Roozendaal KEP, Hu H, Haas SA, Ropers HH, Murray L, Haan E, Shaw M, Carroll R, Friend K, Liebelt J, Hobson L, De Rademaeker M, Geraedts J, Fryns JP, Vermeesch J, Raynaud M, Riess O, Gribnau J, Katsanis N, Devriendt K, Bauer P, Gecz J, Golzio C, Gontan C, Kalscheuer VM. Pathogenic variants in E3 ubiquitin ligase RLIM/RNF12 lead to a syndromic X-linked intellectual disability and behavior disorder. Mol Psychiatry 2019; 24:1748-1768. [PMID: 29728705 DOI: 10.1038/s41380-018-0065-x] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 11/23/2017] [Accepted: 02/28/2018] [Indexed: 12/25/2022]
Abstract
RLIM, also known as RNF12, is an X-linked E3 ubiquitin ligase that acts as a negative regulator of LIM-domain containing transcription factors and participates in X-chromosome inactivation (XCI) in mice. We report the genetic and clinical findings of 84 individuals from nine unrelated families, eight of which have pathogenic variants in RLIM (RING finger LIM domain-interacting protein). A total of 40 affected males have X-linked intellectual disability (XLID) and variable behavioral anomalies with or without congenital malformations. In contrast, 44 heterozygous female carriers have normal cognition and behavior, but eight showed mild physical features. All RLIM variants identified are missense changes co-segregating with the phenotype and predicted to affect protein function. Eight of the nine altered amino acids are conserved and lie either within a domain essential for binding interacting proteins or in the C-terminal RING finger catalytic domain. In vitro experiments revealed that these amino acid changes in the RLIM RING finger impaired RLIM ubiquitin ligase activity. In vivo experiments in rlim mutant zebrafish showed that wild type RLIM rescued the zebrafish rlim phenotype, whereas the patient-specific missense RLIM variants failed to rescue the phenotype and thus represent likely severe loss-of-function mutations. In summary, we identified a spectrum of RLIM missense variants causing syndromic XLID and affecting the ubiquitin ligase activity of RLIM, suggesting that enzymatic activity of RLIM is required for normal development, cognition and behavior.
Affiliation(s)
- Suzanna G M Frints
  - Department of Clinical Genetics, Maastricht University Medical Center+, azM, Maastricht, 6202 AZ, The Netherlands
  - Department of Genetics and Cell Biology, School for Oncology and Developmental Biology, GROW, FHML, Maastricht University, Maastricht, 6200 MD, The Netherlands
- Aysegul Ozanturk
  - Center for Human Disease Modeling and Departments of Pediatrics and Psychiatry, Duke University, Durham, NC, 27710, USA
- Ute Grasshoff
  - Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, 72076, Germany
- Bas de Hoon
  - Department of Developmental Biology, Erasmus University Medical Center, Rotterdam, 3015 CN, Rotterdam, The Netherlands
  - Department of Gynaecology and Obstetrics, Erasmus University Medical Center, Rotterdam, 3015 CN, The Netherlands
- Michael Field
  - GOLD (Genetics of Learning and Disability) Service, Hunter Genetics, Waratah, NSW, 2298, Australia
- Sylvie Manouvrier-Hanu
  - Clinique de Génétique médicale Guy Fontaine, Centre de référence maladies rares Anomalies du développement Hôpital Jeanne de Flandre, Lille, 59000, France
  - EA 7364 RADEME Maladies Rares du Développement et du Métabolisme, Faculté de Médecine, Université de Lille, Lille, 59000, France
- Scott E Hickey
  - Division of Molecular & Human Genetics, Nationwide Children's Hospital, Columbus, OH, 43205, USA
  - Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, 43205, USA
- Molka Kammoun
  - Center for Human Genetics, University Hospitals Leuven, Leuven, 3000, Belgium
- Karen W Gripp
  - Alfred I. duPont Hospital for Children Nemours, Wilmington, DE, 19803, USA
- Claudia Bauer
  - Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, 72076, Germany
- Christopher Schroeder
  - Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, 72076, Germany
- Annick Toutain
  - Service de Génétique, Hôpital Bretonneau, CHU de Tours, Tours, 37044, France
  - UMR 1253, iBrain, Université de Tours, Inserm, Tours, 37032, France
- Theresa Mihalic Mosher
  - Division of Molecular & Human Genetics, Nationwide Children's Hospital, Columbus, OH, 43205, USA
  - Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, 43205, USA
  - The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, 43205, USA
- Benjamin J Kelly
  - The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, 43205, USA
- Peter White
  - Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, 43205, USA
  - The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, 43205, USA
- Andreas Dufke
  - Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, 72076, Germany
- Eveline Rentmeester
  - Department of Developmental Biology, Erasmus University Medical Center, Rotterdam, 3015 CN, Rotterdam, The Netherlands
- Sungjin Moon
  - Center for Human Disease Modeling and Departments of Pediatrics and Psychiatry, Duke University, Durham, NC, 27710, USA
- Daniel C Koboldt
  - Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, 43205, USA
  - The Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, 43205, USA
- Kees E P van Roozendaal
  - Department of Clinical Genetics, Maastricht University Medical Center+, azM, Maastricht, 6202 AZ, The Netherlands
  - Department of Genetics and Cell Biology, School for Oncology and Developmental Biology, GROW, FHML, Maastricht University, Maastricht, 6200 MD, The Netherlands
- Hao Hu
  - Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, 14195, Germany
- Stefan A Haas
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, 14195, Germany
- Hans-Hilger Ropers
  - Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, 14195, Germany
- Lucinda Murray
  - GOLD (Genetics of Learning and Disability) Service, Hunter Genetics, Waratah, NSW, 2298, Australia
- Eric Haan
  - Adelaide Medical School and Robinson Research Institute, The University of Adelaide, Adelaide, SA, 5000, Australia
  - South Australian Clinical Genetics Service, SA Pathology (at Women's and Children's Hospital), North Adelaide, SA, 5006, Australia
- Marie Shaw
  - Adelaide Medical School and Robinson Research Institute, The University of Adelaide, Adelaide, SA, 5000, Australia
- Renee Carroll
  - Adelaide Medical School and Robinson Research Institute, The University of Adelaide, Adelaide, SA, 5000, Australia
- Kathryn Friend
  - Genetics and Molecular Pathology, SA Pathology, Adelaide, SA, 5006, Australia
- Jan Liebelt
  - South Australian Clinical Genetics Service, SA Pathology (at Women's and Children's Hospital), North Adelaide, SA, 5006, Australia
- Lynne Hobson
  - Genetics and Molecular Pathology, SA Pathology, Adelaide, SA, 5006, Australia
- Marjan De Rademaeker
  - Centre for Medical Genetics, Reproduction and Genetics, Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel (VUB), UZ Brussel, 1090, Brussels, Belgium
- Joep Geraedts
  - Department of Clinical Genetics, Maastricht University Medical Center+, azM, Maastricht, 6202 AZ, The Netherlands
  - Department of Genetics and Cell Biology, School for Oncology and Developmental Biology, GROW, FHML, Maastricht University, Maastricht, 6200 MD, The Netherlands
- Jean-Pierre Fryns
  - Center for Human Genetics, University Hospitals Leuven, Leuven, 3000, Belgium
- Joris Vermeesch
  - Center for Human Genetics, University Hospitals Leuven, Leuven, 3000, Belgium
- Martine Raynaud
  - Service de Génétique, Hôpital Bretonneau, CHU de Tours, Tours, 37044, France
  - UMR 1253, iBrain, Université de Tours, Inserm, Tours, 37032, France
- Olaf Riess
  - Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, 72076, Germany
- Joost Gribnau
  - Department of Developmental Biology, Erasmus University Medical Center, Rotterdam, 3015 CN, Rotterdam, The Netherlands
- Nicholas Katsanis
  - Center for Human Disease Modeling and Departments of Pediatrics and Psychiatry, Duke University, Durham, NC, 27710, USA
- Koen Devriendt
  - Center for Human Genetics, University Hospitals Leuven, Leuven, 3000, Belgium
- Peter Bauer
  - Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, 72076, Germany
- Jozef Gecz
  - Adelaide Medical School and Robinson Research Institute, The University of Adelaide, Adelaide, SA, 5000, Australia
  - South Australian Health and Medical Research Institute, Adelaide, SA, 5000, Australia
- Christelle Golzio
  - Center for Human Disease Modeling and Departments of Pediatrics and Psychiatry, Duke University, Durham, NC, 27710, USA
  - Institut de Génétique et de Biologie Moléculaire et Cellulaire, Department of Translational Medicine and Neurogenetics; Centre National de la Recherche Scientifique, UMR7104; Institut National de la Santé et de la Recherche Médicale, U964, Université de Strasbourg, 67400, Illkirch, France
- Cristina Gontan
  - Department of Developmental Biology, Erasmus University Medical Center, Rotterdam, 3015 CN, Rotterdam, The Netherlands
- Vera M Kalscheuer
  - Research Group Development and Disease, Max Planck Institute for Molecular Genetics, Berlin, 14195, Germany
9
Senol Cali D, Kim JS, Ghose S, Alkan C, Mutlu O. Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions. Brief Bioinform 2019; 20:1542-1559. [PMID: 29617724 PMCID: PMC6781587 DOI: 10.1093/bib/bby017] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Received: 11/20/2017] [Revised: 02/06/2018] [Indexed: 02/06/2023]
Abstract
Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. 
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.
Affiliation(s)
- Damla Senol Cali
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Jeremie S Kim
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Computer Science, Systems Group, ETH Zürich, Zürich, Switzerland
- Saugata Ghose
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Can Alkan
  - Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
- Onur Mutlu
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Computer Science, Systems Group, ETH Zürich, Zürich, Switzerland
10
Mozafari F, Babashah H, Koohi S, Kavehvash Z. Speeding up DNA sequence alignment by optical correlator. Optics & Laser Technology 2018; 108:124-135. [DOI: 10.1016/j.optlastec.2018.06.027] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/19/2023]
11
Alser M, Hassan H, Xin H, Ergin O, Mutlu O, Alkan C. GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping. Bioinformatics 2018; 33:3355-3363. [PMID: 28575161 DOI: 10.1093/bioinformatics/btx342] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 11/16/2016] [Accepted: 05/29/2017] [Indexed: 01/06/2023]
Abstract
Motivation: High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments, called short reads, that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and 'candidate' locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms.
Results: We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10.
Availability and implementation: https://github.com/BilkentCompGen/GateKeeper.
Contact: mohammedalser@bilkent.edu.tr or onur.mutlu@inf.ethz.ch or calkan@cs.bilkent.edu.tr.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mohammed Alser
  - Department of Computer Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey
- Hasan Hassan
  - TOBB University of Economics & Technology, Sogutozu, Ankara, Turkey
  - Department of Computer Science, ETH Zürich, 8092 Zürich, Switzerland
- Hongyi Xin
  - Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Oguz Ergin
  - TOBB University of Economics & Technology, Sogutozu, Ankara, Turkey
- Onur Mutlu
  - Department of Computer Science, ETH Zürich, 8092 Zürich, Switzerland
- Can Alkan
  - Department of Computer Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey
|
12
|
Kim JS, Senol Cali D, Xin H, Lee D, Ghose S, Alser M, Hassan H, Ergin O, Alkan C, Mutlu O. GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies. BMC Genomics 2018; 19:89. [PMID: 29764378 PMCID: PMC5954284 DOI: 10.1186/s12864-018-4460-0] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Indexed: 01/01/2023] Open
Abstract
Background: Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments.
Results: We propose a novel seed location filtering algorithm, GRIM-Filter, optimized to exploit 3D-stacked memory systems that integrate computation within a logic layer stacked under memory layers, to perform processing-in-memory (PIM). GRIM-Filter quickly filters seed locations by 1) introducing a new representation of coarse-grained segments of the reference genome, and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for a sequence alignment error tolerance of 0.05, GRIM-Filter 1) reduces the false negative rate of filtering by 5.59x–6.41x, and 2) provides an end-to-end read mapper speedup of 1.81x–3.65x, compared to a state-of-the-art read mapper employing the best previous seed location filtering algorithm.
Conclusion: GRIM-Filter exploits 3D-stacked memory, which enables the efficient use of processing-in-memory, to overcome the memory bandwidth bottleneck in seed location filtering. We show that GRIM-Filter significantly improves the performance of a state-of-the-art read mapper. GRIM-Filter is a universal seed location filter that can be applied to any read mapper. We hope that our results provide inspiration for new works to design other bioinformatics algorithms that take advantage of emerging technologies and new processing paradigms, such as processing-in-memory using 3D-stacked memory devices.
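The idea of testing read presence against coarse-grained reference segments can be sketched in plain Python. This is only a loose software analogy: the abstract does not specify the representation, so the per-segment k-mer sets, the hit threshold, and the names `build_bins` and `bin_may_contain` are all assumptions made for illustration, and nothing here models the 3D-stacked memory hardware.

```python
def build_bins(reference: str, bin_size: int, k: int) -> list[set[str]]:
    """Record which k-mers occur in each coarse-grained segment (bin)."""
    bins = []
    for start in range(0, len(reference), bin_size):
        # Extend by k-1 bases so k-mers that straddle a bin boundary are kept.
        segment = reference[start:start + bin_size + k - 1]
        bins.append({segment[i:i + k] for i in range(len(segment) - k + 1)})
    return bins


def bin_may_contain(read: str, kmers: set[str], k: int, min_hits: int) -> bool:
    """Keep a seed location only if enough of the read's k-mers occur in the bin."""
    hits = sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in kmers)
    return hits >= min_hits


if __name__ == "__main__":
    bins = build_bins("AAAACCCCGGGGTTTT", bin_size=8, k=3)
    read = "ACCCC"
    # Bins that fail the test would be discarded before alignment.
    print([bin_may_contain(read, b, k=3, min_hits=2) for b in bins])
```

In a real filter the threshold would be derived from the read length and the error tolerance; here `min_hits` is simply passed in.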
Affiliation(s)
- Jeremie S Kim
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA; Department of Computer Science, ETH Zürich, Zürich, CH, Switzerland
| | - Damla Senol Cali
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Hongyi Xin
- Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | | | - Saugata Ghose
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Mohammed Alser
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
| | - Hasan Hassan
- Department of Computer Science, ETH Zürich, Zürich, CH, Switzerland
| | - Oguz Ergin
- Department of Computer Engineering, TOBB University of Economics and Technology, Sogutozu, Ankara, Turkey
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
| | - Onur Mutlu
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA; Department of Computer Science, ETH Zürich, Zürich, CH, Switzerland
|
13
|
Almutairy M, Torng E. Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches. PLoS One 2018; 13:e0189960. [PMID: 29389989 PMCID: PMC5794061 DOI: 10.1371/journal.pone.0189960] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 05/08/2017] [Accepted: 12/05/2017] [Indexed: 01/20/2023] Open
Abstract
Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fixed sampling and minimizer sampling. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. We systematically compare fixed and minimizer sampling using the human genome as our database. We use the resulting k-mer indexes for fixed sampling and minimizer sampling to find all maximal exact matches between our database, the human genome, and three separate query sets, the mouse genome, the chimp genome, and an NGS data set. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries only slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. The reason is that although minimizer sampling is able to sample query k-mers, the number of shared k-mer occurrences that must be processed is much larger for minimizer sampling than fixed sampling. In conclusion, we argue that for any application where each shared k-mer occurrence must be processed, fixed sampling is the right sampling method.
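The two sampling schemes compared above can be illustrated with a minimal Python sketch. It assumes lexicographic order for choosing minimizers and a window of `w` consecutive k-mers; the function names and parameters are hypothetical and are not taken from the paper's implementation.

```python
def fixed_sample(seq: str, k: int, step: int) -> list[tuple[int, str]]:
    """Fixed sampling: keep the k-mer starting at every `step`-th position."""
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]


def minimizer_sample(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Minimizer sampling: keep the smallest k-mer in each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        # Ties are broken by taking the leftmost occurrence.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        chosen.add(best)
    return [(i, kmers[i]) for i in sorted(chosen)]


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    print(fixed_sample(seq, k=3, step=2))
    print(minimizer_sample(seq, k=3, w=2))
```

On this toy sequence, with the step equal to the window size, fixed sampling keeps four k-mer occurrences and minimizer sampling keeps six, which is in line with the abstract's remark that fixed sampling tends to give the smaller index. Unlike fixed sampling, the minimizer rule depends only on sequence content, so the same rule can also be applied to query k-mers.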
Affiliation(s)
- Meznah Almutairy
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
- Department of Computer Science, College of Computer and Information Sciences, Imam Muhammad ibn Saud Islamic University, Riyadh, Saudi Arabia
| | - Eric Torng
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
|
14
|
Kinghorn AB, Fraser LA, Liang S, Shiu SCC, Tanner JA. Aptamer Bioinformatics. Int J Mol Sci 2017; 18:E2516. [PMID: 29186809 PMCID: PMC5751119 DOI: 10.3390/ijms18122516] [Citation(s) in RCA: 89] [Impact Index Per Article: 12.7] [Received: 10/31/2017] [Revised: 11/17/2017] [Accepted: 11/20/2017] [Indexed: 02/07/2023] Open
Abstract
Aptamers are short nucleic acid sequences capable of specific, high-affinity molecular binding. They are isolated via SELEX (Systematic Evolution of Ligands by Exponential Enrichment), an evolutionary process that involves iterative rounds of selection and amplification before sequencing and aptamer characterization. As aptamers are genetic in nature, bioinformatic approaches have been used to improve both aptamers and their selection. This review will discuss the advancements made in several enclaves of aptamer bioinformatics, including simulation of aptamer selection, fragment-based aptamer design, patterning of libraries, identification of lead aptamers from high-throughput sequencing (HTS) data and in silico aptamer optimization.
Affiliation(s)
| | | | | | | | - Julian A. Tanner
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
|
15
|
Reinert K, Dadi TH, Ehrhardt M, Hauswedell H, Mehringer S, Rahn R, Kim J, Pockrandt C, Winkler J, Siragusa E, Urgese G, Weese D. The SeqAn C++ template library for efficient sequence analysis: A resource for programmers. J Biotechnol 2017; 261:157-168. [PMID: 28888961 DOI: 10.1016/j.jbiotec.2017.07.017] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Received: 02/26/2017] [Revised: 07/17/2017] [Accepted: 07/19/2017] [Indexed: 11/27/2022]
Abstract
Background: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome (Venter et al., 2001) would not have been possible without advanced assembly algorithms, and the development of practical BWT-based read mappers has been instrumental for NGS analysis. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there was a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. We previously addressed this by introducing the SeqAn library of efficient data types and algorithms in 2008 (Döring et al., 2008).
Results: The SeqAn library has matured considerably since its first publication 9 years ago. In this article we review its status as an established resource for programmers in the field of sequence analysis and its contributions to many analysis tools.
Conclusions: We anticipate that SeqAn will continue to be a valuable resource, especially since it started to actively support various hardware acceleration techniques in a systematic manner.
Affiliation(s)
- Knut Reinert
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany.
| | - Temesgen Hailemariam Dadi
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - Marcel Ehrhardt
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - Hannes Hauswedell
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - Svenja Mehringer
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - René Rahn
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - Jongkyu Kim
- Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
| | - Christopher Pockrandt
- Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
| | - Jörg Winkler
- Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
| | | | - Gianvito Urgese
- Department of Control and Computer Engineering, Politecnico di Torino, Italy
|
16
|
Tsai MH, Liu YY, Soo VW. PathoBacTyper: A Web Server for Pathogenic Bacteria Identification and Molecular Genotyping. Front Microbiol 2017; 8:1474. [PMID: 28824598 PMCID: PMC5540972 DOI: 10.3389/fmicb.2017.01474] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 05/01/2017] [Accepted: 07/20/2017] [Indexed: 11/13/2022] Open
Abstract
With the decline in the cost of whole-genome sequencing because of the introduction of next-generation sequencing (NGS) techniques, many public health and clinical laboratories have started to use bacterial whole genomes for epidemiological surveillance and clinical investigation. For epidemiological and clinical purposes in this "NGS era," whole-genome-scale single nucleotide polymorphism (wgSNP) analysis for genotyping is considered suitable. In this paper, we present an online service, PathoBacTyper (http://halst.nhri.org.tw/PathoBacTyper/), for pathogenic bacteria identification and genotyping based on wgSNP analysis. More than 400 pathogenic bacteria can be identified and genotyped through this service. Four data sets containing 59 Salmonella Heidelberg isolates from three outbreaks with the same pulsed-field gel electrophoresis pattern, 34 Salmonella Typhimurium isolates from six outbreaks, 103 isolates of hospital-associated vancomycin-resistant Enterococcus faecium and 15 Legionella pneumophila isolates from clinical and environmental samples in Israel were used for demonstrating the operation and testing the performance of the PathoBacTyper service. The test results reveal the applicability of this service for epidemiological typing and clinical investigation.
Affiliation(s)
- Ming-Hsin Tsai
- Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan; Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
| | - Yen-Yi Liu
- Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan
| | - Von-Wun Soo
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
|
17
|
Almutairy M, Torng E. The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome. PLoS One 2017; 12:e0179046. [PMID: 28686614 PMCID: PMC5501444 DOI: 10.1371/journal.pone.0179046] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/18/2016] [Accepted: 05/23/2017] [Indexed: 01/11/2023] Open
Abstract
One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index such as BLAST. A big problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.
Affiliation(s)
- Meznah Almutairy
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
| | - Eric Torng
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
|
18
|
Canzar S, Salzberg SL. Short Read Mapping: An Algorithmic Tour. Proceedings of the IEEE. Institute of Electrical and Electronics Engineers 2017; 105:436-458. [PMID: 28502990 PMCID: PMC5425171 DOI: 10.1109/jproc.2015.2455551] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Indexed: 06/07/2023]
Abstract
Ultra-high-throughput next-generation sequencing (NGS) technology allows us to determine the sequence of nucleotides of many millions of DNA molecules in parallel. Accompanied by a dramatic reduction in cost since its introduction in 2004, NGS technology has provided a new way of addressing a wide range of biological and biomedical questions, from the study of human genetic disease to the analysis of gene expression, protein-DNA interactions, and patterns of DNA methylation. The data generated by NGS instruments comprise huge numbers of very short DNA sequences, or 'reads', that carry little information by themselves. These reads therefore have to be pieced together by well-engineered algorithms to reconstruct biologically meaningful measurements, such as the level of expression of a gene. To solve this complex, high-dimensional puzzle, reads must be mapped back to a reference genome to determine their origin. Due to sequencing errors and to genuine differences between the reference genome and the individual being sequenced, this mapping process must be tolerant of mismatches, insertions, and deletions. Although optimal alignment algorithms to solve this problem have long been available, the practical requirements of aligning hundreds of millions of short reads to the 3 billion base pair long human genome have stimulated the development of new, more efficient methods, which today are used routinely throughout the world for the analysis of NGS data.
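As a concrete instance of an alignment measure that tolerates mismatches, insertions, and deletions, the classic dynamic-programming edit distance can be sketched in Python. This is a textbook unit-cost formulation chosen for illustration; it is not claimed to be the scoring scheme of any particular read mapper, and the function name `edit_distance` is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b, in O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
```

The quadratic cost of this table is what makes running it on every candidate location impractical at the scale the abstract describes.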
|
19
|
From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022] Open
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
|
20
|
Do LAH, Wilm A, van Doorn HR, Lam HM, Sim S, Sukumaran R, Tran AT, Nguyen BH, Tran TTL, Tran QH, Vo QB, Dac NAT, Trinh HN, Nguyen TTH, Binh BTL, Le K, Nguyen MT, Thai QT, Vo TV, Ngo NQM, Dang TKH, Cao NH, Tran TV, Ho LV, Farrar J, de Jong M, Chen S, Nagarajan N, Bryant JE, Hibberd ML. Direct whole-genome deep-sequencing of human respiratory syncytial virus A and B from Vietnamese children identifies distinct patterns of inter- and intra-host evolution. J Gen Virol 2016; 96:3470-3483. [PMID: 26407694 DOI: 10.1099/jgv.0.000298] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Indexed: 01/07/2023] Open
Abstract
Human respiratory syncytial virus (RSV) is the major cause of lower respiratory tract infections in children <2 years of age. Little is known about RSV intra-host genetic diversity over the course of infection or about the immune pressures that drive RSV molecular evolution. We performed whole-genome deep-sequencing on 53 RSV-positive samples (37 RSV subgroup A and 16 RSV subgroup B) collected from the upper airways of hospitalized children in southern Vietnam over two consecutive seasons. RSV A NA1 and RSV B BA9 were the predominant genotypes found in our samples, consistent with other reports on global RSV circulation during the same period. For both RSV A and B, the M gene was the most conserved, confirming its potential as a target for novel therapeutics. The G gene was the most variable and was the only gene under detectable positive selection. Further, positively selected sites in G were found in close proximity to and in some cases overlapped with predicted glycosylation motifs, suggesting that selection on amino acid glycosylation may drive viral genetic diversity. We further identified hotspots and coldspots of intra-host genetic diversity in the RSV genome, some of which may highlight previously unknown regions of functional importance.
Affiliation(s)
- Lien Anh Ha Do
- Oxford University Clinical Research Unit, Wellcome Trust Major Overseas Program, Ho Chi Minh City, Vietnam
| | - Andreas Wilm
- Genome Institute of Singapore, Genome Building, 138672 Singapore
| | - H Rogier van Doorn
- Oxford University Clinical Research Unit, Wellcome Trust Major Overseas Program, Ho Chi Minh City, Vietnam; Nuffield Department of Clinical Medicine, University of Oxford, Oxford, UK
| | - Ha Minh Lam
- Oxford University Clinical Research Unit, Wellcome Trust Major Overseas Program, Ho Chi Minh City, Vietnam
| | - Shuzhen Sim
- Genome Institute of Singapore, Genome Building, 138672 Singapore
| | - Rashmi Sukumaran
- Genome Institute of Singapore, Genome Building, 138672 Singapore
| | - Anh Tuan Tran
- Children's Hospital 1, Ward 10, District 10, Ho Chi Minh City, Vietnam
| | - Bach Hue Nguyen
- Children's Hospital 1, Ward 10, District 10, Ho Chi Minh City, Vietnam
| | - Thi Thu Loan Tran
- Children's Hospital 2, Ben Nghe Ward, District 1, Ho Chi Minh City, Vietnam
| | - Quynh Huong Tran
- Children's Hospital 2, Ben Nghe Ward, District 1, Ho Chi Minh City, Vietnam
| | - Quoc Bao Vo
- Children's Hospital 2, Ben Nghe Ward, District 1, Ho Chi Minh City, Vietnam
| | | | - Hong Nhien Trinh
- Children's Hospital 1, Ward 10, District 10, Ho Chi Minh City, Vietnam
| | | | - Bao Tinh Le Binh
- Children's Hospital 1, Ward 10, District 10, Ho Chi Minh City, Vietnam
| | - Khanh Le
- Children's Hospital 1, Ward 10, District 10, Ho Chi Minh City, Vietnam
| | - Minh Tien Nguyen
- Children's Hospital 1, Ward 10, District 10, Ho Chi Minh City, Vietnam
| | - Quang Tung Thai
- Children's Hospital 1, Ward 10, District 10, Ho Chi Minh City, Vietnam
| | - Thanh Vu Vo
- Children's Hospital 1, Ward 10, District 10, Ho Chi Minh City, Vietnam
| | | | - Thi Kim Huyen Dang
- Children's Hospital 2, Ben Nghe Ward, District 1, Ho Chi Minh City, Vietnam
| | - Ngoc Huong Cao
- Children's Hospital 2, Ben Nghe Ward, District 1, Ho Chi Minh City, Vietnam
| | - Thu Van Tran
- Children's Hospital 2, Ben Nghe Ward, District 1, Ho Chi Minh City, Vietnam
| | - Lu Viet Ho
- Children's Hospital 2, Ben Nghe Ward, District 1, Ho Chi Minh City, Vietnam
| | - Jeremy Farrar
- Oxford University Clinical Research Unit, Wellcome Trust Major Overseas Program, Ho Chi Minh City, Vietnam
| | - Menno de Jong
- Oxford University Clinical Research Unit, Wellcome Trust Major Overseas Program, Ho Chi Minh City, Vietnam; Nuffield Department of Clinical Medicine, University of Oxford, Oxford, UK; Department of Medical Microbiology, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands
| | - Swaine Chen
- Genome Institute of Singapore, Genome Building, 138672 Singapore
| | | | - Juliet E Bryant
- Oxford University Clinical Research Unit, Wellcome Trust Major Overseas Program, Ho Chi Minh City, Vietnam; Nuffield Department of Clinical Medicine, University of Oxford, Oxford, UK
| | - Martin L Hibberd
- Genome Institute of Singapore, Genome Building, 138672 Singapore
|
21
|
Mapping and differential expression analysis from short-read RNA-Seq data in model organisms. Quantitative Biology 2016. [DOI: 10.1007/s40484-016-0060-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
|
22
|
Kalscheuer VM, James VM, Himelright ML, Long P, Oegema R, Jensen C, Bienek M, Hu H, Haas SA, Topf M, Hoogeboom AJM, Harvey K, Walikonis R, Harvey RJ. Novel Missense Mutation A789V in IQSEC2 Underlies X-Linked Intellectual Disability in the MRX78 Family. Front Mol Neurosci 2016; 8:85. [PMID: 26793055 PMCID: PMC4707274 DOI: 10.3389/fnmol.2015.00085] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 10/23/2015] [Accepted: 12/14/2015] [Indexed: 12/04/2022] Open
Abstract
Disease gene discovery in neurodevelopmental disorders, including X-linked intellectual disability (XLID) has recently been accelerated by next-generation DNA sequencing approaches. To date, more than 100 human X chromosome genes involved in neuronal signaling pathways and networks implicated in cognitive function have been identified. Despite these advances, the mutations underlying disease in a large number of XLID families remained unresolved. We report the resolution of MRX78, a large family with six affected males and seven affected females, showing X-linked inheritance. Although a previous linkage study had mapped the locus to the short arm of chromosome X (Xp11.4-p11.23), this region contained too many candidate genes to be analyzed using conventional approaches. However, our X-chromosome exome resequencing, bioinformatics analysis and inheritance testing revealed a missense mutation (c.C2366T, p.A789V) in IQSEC2, encoding a neuronal GDP-GTP exchange factor for Arf family GTPases (ArfGEF) previously implicated in XLID. Molecular modeling of IQSEC2 revealed that the A789V substitution results in the insertion of a larger side-chain into a hydrophobic pocket in the catalytic Sec7 domain of IQSEC2. The A789V change is predicted to result in numerous clashes with adjacent amino acids and disruption of local folding of the Sec7 domain. Consistent with this finding, functional assays revealed that recombinant IQSEC2(A789V) was not able to catalyze GDP-GTP exchange on Arf6 as efficiently as wild-type IQSEC2. Taken together, these results strongly suggest that the A789V mutation in IQSEC2 is the underlying cause of XLID in the MRX78 family.
Affiliation(s)
- Vera M Kalscheuer
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany; Research Group Development and Disease, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | | | - Miranda L Himelright
- Department of Physiology and Neurobiology, University of Connecticut, Storrs, CT, USA
| | - Philip Long
- Department of Pharmacology, UCL School of Pharmacy, London, UK
| | - Renske Oegema
- Department of Clinical Genetics, Erasmus MC University Medical Center Rotterdam, Rotterdam, Netherlands
| | - Corinna Jensen
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Melanie Bienek
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Hao Hu
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Stefan A Haas
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Maya Topf
- Department of Biological Sciences, Institute for Structural and Molecular Biology, Birkbeck College, London, UK
| | - A Jeannette M Hoogeboom
- Department of Clinical Genetics, Erasmus MC University Medical Center Rotterdam, Rotterdam, Netherlands
| | - Kirsten Harvey
- Department of Pharmacology, UCL School of Pharmacy, London, UK
| | - Randall Walikonis
- Department of Physiology and Neurobiology, University of Connecticut, Storrs, CT, USA
| | - Robert J Harvey
- Department of Pharmacology, UCL School of Pharmacy, London, UK
|
23
|
Hu H, Haas SA, Chelly J, Van Esch H, Raynaud M, de Brouwer APM, Weinert S, Froyen G, Frints SGM, Laumonnier F, Zemojtel T, Love MI, Richard H, Emde AK, Bienek M, Jensen C, Hambrock M, Fischer U, Langnick C, Feldkamp M, Wissink-Lindhout W, Lebrun N, Castelnau L, Rucci J, Montjean R, Dorseuil O, Billuart P, Stuhlmann T, Shaw M, Corbett MA, Gardner A, Willis-Owen S, Tan C, Friend KL, Belet S, van Roozendaal KEP, Jimenez-Pocquet M, Moizard MP, Ronce N, Sun R, O'Keeffe S, Chenna R, van Bömmel A, Göke J, Hackett A, Field M, Christie L, Boyle J, Haan E, Nelson J, Turner G, Baynam G, Gillessen-Kaesbach G, Müller U, Steinberger D, Budny B, Badura-Stronka M, Latos-Bieleńska A, Ousager LB, Wieacker P, Rodríguez Criado G, Bondeson ML, Annerén G, Dufke A, Cohen M, Van Maldergem L, Vincent-Delorme C, Echenne B, Simon-Bouy B, Kleefstra T, Willemsen M, Fryns JP, Devriendt K, Ullmann R, Vingron M, Wrogemann K, Wienker TF, Tzschach A, van Bokhoven H, Gecz J, Jentsch TJ, Chen W, Ropers HH, Kalscheuer VM. X-exome sequencing of 405 unresolved families identifies seven novel intellectual disability genes. Mol Psychiatry 2016; 21:133-48. [PMID: 25644381 PMCID: PMC5414091 DOI: 10.1038/mp.2014.193] [Citation(s) in RCA: 208] [Impact Index Per Article: 26.0] [Received: 08/04/2014] [Revised: 11/17/2014] [Accepted: 12/08/2014] [Indexed: 12/27/2022]
Abstract
X-linked intellectual disability (XLID) is a clinically and genetically heterogeneous disorder. During the past two decades in excess of 100 X-chromosome ID genes have been identified. Yet, a large number of families mapping to the X-chromosome remained unresolved suggesting that more XLID genes or loci are yet to be identified. Here, we have investigated 405 unresolved families with XLID. We employed massively parallel sequencing of all X-chromosome exons in the index males. The majority of these males were previously tested negative for copy number variations and for mutations in a subset of known XLID genes by Sanger sequencing. In total, 745 X-chromosomal genes were screened. After stringent filtering, a total of 1297 non-recurrent exonic variants remained for prioritization. Co-segregation analysis of potential clinically relevant changes revealed that 80 families (20%) carried pathogenic variants in established XLID genes. In 19 families, we detected likely causative protein truncating and missense variants in 7 novel and validated XLID genes (CLCN4, CNKSR2, FRMPD4, KLHL15, LAS1L, RLIM and USP27X) and potentially deleterious variants in 2 novel candidate XLID genes (CDK16 and TAF1). We show that the CLCN4 and CNKSR2 variants impair protein functions as indicated by electrophysiological studies and altered differentiation of cultured primary neurons from Clcn4(-/-) mice or after mRNA knock-down. The newly identified and candidate XLID proteins belong to pathways and networks with established roles in cognitive function and intellectual disability in particular. We suggest that systematic sequencing of all X-chromosomal genes in a cohort of patients with genetic evidence for X-chromosome locus involvement may resolve up to 58% of Fragile X-negative cases.
Affiliation(s)
- H Hu
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - S A Haas
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - J Chelly
- University Paris Descartes, Paris, France,Centre National de la Recherche Scientifique Unité Mixte de Recherche 8104, Institut National de la Santé et de la Recherche Médicale Unité 1016, Institut Cochin, Paris, France
| | - H Van Esch
- Center for Human Genetics, University Hospitals Leuven, Leuven, Belgium
| | - M Raynaud
- Inserm U930 ‘Imaging and Brain', Tours, France,University François-Rabelais, Tours, France,Centre Hospitalier Régional Universitaire, Service de Génétique, Tours, France
| | - A P M de Brouwer
- Department of Human Genetics, Radboud University Medical Center, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands
| | - S Weinert
- Max-Delbrück-Centrum für Molekulare Medizin, Berlin, Germany,Leibniz-Institut für Molekulare Pharmakologie, Berlin, Germany
| | - G Froyen
- Human Genome Laboratory, VIB Center for the Biology of Disease, Leuven, Belgium,Human Genome Laboratory, Department of Human Genetics, K.U. Leuven, Leuven, Belgium
| | - S G M Frints
- Department of Clinical Genetics, Maastricht University Medical Center, azM, Maastricht, The Netherlands,School for Oncology and Developmental Biology, GROW, Maastricht University, Maastricht, The Netherlands
| | - F Laumonnier
- Inserm U930 ‘Imaging and Brain', Tours, France,University François-Rabelais, Tours, France
| | - T Zemojtel
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - M I Love
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - H Richard
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - A-K Emde
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - M Bienek
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - C Jensen
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - M Hambrock
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - U Fischer
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - C Langnick
- Max-Delbrück-Centrum für Molekulare Medizin, Berlin, Germany
| | - M Feldkamp
- Max-Delbrück-Centrum für Molekulare Medizin, Berlin, Germany
| | - W Wissink-Lindhout
- Department of Human Genetics, Radboud University Medical Center, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands
| | - N Lebrun
- University Paris Descartes, Paris, France,Centre National de la Recherche Scientifique Unité Mixte de Recherche 8104, Institut National de la Santé et de la Recherche Médicale Unité 1016, Institut Cochin, Paris, France
| | - L Castelnau
- University Paris Descartes, Paris, France,Centre National de la Recherche Scientifique Unité Mixte de Recherche 8104, Institut National de la Santé et de la Recherche Médicale Unité 1016, Institut Cochin, Paris, France
| | - J Rucci
- University Paris Descartes, Paris, France,Centre National de la Recherche Scientifique Unité Mixte de Recherche 8104, Institut National de la Santé et de la Recherche Médicale Unité 1016, Institut Cochin, Paris, France
| | - R Montjean
- University Paris Descartes, Paris, France,Centre National de la Recherche Scientifique Unité Mixte de Recherche 8104, Institut National de la Santé et de la Recherche Médicale Unité 1016, Institut Cochin, Paris, France
| | - O Dorseuil
- University Paris Descartes, Paris, France,Centre National de la Recherche Scientifique Unité Mixte de Recherche 8104, Institut National de la Santé et de la Recherche Médicale Unité 1016, Institut Cochin, Paris, France
| | - P Billuart
- University Paris Descartes, Paris, France,Centre National de la Recherche Scientifique Unité Mixte de Recherche 8104, Institut National de la Santé et de la Recherche Médicale Unité 1016, Institut Cochin, Paris, France
| | - T Stuhlmann
- Max-Delbrück-Centrum für Molekulare Medizin, Berlin, Germany,Leibniz-Institut für Molekulare Pharmakologie, Berlin, Germany
| | - M Shaw
- School of Paediatrics and Reproductive Health, The University of Adelaide, Adelaide, SA, Australia,Robinson Research Institute, The University of Adelaide, Adelaide, SA, Australia
| | - M A Corbett
- School of Paediatrics and Reproductive Health, The University of Adelaide, Adelaide, SA, Australia,Robinson Research Institute, The University of Adelaide, Adelaide, SA, Australia
| | - A Gardner
- School of Paediatrics and Reproductive Health, The University of Adelaide, Adelaide, SA, Australia,Robinson Research Institute, The University of Adelaide, Adelaide, SA, Australia
| | - S Willis-Owen
- School of Paediatrics and Reproductive Health, The University of Adelaide, Adelaide, SA, Australia,National Heart and Lung Institute, Imperial College London, London, UK
| | - C Tan
- School of Paediatrics and Reproductive Health, The University of Adelaide, Adelaide, SA, Australia
| | - K L Friend
- SA Pathology, Women's and Children's Hospital, Adelaide, SA, Australia
| | - S Belet
- Human Genome Laboratory, VIB Center for the Biology of Disease, Leuven, Belgium,Human Genome Laboratory, Department of Human Genetics, K.U. Leuven, Leuven, Belgium
| | - K E P van Roozendaal
- Department of Clinical Genetics, Maastricht University Medical Center, azM, Maastricht, The Netherlands,School for Oncology and Developmental Biology, GROW, Maastricht University, Maastricht, The Netherlands
| | - M Jimenez-Pocquet
- Centre Hospitalier Régional Universitaire, Service de Génétique, Tours, France
| | - M-P Moizard
- Inserm U930 ‘Imaging and Brain', Tours, France,University François-Rabelais, Tours, France,Centre Hospitalier Régional Universitaire, Service de Génétique, Tours, France
| | - N Ronce
- Inserm U930 ‘Imaging and Brain', Tours, France,University François-Rabelais, Tours, France,Centre Hospitalier Régional Universitaire, Service de Génétique, Tours, France
| | - R Sun
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - S O'Keeffe
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - R Chenna
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - A van Bömmel
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - J Göke
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - A Hackett
- Genetics of Learning and Disability Service, Hunter Genetics, Waratah, NSW, Australia
| | - M Field
- Genetics of Learning and Disability Service, Hunter Genetics, Waratah, NSW, Australia
| | - L Christie
- Genetics of Learning and Disability Service, Hunter Genetics, Waratah, NSW, Australia
| | - J Boyle
- Genetics of Learning and Disability Service, Hunter Genetics, Waratah, NSW, Australia
| | - E Haan
- School of Paediatrics and Reproductive Health, The University of Adelaide, Adelaide, SA, Australia,SA Pathology, Women's and Children's Hospital, Adelaide, SA, Australia
| | - J Nelson
- Genetic Services of Western Australia, King Edward Memorial Hospital, Perth, WA, Australia
| | - G Turner
- Genetics of Learning and Disability Service, Hunter Genetics, Waratah, NSW, Australia
| | - G Baynam
- Genetic Services of Western Australia, King Edward Memorial Hospital, Perth, WA, Australia,School of Paediatrics and Child Health, University of Western Australia, Perth, WA, Australia,Institute for Immunology and Infectious Diseases, Murdoch University, Perth, WA, Australia,Telethon Kids Institute, Perth, WA, Australia
| | | | - U Müller
- Institut für Humangenetik, Justus-Liebig-Universität Giessen, Giessen, Germany,bio.logis Center for Human Genetics, Frankfurt a. M., Germany
| | - D Steinberger
- Institut für Humangenetik, Justus-Liebig-Universität Giessen, Giessen, Germany,bio.logis Center for Human Genetics, Frankfurt a. M., Germany
| | - B Budny
- Chair and Department of Endocrinology, Metabolism and Internal Diseases, Poznan University of Medical Sciences, Poznan, Poland
| | - M Badura-Stronka
- Chair and Department of Medical Genetics, Poznan University of Medical Sciences, Poznan, Poland
| | - A Latos-Bieleńska
- Chair and Department of Medical Genetics, Poznan University of Medical Sciences, Poznan, Poland
| | - L B Ousager
- Department of Clinical Genetics, Odense University Hospital, Odense, Denmark
| | - P Wieacker
- Institut für Humangenetik, Universitätsklinikum Münster, Muenster, Germany
| | | | - M-L Bondeson
- Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
| | - G Annerén
- Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
| | - A Dufke
- Institut für Medizinische Genetik und Angewandte Genomik, Tübingen, Germany
| | - M Cohen
- Kinderzentrum München, München, Germany
| | - L Van Maldergem
- Centre de Génétique Humaine, Université de Franche-Comté, Besançon, France
| | - C Vincent-Delorme
- Service de Génétique, Hôpital Jeanne de Flandre CHRU de Lilles, Lille, France
| | - B Echenne
- Service de Neuro-Pédiatrie, CHU Montpellier, Montpellier, France
| | - B Simon-Bouy
- Laboratoire SESEP, Centre hospitalier de Versailles, Le Chesnay, France
| | - T Kleefstra
- Department of Human Genetics, Radboud University Medical Center, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands
| | - M Willemsen
- Department of Human Genetics, Radboud University Medical Center, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands
| | - J-P Fryns
- Center for Human Genetics, University Hospitals Leuven, Leuven, Belgium
| | - K Devriendt
- Center for Human Genetics, University Hospitals Leuven, Leuven, Belgium
| | - R Ullmann
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - M Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - K Wrogemann
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany,Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, MB, Canada
| | - T F Wienker
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - A Tzschach
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - H van Bokhoven
- Department of Human Genetics, Radboud University Medical Center, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands
| | - J Gecz
- School of Paediatrics and Reproductive Health, The University of Adelaide, Adelaide, SA, Australia,Robinson Research Institute, The University of Adelaide, Adelaide, SA, Australia
| | - T J Jentsch
- Max-Delbrück-Centrum für Molekulare Medizin, Berlin, Germany,Leibniz-Institut für Molekulare Pharmakologie, Berlin, Germany
| | - W Chen
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany,Max-Delbrück-Centrum für Molekulare Medizin, Berlin, Germany
| | - H-H Ropers
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - V M Kalscheuer
- Department of Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Berlin, Germany,Max Planck Institute for Molecular Genetics, Ihnestrasse 73, Berlin 14195, Germany
| |
|
24
|
Reinert K, Langmead B, Weese D, Evers DJ. Alignment of Next-Generation Sequencing Reads. Annu Rev Genomics Hum Genet 2015; 16:133-51. [DOI: 10.1146/annurev-genom-090413-025358] [Citation(s) in RCA: 82] [Impact Index Per Article: 9.1] [Indexed: 11/09/2022]
Affiliation(s)
- Knut Reinert
- Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany
| | - Ben Langmead
- Department of Computer Science and Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21218
| | - David Weese
- Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany
| | | |
|
25
|
Lim JQ, Tennakoon C, Guan P, Sung WK. BatAlign: an incremental method for accurate alignment of sequencing reads. Nucleic Acids Res 2015; 43:e107. [PMID: 26170239 PMCID: PMC4652746 DOI: 10.1093/nar/gkv533] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 02/09/2015] [Accepted: 05/09/2015] [Indexed: 11/12/2022] Open
Abstract
Structural variations (SVs) play a crucial role in genetic diversity. However, the alignments of reads near/across SVs are made inaccurate by the presence of polymorphisms. BatAlign is an algorithm that integrated two strategies called 'Reverse-Alignment' and 'Deep-Scan' to improve the accuracy of read-alignment. In our experiments, BatAlign was able to obtain the highest F-measures in read-alignments on mismatch-aberrant, indel-aberrant, concordantly/discordantly paired and SV-spanning data sets. On real data, the alignments of BatAlign were able to recover 4.3% more PCR-validated SVs with 73.3% less callings. These suggest BatAlign to be effective in detecting SVs and other polymorphic-variants accurately using high-throughput data. BatAlign is publicly available at https://goo.gl/a6phxB.
Affiliation(s)
- Jing-Quan Lim
- Department of Computer Science, National University of Singapore, Singapore 117417; Laboratory of Cancer Epigenome, Division of Medical Sciences, National Cancer Centre Singapore, Singapore 169610
| | - Chandana Tennakoon
- Department of Computer Science, National University of Singapore, Singapore 117417; NUS Graduate School for Integrative Sciences and Engineering, (CeLS), #05-01, 28 Medical Drive, Singapore 117456; Department of Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672; UAE University, PO Box 17551, Al Ain, UAE
| | - Peiyong Guan
- Department of Computer Science, National University of Singapore, Singapore 117417
| | - Wing-Kin Sung
- Department of Computer Science, National University of Singapore, Singapore 117417; Department of Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672
| |
|
26
|
Cheng H, Jiang H, Yang J, Xu Y, Shang Y. BitMapper: an efficient all-mapper based on bit-vector computing. BMC Bioinformatics 2015; 16:192. [PMID: 26063651 PMCID: PMC4462005 DOI: 10.1186/s12859-015-0626-9] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 11/26/2014] [Accepted: 05/22/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND As next-generation sequencing (NGS) technologies produce hundreds of millions of reads every day, mapping NGS reads to a given reference genome efficiently is a tremendous computational challenge. However, existing all-mappers, which aim at finding all mapping locations of each read, are very time consuming. The majority of existing all-mappers consist of 2 main parts, filtration and verification. This work significantly reduces verification time, which is the dominant part of the running time. RESULTS An efficient all-mapper, BitMapper, is developed based on a new vectorized bit-vector algorithm, which simultaneously calculates the edit distance of one read to multiple locations in a given reference genome. Experimental results on both simulated and real data sets show that BitMapper is from several times to an order of magnitude faster than the current state-of-the-art all-mappers, while achieving higher sensitivity, i.e., better quality solutions. CONCLUSIONS We present BitMapper, which is designed to return all mapping locations of raw reads containing indels as well as mismatches. BitMapper is implemented in C under a GPL license. Binaries are freely available at http://home.ustc.edu.cn/%7Echhy.
Affiliation(s)
- Haoyu Cheng
- Key Laboratory on High Performance Computing, Hefei, Anhui 230027, P.R. China; School of Computer Science, University of Science and Technology of China, Hefei, Anhui, 230027, P.R. China.
| | - Huaipan Jiang
- Key Laboratory on High Performance Computing, Hefei, Anhui 230027, P.R. China; School of Computer Science, University of Science and Technology of China, Hefei, Anhui, 230027, P.R. China.
| | - Jiaoyun Yang
- Hefei University of Technology, Hefei, 230009, China.
| | - Yun Xu
- Key Laboratory on High Performance Computing, Hefei, Anhui 230027, P.R. China; School of Computer Science, University of Science and Technology of China, Hefei, Anhui, 230027, P.R. China.
| | - Yi Shang
- Department of Computer Science, University of Missouri-Columbia, Columbia MO, 65203, USA.
| |
|
27
|
Evaluation and application of the strand-specific protocol for next-generation sequencing. BIOMED RESEARCH INTERNATIONAL 2015; 2015:182389. [PMID: 25893191 PMCID: PMC4393923 DOI: 10.1155/2015/182389] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/11/2014] [Accepted: 02/03/2015] [Indexed: 12/02/2022]
Abstract
Next-generation sequencing (NGS) has become a powerful sequencing tool, applied in a wide range of biological studies. However, the traditional sample preparation protocol for NGS is non-strand-specific (NSS), leading to biased estimates of expression for transcripts overlapped at the antisense strand. Strand-specific (SS) protocols have recently been developed. In this study, we prepared the same RNA sample by using the SS and NSS protocols, followed by sequencing with Illumina HiSeq platform. Using real-time quantitative PCR as a standard, we first proved that the SS protocol more precisely estimates gene expressions compared with the NSS protocol, particularly for those overlapped at the antisense strand. In addition, we also showed that the sequence reads from the SS protocol are comparable with those from conventional NSS protocols in many aspects. Finally, we also mapped a fraction of sequence reads back to the antisense strand of the known genes, originally without annotated genes located. Using sequence assembly and PCR validation, we succeeded in identifying and characterizing the novel antisense genes. Our results show that the SS protocol performs more accurately than the traditional NSS protocol and can be applied in future studies.
|
28
|
Hormozdiari F, Eskin E. Memory efficient assembly of human genome. J Bioinform Comput Biol 2015; 13:1550008. [PMID: 25603998 DOI: 10.1142/s0219720015500080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The ability to detect the genetic variations between two individuals is an essential component for genetic studies. In these studies, obtaining the genome sequence of both individuals is the first step toward variation detection problem. The emergence of high-throughput sequencing (HTS) technology has made DNA sequencing practical, and is widely used by diagnosticians to increase their knowledge about the causal factor in genetic related diseases. As HTS advances, more data are generated every day than the amount that scientists can process. Genome assembly is one of the existing methods to tackle the variation detection problem. The de Bruijn graph formulation of the assembly problem is widely used in the field. Furthermore, it is the only method which can assemble any genome in linear time. However, it requires an enormous amount of memory in order to assemble any mammalian size genome. The high demands of sequencing more individuals and the urge to assemble them are the driving forces for a memory efficient assembler. In this work, we propose a novel method which builds the de Bruijn graph while consuming lower memory. Moreover, our proposed method can reduce the memory usage by 37% compared to the existing methods. In addition, we used a real data set (chromosome 17 of A/J strain) to illustrate the performance of our method.
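As a rough illustration of the de Bruijn graph formulation this abstract refers to, the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are the k-mers seen in a set of reads. It is a plain textbook construction, not the memory-efficient method the authors propose; the function name `build_de_bruijn` and the toy reads are assumptions made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return an adjacency map: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge
        # from its first k-1 characters to its last k-1 characters.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads, not data from the paper.
graph = build_de_bruijn(["ACGTAC", "GTACG"], k=3)
```

Storing every edge in an in-memory hash map like this is what makes the naive construction expensive for mammalian-size genomes, which is the cost the abstract says the proposed method reduces.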
Affiliation(s)
- Farhad Hormozdiari
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA
| | | |
|
29
|
Abstract
MOTIVATION Next-generation sequencing technologies produce unprecedented amounts of data, leading to completely new research fields. One of these is metagenomics, the study of large-size DNA samples containing a multitude of diverse organisms. A key problem in metagenomics is to functionally and taxonomically classify the sequenced DNA, to which end the well-known BLAST program is usually used. But BLAST has dramatic resource requirements at metagenomic scales of data, imposing a high financial or technical burden on the researcher. Multiple attempts have been made to overcome these limitations and present a viable alternative to BLAST. RESULTS In this work we present Lambda, our own alternative for BLAST in the context of sequence classification. In our tests, Lambda often outperforms the best tools at reproducing BLAST's results and is the fastest compared with the current state of the art at comparable levels of sensitivity. AVAILABILITY AND IMPLEMENTATION Lambda was implemented in the SeqAn open-source C++ library for sequence analysis and is publicly available for download at http://www.seqan.de/projects/lambda. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hannes Hauswedell
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
| | - Jochen Singer
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
| | - Knut Reinert
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
| |
|
30
|
Pasquier C, Clément M, Dombrovsky A, Penaud S, Da Rocha M, Rancurel C, Ledger N, Capovilla M, Robichon A. Environmentally selected aphid variants in clonality context display differential patterns of methylation in the genome. PLoS One 2014; 9:e115022. [PMID: 25551225 PMCID: PMC4281257 DOI: 10.1371/journal.pone.0115022] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 06/23/2014] [Accepted: 11/17/2014] [Indexed: 11/18/2022] Open
Abstract
Heritability of acquired phenotypic traits is an adaptive evolutionary process that appears more complex than the basic allele selection guided by environmental pressure. In insects, the trans-generational transmission of epigenetic marks in clonal and/or sexual species is poorly documented. Aphids were used as a model to explore this feature because their asexual phase generates a stochastic and/or environment-oriented repertoire of variants. The a priori unchanged genome in clonal individuals prompts us to hypothesize whether covalent methyl DNA marks might be associated to the phenotypic variability and fitness selection. The full differential transcriptome between two environmentally selected clonal variants that originated from the same founder mother was mapped on the entire genomic scaffolds, in parallel with the methyl cytosine distribution. Data suggest that the assortments of heavily methylated DNA sites are distinct in these two clonal phenotypes. This might constitute an epigenetic mechanism that confers the robust adaptation of insect species to various environments involving clonal reproduction.
Affiliation(s)
- Claude Pasquier
- Institute of Developmental Biology and Cancer, CNRS, University Nice Sophia Antipolis, Sophia Antipolis, France
| | - Mathilde Clément
- Institute Sophia Agrobiotech, INRA/CNRS/UNS, University Nice Sophia Antipolis, Sophia Antipolis, France
| | - Aviv Dombrovsky
- Institute Sophia Agrobiotech, INRA/CNRS/UNS, University Nice Sophia Antipolis, Sophia Antipolis, France
- Institute of Plant Protection, Volcani Center, Rehovot, Israel
| | | | - Martine Da Rocha
- Institute Sophia Agrobiotech, INRA/CNRS/UNS, University Nice Sophia Antipolis, Sophia Antipolis, France
| | - Corinne Rancurel
- Institute Sophia Agrobiotech, INRA/CNRS/UNS, University Nice Sophia Antipolis, Sophia Antipolis, France
| | - Neil Ledger
- Institute Sophia Agrobiotech, INRA/CNRS/UNS, University Nice Sophia Antipolis, Sophia Antipolis, France
| | - Maria Capovilla
- Institute Sophia Agrobiotech, INRA/CNRS/UNS, University Nice Sophia Antipolis, Sophia Antipolis, France
| | - Alain Robichon
- Institute Sophia Agrobiotech, INRA/CNRS/UNS, University Nice Sophia Antipolis, Sophia Antipolis, France
| |
|
31
|
Chandran PA, Keller A, Weinmann L, Seida AA, Braun M, Andreev K, Fischer B, Horn E, Schwinn S, Junker M, Houben R, Dombrowski Y, Dietl J, Finotto S, Wölfl M, Meister G, Wischhusen J. The TGF-β-inducible miR-23a cluster attenuates IFN-γ levels and antigen-specific cytotoxicity in human CD8⁺ T cells. J Leukoc Biol 2014; 96:633-45. [PMID: 25030422 DOI: 10.1189/jlb.3a0114-025r] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Indexed: 12/19/2022] Open
Abstract
Cytokine secretion and degranulation represent key components of CD8(+) T-cell cytotoxicity. While transcriptional blockade of IFN-γ and inhibition of degranulation by TGF-β are well established, we wondered whether TGF-β could also induce immune-regulatory miRNAs in human CD8(+) T cells. We used miRNA microarrays and high-throughput sequencing in combination with qRT-PCR and found that TGF-β promotes expression of the miR-23a cluster in human CD8(+) T cells. Likewise, TGF-β up-regulated expression of the cluster in CD8(+) T cells from wild-type mice, but not in cells from mice with tissue-specific expression of a dominant-negative TGF-β type II receptor. Reporter gene assays including site mutations confirmed that miR-23a specifically targets the 3'UTR of CD107a/LAMP1 mRNA, whereas the further miRNAs expressed in this cluster, namely miR-27a and miR-24, target the 3'UTR of IFN-γ mRNA. Upon modulation of the miR-23a cluster by the respective miRNA antagomirs and mimics, we observed significant changes in IFN-γ expression, but only slight effects on CD107a/LAMP1 expression. Still, overexpression of the cluster attenuated the cytotoxic activity of antigen-specific CD8(+) T cells. These functional data thus reveal that the miR-23a cluster not only is induced by TGF-β, but also exerts a suppressive effect on CD8(+) T-cell effector functions, even in the absence of TGF-β signaling.
Affiliation(s)
- P Anoop Chandran
- Graduate School of Life Sciences (GSLS), University of Würzburg, Germany; Department of Obstetrics and Gynecology
| | - Andreas Keller
- Chair for Clinical Bioinformatics, Saarland University, Saarbrücken, Germany
| | - Lasse Weinmann
- Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Ahmed Adel Seida
- Department of Obstetrics and Gynecology, Interdisciplinary Center for Clinical Research
| | - Matthias Braun
- Pediatric Hematology, Oncology, and Stem Cell Transplantation, Children's Hospital
| | - Katerina Andreev
- Laboratory of Cellular and Molecular Lung Immunology, Institute of Molecular Pneumology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
| | | | - Evi Horn
- Department of Obstetrics and Gynecology
| | - Stefanie Schwinn
- Pediatric Hematology, Oncology, and Stem Cell Transplantation, Children's Hospital
| | - Markus Junker
- Department of Obstetrics and Gynecology, Interdisciplinary Center for Clinical Research
| | - Roland Houben
- Department of Dermatology, University of Würzburg Medical School, Würzburg, Germany
| | - Yvonne Dombrowski
- Department of Obstetrics and Gynecology, Interdisciplinary Center for Clinical Research
| | | | - Susetta Finotto
- Laboratory of Cellular and Molecular Lung Immunology, Institute of Molecular Pneumology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
| | - Matthias Wölfl
- Pediatric Hematology, Oncology, and Stem Cell Transplantation, Children's Hospital
| | - Gunter Meister
- Max Planck Institute of Biochemistry, Martinsried, Germany; Department of Biochemistry, University of Regensburg, Germany
| | - Jörg Wischhusen
- Department of Obstetrics and Gynecology, Interdisciplinary Center for Clinical Research
| |
|
32
|
Abstract
BACKGROUND The alignment of short reads generated by next-generation sequencers to genomes is an important problem in many biomedical and bioinformatics applications. Although many proposed methods work very well on narrow ranges of read lengths, they tend to suffer in performance and alignment quality for reads outside of these ranges. RESULTS We introduce RandAL, a novel method that aligns DNA sequences to reference genomes. Our approach utilizes two FM indices to facilitate efficient bidirectional searching, a pruning heuristic to speed up the computing of edit distances, and most importantly, a randomized strategy that enables effective estimation of key parameters. Extensive comparisons showed that RandAL outperformed popular aligners in most instances and was unique in its consistent and accurate performance over a wide range of read lengths and error rates. The software package is publicly available at https://github.com/namsyvo/RandAL. CONCLUSIONS RandAL promises to align effectively and accurately short reads that come from a variety of technologies with different read lengths and rates of sequencing error.
|
33
|
Poole CB, Gu W, Kumar S, Jin J, Davis PJ, Bauche D, McReynolds LA. Diversity and expression of microRNAs in the filarial parasite, Brugia malayi. PLoS One 2014; 9:e96498. [PMID: 24824352 PMCID: PMC4019659 DOI: 10.1371/journal.pone.0096498] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 11/22/2013] [Accepted: 04/08/2014] [Indexed: 11/18/2022] Open
Abstract
Human filarial parasites infect an estimated 120 million people in 80 countries worldwide causing blindness and the gross disfigurement of limbs and genitals. An understanding of RNA-mediated regulatory pathways in these parasites may open new avenues for treatment. Toward this goal, small RNAs from Brugia malayi adult females, males and microfilariae were cloned for deep-sequencing. From ∼30 million sequencing reads, 145 miRNAs were identified in the B. malayi genome. Some microRNAs were validated using the p19 RNA binding protein and qPCR. B. malayi miRNAs segregate into 99 families each defined by a unique seed sequence. Sixty-one of the miRNA families are highly conserved with homologues in arthropods, vertebrates and helminths. Of those miRNAs not highly conserved, homologues of 20 B. malayi miRNA families were found in vertebrates. Nine B. malayi miRNA families appear to be filarial-specific as orthologues were not found in other organisms. The miR-2 family is the largest in B. malayi with 11 members. Analysis of the sequences shows that six members result from a recent expansion of the family. Library comparisons found that 1/3 of the B. malayi miRNAs are differentially expressed. For example, miR-71 is 5–7X more highly expressed in microfilariae than adults. Studies suggest that in C. elegans, miR-71 may enhance longevity by targeting the DAF-2 pathway. Characterization of B. malayi miRNAs and their targets will enhance our understanding of their regulatory pathways in filariads and aid in the search for novel therapeutics.
Affiliation(s)
- Catherine B. Poole
  - Division of RNA Biology, New England Biolabs, Ipswich, Massachusetts, United States of America
  - Division of Parasitology, New England Biolabs, Ipswich, Massachusetts, United States of America
- Weifeng Gu
  - Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Sanjay Kumar
  - Division of Parasitology, New England Biolabs, Ipswich, Massachusetts, United States of America
- Jingmin Jin
  - Division of RNA Biology, New England Biolabs, Ipswich, Massachusetts, United States of America
- Paul J. Davis
  - Division of Parasitology, New England Biolabs, Ipswich, Massachusetts, United States of America
- David Bauche
  - Division of RNA Biology, New England Biolabs, Ipswich, Massachusetts, United States of America
  - Cancer Research Center of Lyon, Lyon, France
- Larry A. McReynolds
  - Division of RNA Biology, New England Biolabs, Ipswich, Massachusetts, United States of America
  - Division of Parasitology, New England Biolabs, Ipswich, Massachusetts, United States of America
|
34
|
Hach F, Sarrafi I, Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC. mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic Acids Res 2014; 42:W494-500. [PMID: 24810850 PMCID: PMC4086126 DOI: 10.1093/nar/gku370] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Indexed: 12/31/2022]
Abstract
High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the ‘best’ mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to reference genome while discounting the mismatches that occur at common SNP locations provided by db-SNP; this significantly increases the number of reads that can be mapped to the reference genome. Notice that all of the above features are implemented within the index structure and are not simple post-processing steps and thus are performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. 
In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net.
Affiliation(s)
- Faraz Hach
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6
- Iman Sarrafi
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6
- Farhad Hormozdiari
  - Computer Science Department, University of California, Los Angeles, CA, USA, 90095
- Can Alkan
  - Department of Computer Engineering, Bilkent University, 06800 Ankara, Turkey
- Evan E Eichler
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA, 98195
- S Cenk Sahinalp
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6
  - School of Informatics and Computing, Indiana University, Bloomington, IN, USA, 47405
|
35
|
Evaluation and comparison of multiple aligners for next-generation sequencing data analysis. BIOMED RESEARCH INTERNATIONAL 2014; 2014:309650. [PMID: 24779008 PMCID: PMC3980841 DOI: 10.1155/2014/309650] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Received: 12/17/2013] [Accepted: 02/04/2014] [Indexed: 12/23/2022]
Abstract
Next-generation sequencing (NGS) technology has advanced rapidly and generates massive data volumes. To align and map NGS data, biologists often select aligners arbitrarily, without considering their features, performance, and accuracy, or the sequence variations and polymorphisms present in the reference genome. This study aims to systematically evaluate and compare the capability of multiple aligners for NGS data analysis. To this end, we first compared and classified alignment algorithms. We then used long-read and short-read datasets from both real-life and in silico NGS data to evaluate these aligners on three criteria: application-specific alignment features, computational performance, and alignment accuracy. Our study provides an overall evaluation and comparison of multiple aligners for NGS data analysis. It serves as a guiding resource for biologists selecting suitable aligners for specific and broad applications.
|
36
|
The effects of carbon dioxide and temperature on microRNA expression in Arabidopsis development. Nat Commun 2014; 4:2145. [PMID: 23900278 DOI: 10.1038/ncomms3145] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.9] [Received: 12/13/2012] [Accepted: 06/14/2013] [Indexed: 11/09/2022]
Abstract
Elevated levels of CO2 and temperature can both affect plant growth and development, but the signalling pathways regulating these processes are still obscure. MicroRNAs function to silence gene expression, and environmental stresses can alter their expressions. Here we identify, using the small RNA-sequencing method, microRNAs that change significantly in expression by either doubling the atmospheric CO2 concentration or by increasing temperature 3-6 °C. Notably, nearly all CO2-influenced microRNAs are affected inversely by elevated temperature. Using the RNA-sequencing method, we determine strongly correlated expression changes between miR156/157 and miR172, and their target transcription factors under elevated CO2 concentration. Similar correlations are also found for microRNAs acting in auxin-signalling, stress responses and potential cell wall carbohydrate synthesis. Our results demonstrate that both CO2 and temperature alter microRNA expression to affect Arabidopsis growth and development, and miR156/157- and miR172-regulated transcriptional network might underlie the onset of early flowering induced by increasing CO2.
|
37
|
Grunert M, Dorn C, Schueler M, Dunkel I, Schlesinger J, Mebus S, Alexi-Meskishvili V, Perrot A, Wassilew K, Timmermann B, Hetzer R, Berger F, Sperling SR. Rare and private variations in neural crest, apoptosis and sarcomere genes define the polygenic background of isolated Tetralogy of Fallot. Hum Mol Genet 2014; 23:3115-28. [PMID: 24459294 DOI: 10.1093/hmg/ddu021] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Indexed: 01/17/2023]
Abstract
Tetralogy of Fallot (TOF) is the most common cyanotic congenital heart disease. Its genetic basis is demonstrated by an increased recurrence risk in siblings and familial cases. However, the majority of TOF are sporadic, isolated cases of undefined origin and it had been postulated that rare and private autosomal variations in concert define its genetic basis. To elucidate this hypothesis, we performed a multilevel study using targeted re-sequencing and whole-transcriptome profiling. We developed a novel concept based on a gene's mutation frequency to unravel the polygenic origin of TOF. We show that isolated TOF is caused by a combination of deleterious private and rare mutations in genes essential for apoptosis and cell growth, the assembly of the sarcomere as well as for the neural crest and secondary heart field, the cellular basis of the right ventricle and its outflow tract. Affected genes coincide in an interaction network with significant disturbances in expression shared by cases with a mutually affected TOF gene. The majority of genes show continuous expression during adulthood, which opens a new route to understand the diversity in the long-term clinical outcome of TOF cases. Our findings demonstrate that TOF has a polygenic origin and that understanding the genetic basis can lead to novel diagnostic and therapeutic routes. Moreover, the novel concept of the gene mutation frequency is a versatile measure and can be applied to other open genetic disorders.
Affiliation(s)
- Marcel Grunert
  - Group of Cardiovascular Genetics, Department of Vertebrate Genomics and Cardiovascular Genetics, Experimental and Clinical Research Center, Charité-Universitätsmedizin Berlin and Max Delbrück Center (MDC) for Molecular Medicine, Berlin 13125, Germany
- Cornelia Dorn
  - Group of Cardiovascular Genetics, Department of Vertebrate Genomics and Cardiovascular Genetics, Experimental and Clinical Research Center, Charité-Universitätsmedizin Berlin and Max Delbrück Center (MDC) for Molecular Medicine, Berlin 13125, Germany
  - Department of Biology, Chemistry and Pharmacy, Free University of Berlin, Berlin 14195, Germany
- Markus Schueler
  - Group of Cardiovascular Genetics, Department of Vertebrate Genomics and Cardiovascular Genetics, Experimental and Clinical Research Center, Charité-Universitätsmedizin Berlin and Max Delbrück Center (MDC) for Molecular Medicine, Berlin 13125, Germany
- Ilona Dunkel
  - Group of Cardiovascular Genetics, Department of Vertebrate Genomics and
- Jenny Schlesinger
  - Group of Cardiovascular Genetics, Department of Vertebrate Genomics and Cardiovascular Genetics, Experimental and Clinical Research Center, Charité-Universitätsmedizin Berlin and Max Delbrück Center (MDC) for Molecular Medicine, Berlin 13125, Germany
- Siegrun Mebus
  - Department of Pediatric Cardiology, German Heart Institute Berlin and Department of Pediatric Cardiology, Charité-Universitätsmedizin Berlin, Berlin 13353, Germany
- Andreas Perrot
  - Cardiovascular Genetics, Experimental and Clinical Research Center, Charité-Universitätsmedizin Berlin and Max Delbrück Center (MDC) for Molecular Medicine, Berlin 13125, Germany
- Bernd Timmermann
  - Next Generation Service Group, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
- Felix Berger
  - Department of Pediatric Cardiology, German Heart Institute Berlin and Department of Pediatric Cardiology, Charité-Universitätsmedizin Berlin, Berlin 13353, Germany
- Silke R Sperling
  - Group of Cardiovascular Genetics, Department of Vertebrate Genomics and Cardiovascular Genetics, Experimental and Clinical Research Center, Charité-Universitätsmedizin Berlin and Max Delbrück Center (MDC) for Molecular Medicine, Berlin 13125, Germany
  - Department of Biology, Chemistry and Pharmacy, Free University of Berlin, Berlin 14195, Germany
|
38
|
|
39
|
Busse CE, Czogiel I, Braun P, Arndt PF, Wardemann H. Single-cell based high-throughput sequencing of full-length immunoglobulin heavy and light chain genes. Eur J Immunol 2013; 44:597-603. [PMID: 24114719 DOI: 10.1002/eji.201343917] [Citation(s) in RCA: 94] [Impact Index Per Article: 8.5] [Received: 07/21/2013] [Revised: 08/27/2013] [Accepted: 09/19/2013] [Indexed: 11/09/2022]
Abstract
Single-cell PCR and sequencing of full-length Ig heavy (Igh) and Igk and Igl light chain genes is a powerful tool to measure the diversity of antibody repertoires and allows the functional assessment of B-cell responses through direct Ig gene cloning and the generation of recombinant mAbs. However, the current methodology is not high-throughput compatible. Here we developed a two-dimensional bar-coded primer matrix to combine Igh and Igk/Igl chain gene single-cell PCR with next-generation sequencing for the parallel analysis of the antibody repertoire of over 46 000 individual B cells. Our approach provides full-length Igh and corresponding Igk/Igl chain gene-sequence information and permits the accurate correction of sequencing errors by consensus building. The use of indexed cell sorting for the isolation of single B cells enables the integration of flow cytometry and Ig gene sequence information. The strategy is fully compatible with established protocols for direct antibody gene cloning and expression and therefore advances over previously described high-throughput approaches to assess antibody repertoires at the single-cell level.
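The idea of correcting sequencing errors by consensus building can be sketched as a per-position majority vote over reads that share the same barcode pair. This is a minimal illustration under simplifying assumptions: the reads are taken to be already grouped by cell, equal in length, and aligned column by column, which the abstract does not specify. The function name `consensus` and the example reads are hypothetical.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Majority-vote consensus of equal-length, pre-aligned reads."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must be pre-aligned to equal length")
    # zip(*reads) yields one column of bases per position;
    # most_common(1) picks the most frequent base in that column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


# Toy reads from one hypothetical well; two contain a single error each.
well_reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAT"]
print(consensus(well_reads))
```

Because the two errors sit at different positions, each is outvoted by the other three reads. A real pipeline would also need to handle indels and ties, which this sketch ignores.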
Affiliation(s)
- Christian E Busse
  - Research Group Molecular Immunology, Max Planck Institute for Infection Biology, Berlin, Germany
|
40
|
Abstract
Background: Read alignment is a computational bottleneck in some sequencing projects. Most of the existing software packages for read alignment are based on two algorithmic approaches: prefix-trees and hash-tables. We propose a new approach to read alignment using random permutations of strings. Results: We present a prototype implementation and experiments performed with simulated and real reads of human DNA. Our experiments indicate that this permutations-based prototype is several times faster than comparable programs for fast read alignment and that it aligns more reads correctly. Conclusions: This approach may lead to improved speed, sensitivity, and accuracy in read alignment. The algorithm can also be used for specialized alignment applications and it can be extended to other related problems, such as assembly. More information: http://alignment.commons.yale.edu
Affiliation(s)
- Roy Lederman
  - Applied Mathematics Program, Yale University, 51 Prospect St., New Haven, CT 06511, USA.
|
41
|
Hatem A, Bozdağ D, Toland AE, Çatalyürek ÜV. Benchmarking short sequence mapping tools. BMC Bioinformatics 2013; 14:184. [PMID: 23758764 PMCID: PMC3694458 DOI: 10.1186/1471-2105-14-184] [Citation(s) in RCA: 121] [Impact Index Per Article: 11.0] [Received: 08/09/2012] [Accepted: 05/28/2013] [Indexed: 01/21/2023]
Abstract
BACKGROUND The development of next-generation sequencing instruments has led to the generation of millions of short sequences in a single run. The process of aligning these reads to a reference genome is time consuming and demands the development of fast and accurate alignment tools. However, the current proposed tools make different compromises between the accuracy and the speed of mapping. Moreover, many important aspects are overlooked while comparing the performance of a newly developed tool to the state of the art. Therefore, there is a need for an objective evaluation method that covers all the aspects. In this work, we introduce a benchmarking suite to extensively analyze sequencing tools with respect to various aspects and provide an objective comparison. RESULTS We applied our benchmarking tests on 9 well known mapping tools, namely, Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST (mrFAST) using synthetic data and real RNA-Seq data. MAQ and RMAP are based on building hash tables for the reads, whereas the remaining tools are based on indexing the reference genome. The benchmarking tests reveal the strengths and weaknesses of each tool. The results show that no single tool outperforms all others in all metrics. However, Bowtie maintained the best throughput for most of the tests while BWA performed better for longer read lengths. The benchmarking tests are not restricted to the mentioned tools and can be further applied to others. CONCLUSION The mapping process is still a hard problem that is affected by many factors. In this work, we provided a benchmarking suite that reveals and evaluates the different factors affecting the mapping process. Still, there is no tool that outperforms all of the others in all the tests. Therefore, the end user should clearly specify his needs in order to choose the tool that provides the best results.
Affiliation(s)
- Ayat Hatem
  - Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Doruk Bozdağ
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Amanda E Toland
  - Department of Molecular Virology, Immunology and Medical Genetics, The Ohio State University, Columbus, OH, USA
- Ümit V Çatalyürek
  - Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
|
42
|
Giese SH, Zickmann F, Renard BY. Specificity control for read alignments using an artificial reference genome-guided false discovery rate. Bioinformatics 2013; 30:9-16. [PMID: 23685787 DOI: 10.1093/bioinformatics/btt255] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022]
Abstract
MOTIVATION Accurate estimation, comparison and evaluation of read mapping error rates is a crucial step in the processing of next-generation sequencing data, as further analysis steps and interpretation assume the correctness of the mapping results. Current approaches are either focused on sensitivity estimation and thereby disregard specificity or are based on read simulations. Although continuously improving, read simulations are still prone to introduce a bias into the mapping error quantitation and cannot capture all characteristics of an individual dataset. RESULTS We introduce ARDEN (artificial reference driven estimation of false positives in next-generation sequencing data), a novel benchmark method that estimates error rates of read mappers based on real experimental reads, using an additionally generated artificial reference genome. It allows a dataset-specific computation of error rates and the construction of a receiver operating characteristic curve. Thereby, it can be used for optimization of parameters for read mappers, selection of read mappers for a specific problem or for filtering alignments based on quality estimation. The use of ARDEN is demonstrated in a general read mapper comparison, a parameter optimization for one read mapper and an application example in single-nucleotide polymorphism discovery with a significant reduction in the number of false positive identifications. AVAILABILITY The ARDEN source code is freely available at http://sourceforge.net/projects/arden/.
Affiliation(s)
- Sven H Giese
  - Research Group Bioinformatics (NG4), Robert Koch-Institut, Nordufer 20, 13353 Berlin, Germany
|
43
|
Mahmud MP, Wiedenhoeft J, Schliep A. Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees. Bioinformatics 2013; 28:i325-i332. [PMID: 22962448 PMCID: PMC3436807 DOI: 10.1093/bioinformatics/bts380] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/31/2022]
Abstract
Motivation: Mapping billions of reads from next generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (indels). This is a serious obstacle both in determining the spectrum and abundance of genetic variations and in personal genomics. Results: For the first time, we adopt the approximate string matching paradigm of geometric embedding to read mapping, thus rephrasing it to nearest neighbor queries in a q-gram frequency vector space. Using the L1 distance between frequency vectors has the benefit of providing lower bounds for an edit distance with affine gap costs. Using a cache-oblivious kd-tree, we realize running times that match the state-of-the-art. Additionally, running time and memory requirements are about constant for read lengths between 100 and 1000 bp. We provide a first proof-of-concept that geometric embedding is a promising paradigm for read mapping and that L1 distance might serve to detect structural variations. TreQ, our initial implementation of that concept, performs more accurately than many popular read mappers over a wide range of structural variants. Availability and implementation: TreQ will be released under the GNU Public License (GPL), and precomputed genome indices will be provided for download at http://treq.sf.net. Contact: pavelm@cs.rutgers.edu Supplementary information: Supplementary data are available at Bioinformatics online.
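The geometric-embedding idea can be illustrated with a small sketch that maps each sequence to its trinucleotide (q = 3) frequency vector and compares vectors by L1 distance. It is not TreQ's implementation: there is no kd-tree here, and the bound shown is the classical one for unit-cost edit distance (one edit changes the L1 distance by at most 2q), whereas the abstract refers to affine gap costs. All function names are hypothetical.

```python
from collections import Counter
from math import ceil


def qgram_profile(seq: str, q: int = 3) -> Counter:
    """Count every overlapping q-gram in seq (its frequency vector)."""
    return Counter(seq[i:i + q] for i in range(len(seq) - q + 1))


def l1_distance(p: Counter, r: Counter) -> int:
    """L1 distance between two sparse frequency vectors."""
    return sum(abs(p[k] - r[k]) for k in p.keys() | r.keys())


def edit_lower_bound(a: str, b: str, q: int = 3) -> int:
    """Lower bound on unit-cost edit distance from q-gram profiles.

    One substitution, insertion or deletion alters at most q q-grams
    on each side, so it moves the L1 distance by at most 2*q.
    """
    return ceil(l1_distance(qgram_profile(a, q), qgram_profile(b, q)) / (2 * q))


read = "ACGTACGT"
window = "ACGTTCGT"  # differs from read by one substitution
print(l1_distance(qgram_profile(read), qgram_profile(window)))
print(edit_lower_bound(read, window))
```

A mapper can use such a bound as a filter: reference windows whose bound already exceeds the allowed edit distance can be discarded without running a full alignment.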
Affiliation(s)
- Md Pavel Mahmud
  - Department of Computer Science, Rutgers University, New Jersey, USA.
|
44
|
Fimereli D, Detours V, Konopka T. TriageTools: tools for partitioning and prioritizing analysis of high-throughput sequencing data. Nucleic Acids Res 2013; 41:e86. [PMID: 23408855 PMCID: PMC3627586 DOI: 10.1093/nar/gkt094] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 12/22/2022]
Abstract
High-throughput sequencing is becoming a popular research tool but carries with it considerable costs in terms of computation time, data storage and bandwidth. Meanwhile, some research applications focusing on individual genes or pathways do not necessitate processing of a full sequencing dataset. Thus, it is desirable to partition a large dataset into smaller, manageable, but relevant pieces. We present a toolkit for partitioning raw sequencing data that includes a method for extracting reads that are likely to map onto pre-defined regions of interest. We show the method can be used to extract information about genes of interest from DNA or RNA sequencing samples in a fraction of the time and disk space required to process and store a full dataset. We report speedup factors between 2.6 and 96, depending on settings and samples used. The software is available at http://www.sourceforge.net/projects/triagetools/.
Affiliation(s)
- Danai Fimereli
  - IRIBHM, Université Libre de Bruxelles, 808 Route de Lennick, 1070 Brussels, Belgium
|
45
|
Nestorov P, Battke F, Levesque MP, Gerberding M. The maternal transcriptome of the crustacean Parhyale hawaiensis is inherited asymmetrically to invariant cell lineages of the ectoderm and mesoderm. PLoS One 2013; 8:e56049. [PMID: 23418507 PMCID: PMC3572164 DOI: 10.1371/journal.pone.0056049] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 07/26/2012] [Accepted: 01/04/2013] [Indexed: 11/18/2022]
Abstract
BACKGROUND The embryo of the crustacean Parhyale hawaiensis has a total, unequal and invariant early cleavage pattern. It specifies cell fates earlier than other arthropods, including Drosophila, as individual blastomeres of the 8-cell stage are allocated to the germ layers and the germline. Furthermore, the 8-cell stage is amenable to embryological manipulations. These unique features make Parhyale a suitable system for elucidating germ layer specification in arthropods. Since asymmetric localization of maternally provided RNA is a widespread mechanism to specify early cell fates, we asked whether this is also true for Parhyale. A candidate gene approach did not find RNAs that are asymmetrically distributed at the 8-cell stage. Therefore, we designed a high-density microarray from 9400 recently sequenced ESTs (1) to identify maternally provided RNAs and (2) to find RNAs that are differentially distributed among cells of the 8-cell stage. RESULTS Maternal-zygotic transition takes place around the 32-cell stage, i.e. after the specification of germ layers. By comparing a pool of RNAs from early embryos without zygotic transcription to zygotic RNAs of the germband, we found that more than 10% of the targets on the array were enriched in the maternal transcript pool. A screen for asymmetrically distributed RNAs at the 8-cell stage revealed 129 transcripts, from which 50% are predominantly expressed in the early embryonic stages. Finally, we performed knockdown experiments for two of these genes and observed cell-fate-related defects of embryonic development. CONCLUSIONS In contrast to Drosophila, the four primary germ layer cell lineages in Parhyale are specified during the maternal control phase of the embryo. A key step in this process is the asymmetric distribution of a large number of maternal RNAs to the germ layer progenitor cells.
Affiliation(s)
- Peter Nestorov
  - Max Planck Institut für Entwicklungsbiologie, Tübingen, Germany
- Florian Battke
  - Center for Bioinformatics, University of Tübingen, Tübingen, Germany
|
46
|
Xin H, Lee D, Hormozdiari F, Yedkar S, Mutlu O, Alkan C. Accelerating read mapping with FastHASH. BMC Genomics 2013; 14 Suppl 1:S13. [PMID: 23369189 PMCID: PMC3549798 DOI: 10.1186/1471-2164-14-s1-s13] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Indexed: 11/10/2022]
Abstract
With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of all medical and genetic applications of next-generation sequencing critically depends on the existence of computational techniques that can process and analyze the enormous amount of sequence data quickly and accurately. Unfortunately, the current read mapping algorithms have difficulties in coping with the massive amounts of data generated by NGS. We propose a new algorithm, FastHASH, which drastically improves the performance of the seed-and-extend type hash table based read mapping algorithms, while maintaining the high sensitivity and comprehensiveness of such methods. FastHASH is a generic algorithm compatible with all seed-and-extend class read mapping algorithms. It introduces two main techniques, namely Adjacency Filtering, and Cheap K-mer Selection. We implemented FastHASH and merged it into the codebase of the popular read mapping program, mrFAST. Depending on the edit distance cutoffs, we observed up to 19-fold speedup while still maintaining 100% sensitivity and high comprehensiveness.
Affiliation(s)
- Hongyi Xin
  - Depts. of Computer Science and Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
47
|
Veeneman BA, Iyer MK, Chinnaiyan AM. Oculus: faster sequence alignment by streaming read compression. BMC Bioinformatics 2012; 13:297. [PMID: 23148484 PMCID: PMC3534618 DOI: 10.1186/1471-2105-13-297] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/10/2012] [Accepted: 11/01/2012] [Indexed: 01/17/2023]
Abstract
Background Despite significant advancement in alignment algorithms, the exponential growth of nucleotide sequencing throughput threatens to outpace bioinformatic analysis. Computation may become the bottleneck of genome analysis if growing alignment costs are not mitigated by further improvement in algorithms. Much gain has been gleaned from indexing and compressing alignment databases, but many widely used alignment tools process input reads sequentially and are oblivious to any underlying redundancy in the reads themselves. Results Here we present Oculus, a software package that attaches to standard aligners and exploits read redundancy by performing streaming compression, alignment, and decompression of input sequences. This nearly lossless process (> 99.9%) led to alignment speedups of up to 270% across a variety of data sets, while requiring a modest amount of memory. We expect that streaming read compressors such as Oculus could become a standard addition to existing RNA-Seq and ChIP-Seq alignment pipelines, and potentially other applications in the future as throughput increases. Conclusions Oculus efficiently condenses redundant input reads and wraps existing aligners to provide nearly identical SAM output in a fraction of the aligner runtime. It includes a number of useful features, such as tunable performance and fidelity options, compatibility with FASTA or FASTQ files, and adherence to the SAM format. The platform-independent C++ source code is freely available online, at http://code.google.com/p/oculus-bio.
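The principle of exploiting read redundancy can be sketched as follows: identical reads are collapsed, each distinct sequence is aligned once, and the result is copied back to every original read. This is only an exact-duplicate illustration, not Oculus's actual compression scheme (which the abstract describes as nearly lossless). The function `align_unique_then_expand` and the stand-in aligner `toy_align` are hypothetical; a real pipeline would wrap an external aligner instead.

```python
def align_unique_then_expand(reads, align_fn):
    """Align each distinct read once, then restore per-read output order."""
    alignments = {}
    for read in reads:
        if read not in alignments:  # first time this sequence is seen
            alignments[read] = align_fn(read)
    # "Decompression": every input read gets the cached result.
    return [(read, alignments[read]) for read in reads]


TOY_REFERENCE = "ACGTACGGTCA"
calls = []  # records how often the stand-in aligner is invoked


def toy_align(read):
    """Stand-in aligner: leftmost exact-match position, or -1 if absent."""
    calls.append(read)
    return TOY_REFERENCE.find(read)


reads = ["ACGT", "GGTC", "ACGT", "ACGT"]
results = align_unique_then_expand(reads, toy_align)
print(results)
print(len(calls), "aligner calls for", len(reads), "reads")
```

The saving grows with the fraction of duplicated reads, which is why the abstract singles out RNA-Seq and ChIP-Seq pipelines as likely beneficiaries.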
Affiliation(s)
- Brendan A Veeneman
  - Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
|
48
|
Wilm A, Aw PPK, Bertrand D, Yeo GHT, Ong SH, Wong CH, Khor CC, Petric R, Hibberd ML, Nagarajan N. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res 2012; 40:11189-201. [PMID: 23066108 PMCID: PMC3526318 DOI: 10.1093/nar/gks918] [Citation(s) in RCA: 887] [Impact Index Per Article: 73.9] [Indexed: 02/06/2023]
Abstract
The study of cell-population heterogeneity in a range of biological systems, from viruses to bacterial isolates to tumor samples, has been transformed by recent advances in sequencing throughput. While the high-coverage afforded can be used, in principle, to identify very rare variants in a population, existing ad hoc approaches frequently fail to distinguish true variants from sequencing errors. We report a method (LoFreq) that models sequencing run-specific error rates to accurately call variants occurring in <0.05% of a population. Using simulated and real datasets (viral, bacterial and human), we show that LoFreq has near-perfect specificity, with significantly improved sensitivity compared with existing methods and can efficiently analyze deep Illumina sequencing datasets without resorting to approximations or heuristics. We also present experimental validation for LoFreq on two different platforms (Fluidigm and Sequenom) and its application to call rare somatic variants from exome sequencing datasets for gastric cancer. Source code and executables for LoFreq are freely available at http://sourceforge.net/projects/lofreq/.
Affiliation(s)
- Andreas Wilm
- Genome Institute of Singapore, 60 Biopolis Street, Genome, #02-01, Singapore 138672, Singapore
|
49
|
Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics 2012; 28:3169-77. [DOI: 10.1093/bioinformatics/bts605] [Citation(s) in RCA: 207] [Impact Index Per Article: 17.3] [Indexed: 12/18/2022] Open
|
50
|
Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 2012; 13:238. [PMID: 22988817 PMCID: PMC3572422 DOI: 10.1186/1471-2105-13-238] [Citation(s) in RCA: 795] [Impact Index Per Article: 66.3] [Received: 03/12/2012] [Accepted: 09/17/2012] [Indexed: 11/17/2022] Open
Abstract
Background: Recent methods have been developed to perform high-throughput sequencing of DNA by Single Molecule Sequencing (SMS). While Next-Generation sequencing methods may produce reads up to several hundred bases long, SMS sequencing produces reads up to tens of kilobases long. Existing alignment methods are either too inefficient for high-throughput datasets, or not sensitive enough to align SMS reads, which have a higher error rate than Next-Generation sequencing.
Results: We describe the method BLASR (Basic Local Alignment with Successive Refinement) for mapping Single Molecule Sequencing (SMS) reads that are thousands of bases long, with divergence between the read and genome dominated by insertion and deletion error. The method is benchmarked using both simulated reads and reads from a bacterial sequencing project. We also present a combinatorial model of sequencing error that motivates why our approach is effective.
Conclusions: The results indicate that it is possible to map SMS reads with high accuracy and speed. Furthermore, the inferences made on the mapability of SMS reads using our combinatorial model of sequencing error are in agreement with the mapping accuracy demonstrated on simulated reads.
Affiliation(s)
- Mark J Chaisson
- Department of Mathematics, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA, USA
|