1
Mukherjee K, Dole-Muinos D, Ajayi A, Rossi M, Prosperi M, Boucher C. Finding Overlapping Rmaps via Clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; PP:1-1. [PMID: 34890332 DOI: 10.1109/tcbb.2021.3132534]
Abstract
Optical mapping has been largely automated, and first produces single molecule restriction maps, called Rmaps, which are assembled to generate genome wide optical maps. Since the location and orientation of each Rmap is unknown, the first problem in the analysis of this data is finding related Rmaps, i.e., pairs of Rmaps that share the same orientation and have significant overlap in their genomic location. Although heuristics for identifying related Rmaps exist, they all require quantization of the data, which leads to a loss in precision. In this paper, we propose a clustering method based on Gaussian mixture modelling, which we refer to as OMclust, that finds overlapping Rmaps without quantization. Using both simulated and real datasets, we show that OMclust substantially improves the precision (from 48.3% to 73.3%) over the state-of-the-art methods while also reducing CPU time and memory consumption. Further, we integrated OMclust into the error correction methods (Elmeri and Comet) to demonstrate the increase in the performance of these methods. When OMclust was combined with Comet to error correct Rmap data generated from human DNA, it was able to error correct close to 3x more Rmaps, and reduced the CPU time by more than 35x.
2
Abstract
In genomics, optical mapping technology provides long-range contiguity information to improve genome sequence assemblies and detect structural variation. Originally a laborious manual process, Bionano Genomics platforms now offer high-throughput, automated optical mapping based on chips packed with nanochannels through which unwound DNA is guided and the fluorescent DNA backbone and specific restriction sites are recorded. Although the raw image data obtained is of high quality, the processing and assembly software accompanying the platforms is closed source and does not seem to make full use of data, labeling approximately half of the measured signals as unusable. Here we introduce two new software tools, independent of Bionano Genomics software, to extract and process molecules from raw images (OptiScan) and to perform molecule-to-molecule and molecule-to-reference alignments using a novel signal-based approach (OptiMap). We demonstrate that the molecules detected by OptiScan can yield better assemblies, and that the approach taken by OptiMap results in higher use of molecules from the raw data. These tools lay the foundation for a suite of open-source methods to process and analyze high-throughput optical mapping data. The Python implementations of the OptiTools are publicly available through http://www.bif.wur.nl/.
3
Walve R, Puglisi SJ, Salmela L. Space-Efficient Indexing of Spaced Seeds for Accurate Overlap Computation of Raw Optical Mapping Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; PP:2454-2462. [PMID: 34057895 DOI: 10.1109/tcbb.2021.3085086]
Abstract
A key problem in processing raw optical mapping data (Rmaps) is finding Rmaps originating from the same genomic region. These sets of related Rmaps can be used to correct errors in Rmap data, and to find overlaps between Rmaps to assemble consensus optical maps. Previous Rmap overlap aligners are computationally very expensive and do not scale to large eukaryotic data sets. We present Selkie, an Rmap overlap aligner based on a spaced (l,k)-mer index which was pioneered in the Rmap error correction tool Elmeri. Here we present a space efficient version of the index which is twice as fast as prior art while using just a quarter of the memory on a human data set. Moreover, our index can be used for filtering candidates for Rmap overlap computation, whereas Elmeri used the index only for error correction of Rmaps. By combining our filtering of Rmaps with the exhaustive, but highly accurate, algorithm of Valouev et al. (2006), Selkie maintains or increases the accuracy of finding overlapping Rmaps on a bacterial dataset while being at least four times faster. Furthermore, for finding overlaps in a human dataset, Selkie is up to two orders of magnitude faster than previous methods.
4
Mukherjee K, Rossi M, Salmela L, Boucher C. Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph. Algorithms Mol Biol 2021; 16:6. [PMID: 34034751 PMCID: PMC8147420 DOI: 10.1186/s13015-021-00182-9] [Received: 01/19/2021] [Accepted: 04/13/2021]
Abstract
Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as RMAPPER, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) only successfully ran on E. coli. Moreover, on the human genome RMAPPER was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, RMAPPER, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/Rmapper.
5
Raeisi Dehkordi S, Luebeck J, Bafna V. FaNDOM: Fast nested distance-based seeding of optical maps. Patterns (New York, N.Y.) 2021; 2:100248. [PMID: 34027500 PMCID: PMC8134938 DOI: 10.1016/j.patter.2021.100248] [Received: 01/20/2021] [Revised: 03/08/2021] [Accepted: 04/01/2021]
Abstract
Optical mapping (OM) provides single-molecule readouts of fluorescently labeled sequence motifs on long fragments of DNA, resolved to nucleotide-level coordinates. With the advent of microfluidic technologies for analysis of DNA molecules, it is possible to inexpensively generate long OM data (>150 kbp) at high coverage. In addition to scaffolding for de novo assembly, OM data can be aligned to a reference genome for identification of genomic structural variants. We introduce FaNDOM (Fast Nested Distance Seeding of Optical Maps), an optical map alignment tool that greatly reduces the search space of the alignment process. On four benchmark human datasets, FaNDOM was significantly (4-14×) faster than competing tools while maintaining comparable sensitivity and specificity. We used FaNDOM to map variants in three cancer cell lines and identified many biologically interesting structural variants, including deletions, duplications, gene fusions and gene-disrupting rearrangements. FaNDOM is publicly available at https://github.com/jluebeck/FaNDOM.
Affiliation(s)
- Siavash Raeisi Dehkordi
  - Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA 92093, USA
- Jens Luebeck
  - Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA 92093, USA
  - Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
- Vineet Bafna
  - Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA 92093, USA
6
Leung AKY, Liu MCJ, Li L, Lai YYY, Chu C, Kwok PY, Ho PL, Yip KY, Chan TF. OMMA enables population-scale analysis of complex genomic features and phylogenomic relationships from nanochannel-based optical maps. Gigascience 2019; 8:giz079. [PMID: 31289833 PMCID: PMC6615982 DOI: 10.1093/gigascience/giz079] [Received: 08/05/2018] [Revised: 01/13/2019] [Accepted: 06/16/2019]
Abstract
BACKGROUND: Optical mapping is an emerging technology that complements sequencing-based methods in genome analysis. It is widely used in improving genome assemblies and detecting structural variations by providing information over much longer (up to 1 Mb) reads. Current standards in optical mapping analysis involve assembling optical maps into contigs and aligning them to a reference, which is limited to pairwise comparison and becomes bias-prone when analyzing multiple samples. FINDINGS: We present a new method, OMMA, that extends optical mapping to the study of complex genomic features by simultaneously interrogating optical maps across many samples in a reference-independent manner. OMMA captures and characterizes complex genomic features, e.g., multiple haplotypes, copy number variations, and subtelomeric structures, when applied to 154 human samples across the 26 populations sequenced in the 1000 Genomes Project. For small genomes such as pathogenic bacteria, OMMA accurately reconstructs the phylogenomic relationships and identifies functional elements across 21 Acinetobacter baumannii strains. CONCLUSIONS: With the increasing data throughput of optical mapping systems, the use of this technology in comparative genome analysis across many samples will become feasible. OMMA is a timely solution that can address this computational need. The OMMA software is available at https://github.com/TF-Chan-Lab/OMTools.
Affiliation(s)
- Melissa Chun-Jiao Liu
  - Carol Yu Center for Infection and Department of Microbiology, The University of Hong Kong, Queen Mary Hospital, Pok Fu Lam, Hong Kong
- Le Li
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
- Yvonne Yuk-Yin Lai
  - Cardiovascular Research Institute, University of California, San Francisco, CA 94153, USA
  - Institute of Human Genetics, University of California, San Francisco, CA 94153, USA
- Catherine Chu
  - Cardiovascular Research Institute, University of California, San Francisco, CA 94153, USA
  - Institute of Human Genetics, University of California, San Francisco, CA 94153, USA
- Pui-Yan Kwok
  - Cardiovascular Research Institute, University of California, San Francisco, CA 94153, USA
  - Institute of Human Genetics, University of California, San Francisco, CA 94153, USA
- Pak-Leung Ho
  - Carol Yu Center for Infection and Department of Microbiology, The University of Hong Kong, Queen Mary Hospital, Pok Fu Lam, Hong Kong
- Kevin Y Yip
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
  - Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, Hong Kong
- Ting-Fung Chan
  - School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong
  - State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong
  - Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, Hong Kong
7
Leung AKY, Kwok TP, Wan R, Xiao M, Kwok PY, Yip KY, Chan TF. OMBlast: alignment tool for optical mapping using a seed-and-extend approach. Bioinformatics 2018; 33:311-319. [PMID: 28172448 PMCID: PMC5409310 DOI: 10.1093/bioinformatics/btw620] [Received: 09/26/2015] [Revised: 08/31/2016] [Accepted: 09/26/2016]
Abstract
Motivation: Optical mapping is a technique for capturing fluorescent signal patterns of long DNA molecules (in the range of 0.1–1 Mbp). Recently, it has been complementing the widely used short-read sequencing technology by assisting with scaffolding and detecting large and complex structural variations (SVs). Here, we introduce a fast, robust and accurate tool called OMBlast for aligning optical maps, the set of signal locations on the molecules generated from optical mapping. Our method is based on the seed-and-extend approach from sequence alignment, with modifications specific to optical mapping. Results: Experiments with both synthetic and our real data demonstrate that OMBlast has higher accuracy and faster mapping speed than existing alignment methods. Our tool also shows significant improvement when aligning data with SVs. Availability and Implementation: OMBlast is implemented for Java 1.7 and is released under a GPL license. OMBlast can be downloaded from https://github.com/aldenleung/OMBlast and run directly on machines equipped with a Java virtual machine. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tsz-Piu Kwok
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
- Raymond Wan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong, China
- Ming Xiao
  - School of Biomedical Engineering, Science and Health System, Drexel University, Philadelphia, PA, USA
- Pui-Yan Kwok
  - Institute for Human Genetics; Cardiovascular Research Institute, University of California San Francisco, San Francisco, CA, USA
- Kevin Y Yip
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China; Hong Kong Bioinformatics Centre
- Ting-Fung Chan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong, China; Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China; Centre for Soybean Research, State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong, China; Hong Kong Bioinformatics Centre
8
Towards a More Accurate Error Model for BioNano Optical Maps. Bioinformatics Research and Applications 2016. [DOI: 10.1007/978-3-319-38782-6_6]
9
Mendelowitz LM, Schwartz DC, Pop M. Maligner: a fast ordered restriction map aligner. Bioinformatics 2015; 32:1016-22. [PMID: 26637292 DOI: 10.1093/bioinformatics/btv711] [Received: 08/20/2015] [Accepted: 12/01/2015]
Abstract
MOTIVATION: The Optical Mapping System discovers structural variants and potentiates sequence assembly of genomes via scaffolding and comparisons that globally validate or correct sequence assemblies. Despite its utility, there are few publicly available tools for aligning optical mapping datasets. RESULTS: Here we present software, named 'Maligner', for the alignment of both single molecule restriction maps (Rmaps) and in silico restriction maps of sequence contigs to a reference. Maligner provides two modes of alignment: an efficient, sensitive dynamic programming implementation that scales to large eukaryotic genomes, and a faster index-based implementation for finding alignments with unmatched sites in the reference but not the query. We compare our software to other publicly available tools on Rmap datasets and show that Maligner finds more correct alignments in comparable runtime. Lastly, we introduce the M-Score statistic for normalizing alignment scores across restriction maps and demonstrate its utility for selecting high quality alignments. AVAILABILITY AND IMPLEMENTATION: The Maligner software is written in C++ and is available at https://github.com/LeeMendelowitz/maligner under the GNU General Public License. CONTACT: mpop@umiacs.umd.edu.
Affiliation(s)
- Lee M Mendelowitz
  - Center for Bioinformatics and Computational Biology, Applied Math & Statistics, and Scientific Computation
- David C Schwartz
  - Laboratory for Molecular and Computational Genomics, Department of Chemistry, Laboratory of Genetics, USA and the UW-Biotechnology Center, University of Wisconsin-Madison, WI 53706, USA
- Mihai Pop
  - Center for Bioinformatics and Computational Biology, Applied Math & Statistics, and Scientific Computation, Department of Computer Science, University of Maryland, College Park, MD 20742, USA
10
Abstract
Optical mapping has been widely used to improve de novo plant genome assemblies, including rice, maize, Medicago, Amborella, tomato and wheat, with more genomes in the pipeline. Optical mapping provides long-range information of the genome and can more easily identify large structural variations. The ability of optical mapping to assay long single DNA molecules nicely complements short-read sequencing which is more suitable for the identification of small and short-range variants. Direct use of optical mapping to study population-level genetic diversity is currently limited to microbial strain typing and human diversity studies. Nonetheless, optical mapping shows great promise in the study of plant trait development, domestication and polyploid evolution. Here we review the current applications and future prospects of optical mapping in the field of plant comparative genomics.
11
Mendelowitz L, Pop M. Computational methods for optical mapping. Gigascience 2014; 3:33. [PMID: 25671093 PMCID: PMC4323141 DOI: 10.1186/2047-217x-3-33] [Received: 09/22/2014] [Accepted: 12/02/2014]
Abstract
Optical mapping and newer genome mapping technologies based on nicking enzymes provide low resolution but long-range genomic information. The optical mapping technique has been successfully used for assessing the quality of genome assemblies and for detecting large-scale structural variants and rearrangements that cannot be detected using current paired end sequencing protocols. Here, we review several algorithms and methods for building consensus optical maps and aligning restriction patterns to a reference map, as well as methods for using optical maps with sequence assemblies.
Affiliation(s)
- Lee Mendelowitz
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA; Applied Math & Statistics, and Scientific Computation, University of Maryland, College Park, MD, USA
- Mihai Pop
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA; Department of Computer Science, University of Maryland, College Park, MD, USA