1
Nogin Y, Sapir D, Zur TD, Weinberger N, Belinkov Y, Ebenstein Y, Shechtman Y. OM2Seq: learning retrieval embeddings for optical genome mapping. Bioinformatics Advances 2024; 4:vbae079. [PMID: 38915884 PMCID: PMC11194751 DOI: 10.1093/bioadv/vbae079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 04/27/2024] [Accepted: 05/28/2024] [Indexed: 06/26/2024]
Abstract
Motivation: Genomics-based diagnostic methods that are quick, precise, and economical are essential for the advancement of precision medicine, with applications spanning the diagnosis of infectious diseases, cancer, and rare diseases. One technology that holds potential in this field is optical genome mapping (OGM), which is capable of structural variation detection, epigenomic profiling, and microbial species identification. It is based on imaging linearized DNA molecules that are stained with fluorescent labels and then aligned to a reference genome. However, the computational methods currently available for OGM fall short in terms of accuracy and computational speed. Results: This work introduces OM2Seq, a new approach for the rapid and accurate mapping of DNA fragment images to a reference genome. Based on a Transformer-encoder architecture, OM2Seq is trained on acquired OGM data to efficiently encode DNA fragment images and reference genome segments to a common embedding space, which can be indexed and efficiently queried using a vector database. We show that OM2Seq significantly outperforms the baseline methods in both computational speed (by 2 orders of magnitude) and accuracy. Availability and implementation: https://github.com/yevgenin/om2seq.
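The retrieval step described in this abstract (a query embedding matched against indexed reference-segment embeddings) can be illustrated with a brute-force nearest-neighbour search. This is a minimal sketch under stated assumptions, not the OM2Seq implementation: the vectors are toy values rather than Transformer outputs, cosine similarity is an assumed metric, and a real vector database would replace the linear scan. The names `cosine_similarity`, `top_k`, and the segment labels are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=1):
    """Return the k reference labels whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Toy index: hypothetical reference-segment labels mapped to made-up embeddings.
reference_index = {
    "chr1:0-200kb": [1.0, 0.0, 0.0],
    "chr1:200-400kb": [0.0, 1.0, 0.0],
    "chr2:0-200kb": [0.0, 0.0, 1.0],
}

# Made-up embedding standing in for an encoded DNA fragment image.
query_embedding = [0.1, 0.9, 0.2]
best_match = top_k(query_embedding, reference_index, k=1)
```

A linear scan costs one similarity computation per indexed segment, which is why an approximate nearest-neighbour index is the usual choice once the reference is genome-sized.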
Affiliation(s)
- Yevgeni Nogin
  - Russell Berrie Nanotechnology Institute, Technion, Haifa 320003, Israel
- Danielle Sapir
  - Faculty of Electrical and Computer Engineering, Technion, Haifa 320003, Israel
- Tahir Detinis Zur
  - Department of Chemistry, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Nir Weinberger
  - Faculty of Electrical and Computer Engineering, Technion, Haifa 320003, Israel
- Yuval Ebenstein
  - Department of Chemistry, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
  - Department of Biomedical Engineering, Faculty of Engineering, Tel Aviv University, Tel Aviv 6997801, Israel
- Yoav Shechtman
  - Russell Berrie Nanotechnology Institute, Technion, Haifa 320003, Israel
  - Department of Biomedical Engineering, Technion, Haifa 320003, Israel
  - Lorry I. Lokey Center for Life Sciences and Engineering, Technion, Haifa 320003, Israel
  - Department of Mechanical Engineering, University of Texas at Austin, Austin, TX 78712, United States
2
Do V, Nguyen S, Le D, Nguyen T, Nguyen C, Ho T, Vo N, Nguyen T, Nguyen H, Cao M. Pasa: leveraging population pangenome graph to scaffold prokaryote genome assemblies. Nucleic Acids Res 2024; 52:e15. [PMID: 38084888 PMCID: PMC10853769 DOI: 10.1093/nar/gkad1170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 11/07/2023] [Accepted: 11/22/2023] [Indexed: 02/10/2024]
Abstract
Whole genome sequencing has increasingly become the essential method for studying the genetic mechanisms of antimicrobial resistance and for surveillance of drug-resistant bacterial pathogens. The majority of bacterial genomes sequenced to date have been sequenced with Illumina sequencing technology, owing to its high throughput, excellent sequence accuracy, and low cost. However, because of the short-read nature of the technology, these assemblies are fragmented into large numbers of contigs, which hinders recovery of the full information of the genome. We develop Pasa, a graph-based algorithm that utilizes the pangenome graph and the assembly graph information to improve scaffolding quality. By leveraging the population information of the bacterial species, Pasa is able to utilize the linkage information of the gene families of the species to resolve the contig graph of the assembly. We show that our method outperforms the current state of the art in terms of accuracy and, at the same time, is computationally efficient enough to be applied to a large number of existing draft assemblies.
Affiliation(s)
- Van Hoan Do
  - Center for Applied Mathematics and Informatics, Le Quy Don Technical University, Hanoi, Vietnam
- Duc Quang Le
  - Faculty of IT, Hanoi University of Civil Engineering, Hanoi, Vietnam
- Tam Thi Nguyen
  - Oxford University Clinical Research Unit, Hanoi, Vietnam
- Canh Hao Nguyen
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan
- Tho Huu Ho
  - Department of Medical Microbiology, The 103 Military Hospital, Vietnam Military Medical University, Hanoi, Vietnam
  - Department of Genomics & Cytogenetics, Institute of Biomedicine & Pharmacy, Vietnam Military Medical University, Hanoi, Vietnam
- Nam S Vo
  - Center for Biomedical Informatics, Vingroup Big Data Institute, Hanoi, Vietnam
3
Nogin Y, Bar-Lev D, Hanania D, Detinis Zur T, Ebenstein Y, Yaakobi E, Weinberger N, Shechtman Y. Design of optimal labeling patterns for optical genome mapping via information theory. Bioinformatics 2023; 39:btad601. [PMID: 37758248 PMCID: PMC10563147 DOI: 10.1093/bioinformatics/btad601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Revised: 08/31/2023] [Accepted: 09/26/2023] [Indexed: 10/03/2023]
Abstract
MOTIVATION: Optical genome mapping (OGM) is a technique that extracts partial genomic information from optically imaged and linearized DNA fragments containing fluorescently labeled short sequence patterns. This information can be used for various genomic analyses and applications, such as the detection of structural variations and copy-number variations, epigenomic profiling, and microbial species identification. Currently, the choice of labeled patterns is based on the available biochemical methods and is not necessarily optimized for the application. RESULTS: In this work, we develop a model of OGM based on information theory, which enables the design of optimal labeling patterns for specific applications and target organism genomes. We validated the model through experimental OGM on human DNA and simulations on bacterial DNA. Our model predicts up to 10-fold improved accuracy by optimal choice of labeling patterns, which may guide future development of OGM biochemical labeling methods and significantly improve its accuracy and yield for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples. AVAILABILITY AND IMPLEMENTATION: https://github.com/yevgenin/PatternCode.
Affiliation(s)
- Yevgeni Nogin
  - Russell Berrie Nanotechnology Institute, Technion, Haifa 320003, Israel
- Dganit Hanania
  - Department of Computer Science, Technion, Haifa 320003, Israel
- Tahir Detinis Zur
  - Department of Chemistry, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
- Yuval Ebenstein
  - Department of Chemistry, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 6997801, Israel
  - Department of Biomedical Engineering, Faculty of Engineering, Tel Aviv University, Tel Aviv 6997801, Israel
- Eitan Yaakobi
  - Department of Computer Science, Technion, Haifa 320003, Israel
- Nir Weinberger
  - Department of Electrical Engineering, Technion, Haifa 320003, Israel
- Yoav Shechtman
  - Russell Berrie Nanotechnology Institute, Technion, Haifa 320003, Israel
  - Department of Biomedical Engineering, Technion, Haifa 320003, Israel
  - Lorry I. Lokey Center for Life Sciences and Engineering, Technion, Haifa 320003, Israel
4
Coombe L, Li JX, Lo T, Wong J, Nikolic V, Warren RL, Birol I. LongStitch: high-quality genome assembly correction and scaffolding using long reads. BMC Bioinformatics 2021; 22:534. [PMID: 34717540 PMCID: PMC8557608 DOI: 10.1186/s12859-021-04451-7] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 06/17/2021] [Accepted: 10/19/2021] [Indexed: 12/12/2022]
Abstract
BACKGROUND: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. RESULTS: LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which include initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. CONCLUSIONS: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects.
The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
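The minimizer mappings mentioned in this abstract rest on a general idea: from every window of w consecutive k-mers, keep only the smallest one, so that two sequences sharing a region tend to select the same k-mers. The sketch below illustrates that idea only and is not ntLink's code. Lexicographic ordering is a simplifying assumption (practical tools typically order k-mers by a hash), and the function names `minimizers` and `shared_minimizers` are hypothetical.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs selected as window minimizers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Smallest k-mer in the window wins; ties go to the leftmost position.
        best = min(window, key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


def shared_minimizers(seq_a, seq_b, k, w):
    """K-mers that are selected as minimizers in both sequences."""
    a = {kmer for _, kmer in minimizers(seq_a, k, w)}
    b = {kmer for _, kmer in minimizers(seq_b, k, w)}
    return a & b


# Toy sequences standing in for a contig and a long read.
contig = "ACGTACGT"
read = "TTACGTTT"
contig_sketch = minimizers(contig, k=3, w=2)
common = shared_minimizers(contig, read, k=3, w=2)
```

Because only a subset of k-mers is stored per sequence, comparing sketches is much cheaper than comparing all k-mers, at the cost of missing some short matches.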
Affiliation(s)
- Lauren Coombe
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Janet X Li
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Theodora Lo
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Johnathan Wong
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Vladimir Nikolic
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
5
Abstract
In genomics, optical mapping technology provides long-range contiguity information to improve genome sequence assemblies and detect structural variation. Originally a laborious manual process, Bionano Genomics platforms now offer high-throughput, automated optical mapping based on chips packed with nanochannels through which unwound DNA is guided and the fluorescent DNA backbone and specific restriction sites are recorded. Although the raw image data obtained is of high quality, the processing and assembly software accompanying the platforms is closed source and does not seem to make full use of data, labeling approximately half of the measured signals as unusable. Here we introduce two new software tools, independent of Bionano Genomics software, to extract and process molecules from raw images (OptiScan) and to perform molecule-to-molecule and molecule-to-reference alignments using a novel signal-based approach (OptiMap). We demonstrate that the molecules detected by OptiScan can yield better assemblies, and that the approach taken by OptiMap results in higher use of molecules from the raw data. These tools lay the foundation for a suite of open-source methods to process and analyze high-throughput optical mapping data. The Python implementations of the OptiTools are publicly available through http://www.bif.wur.nl/.
6
Walve R, Puglisi SJ, Salmela L. Space-Efficient Indexing of Spaced Seeds for Accurate Overlap Computation of Raw Optical Mapping Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; PP:2454-2462. [PMID: 34057895 DOI: 10.1109/tcbb.2021.3085086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
A key problem in processing raw optical mapping data (Rmaps) is finding Rmaps originating from the same genomic region. These sets of related Rmaps can be used to correct errors in Rmap data, and to find overlaps between Rmaps to assemble consensus optical maps. Previous Rmap overlap aligners are computationally very expensive and do not scale to large eukaryotic data sets. We present Selkie, an Rmap overlap aligner based on a spaced (l,k)-mer index which was pioneered in the Rmap error correction tool Elmeri. Here we present a space efficient version of the index which is twice as fast as prior art while using just a quarter of the memory on a human data set. Moreover, our index can be used for filtering candidates for Rmap overlap computation, whereas Elmeri used the index only for error correction of Rmaps. By combining our filtering of Rmaps with the exhaustive, but highly accurate, algorithm of Valouev et al. (2006), Selkie maintains or increases the accuracy of finding overlapping Rmaps on a bacterial dataset while being at least four times faster. Furthermore, for finding overlaps in a human dataset, Selkie is up to two orders of magnitude faster than previous methods.
7
Bertazzoni S, Jones DAB, Phan HT, Tan KC, Hane JK. Chromosome-level genome assembly and manually-curated proteome of model necrotroph Parastagonospora nodorum Sn15 reveals a genome-wide trove of candidate effector homologs, and redundancy of virulence-related functions within an accessory chromosome. BMC Genomics 2021; 22:382. [PMID: 34034667 PMCID: PMC8146201 DOI: 10.1186/s12864-021-07699-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/22/2020] [Accepted: 05/11/2021] [Indexed: 11/19/2022]
Abstract
Background: The fungus Parastagonospora nodorum causes septoria nodorum blotch (SNB) of wheat (Triticum aestivum) and is a model species for necrotrophic plant pathogens. The genome assembly of reference isolate Sn15 was first reported in 2007. P. nodorum infection is promoted by its production of proteinaceous necrotrophic effectors, three of which are characterised – ToxA, Tox1 and Tox3. Results: A chromosome-scale genome assembly of P. nodorum Australian reference isolate Sn15, which combined long read sequencing, optical mapping and manual curation, produced 23 chromosomes with 21 chromosomes possessing both telomeres. New transcriptome data were combined with fungal-specific gene prediction techniques and manual curation to produce a high-quality predicted gene annotation dataset, which comprises 13,869 high confidence genes, and an additional 2,534 lower confidence genes retained to assist pathogenicity effector discovery. Comparison to a panel of 31 internationally-sourced isolates identified multiple hotspots within the Sn15 genome for mutation or presence-absence variation, which was used to enhance subsequent effector prediction. Effector prediction resulted in 257 candidates, of which 98 higher-ranked candidates were selected for in-depth analysis and revealed a wealth of functions related to pathogenicity. Additionally, 11 out of the 98 candidates also exhibited orthology conservation patterns that suggested lateral gene transfer with other cereal-pathogenic fungal species. Analysis of the pan-genome indicated the smallest chromosome of 0.4 Mbp length to be an accessory chromosome (AC23). AC23 was notably absent from an avirulent isolate and is predominated by mutation hotspots with an increase in non-synonymous mutations relative to other chromosomes. Surprisingly, AC23 was deficient in effector candidates, but contained several predicted genes with redundant pathogenicity-related functions.
Conclusions: We present an updated series of genomic resources for P. nodorum Sn15 – an important reference isolate and model necrotroph – with a comprehensive survey of its predicted pathogenicity content. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-07699-8.
Affiliation(s)
- Darcy A B Jones
  - Centre for Crop & Disease Management, Curtin University, Perth, Australia
- Huyen T Phan
  - Centre for Crop & Disease Management, Curtin University, Perth, Australia
- Kar-Chun Tan
  - Centre for Crop & Disease Management, Curtin University, Perth, Australia
- James K Hane
  - Centre for Crop & Disease Management, Curtin University, Perth, Australia
  - Curtin Institute for Computation, Curtin University, Perth, Australia
8
Jeffet J, Margalit S, Michaeli Y, Ebenstein Y. Single-molecule optical genome mapping in nanochannels: multidisciplinarity at the nanoscale. Essays Biochem 2021; 65:51-66. [PMID: 33739394 PMCID: PMC8056043 DOI: 10.1042/ebc20200021] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/27/2021] [Revised: 02/24/2021] [Accepted: 02/26/2021] [Indexed: 12/12/2022]
Abstract
The human genome contains multiple layers of information that extend beyond the genetic sequence. In fact, identical genetics do not necessarily yield identical phenotypes as evident for the case of two different cell types in the human body. The great variation in structure and function displayed by cells with identical genetic background is attributed to additional genomic information content. This includes large-scale genetic aberrations, as well as diverse epigenetic patterns that are crucial for regulating specific cell functions. These genetic and epigenetic patterns operate in concert in order to maintain specific cellular functions in health and disease. Single-molecule optical genome mapping is a high-throughput genome analysis method that is based on imaging long chromosomal fragments stretched in nanochannel arrays. The access to long DNA molecules coupled with fluorescent tagging of various genomic information presents a unique opportunity to study genetic and epigenetic patterns in the genome at a single-molecule level over large genomic distances. Optical mapping entwines synergistically chemical, physical, and computational advancements, to uncover invaluable biological insights, inaccessible by sequencing technologies. Here we describe the method's basic principles of operation, and review the various available mechanisms to fluorescently tag genomic information. We present some of the recent biological and clinical impact enabled by optical mapping and present recent approaches for increasing the method's resolution and accuracy. Finally, we discuss how multiple layers of genomic information may be mapped simultaneously on the same DNA molecule, thus paving the way for characterizing multiple genomic observables on individual DNA molecules.
Affiliation(s)
- Jonathan Jeffet
  - Raymond and Beverly Sackler Faculty of Exact Sciences, Center for Nanoscience and Nanotechnology, Center for Light Matter Interaction, Tel Aviv University, Tel Aviv 6997801, Israel
- Sapir Margalit
  - Raymond and Beverly Sackler Faculty of Exact Sciences, Center for Nanoscience and Nanotechnology, Center for Light Matter Interaction, Tel Aviv University, Tel Aviv 6997801, Israel
- Yael Michaeli
  - Raymond and Beverly Sackler Faculty of Exact Sciences, Center for Nanoscience and Nanotechnology, Center for Light Matter Interaction, Tel Aviv University, Tel Aviv 6997801, Israel
- Yuval Ebenstein
  - Raymond and Beverly Sackler Faculty of Exact Sciences, Center for Nanoscience and Nanotechnology, Center for Light Matter Interaction, Tel Aviv University, Tel Aviv 6997801, Israel
9
Luo J, Wei Y, Lyu M, Wu Z, Liu X, Luo H, Yan C. A comprehensive review of scaffolding methods in genome assembly. Brief Bioinform 2021; 22:6149347. [PMID: 33634311 DOI: 10.1093/bib/bbab033] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 09/14/2020] [Revised: 01/21/2021] [Accepted: 01/22/2021] [Indexed: 12/20/2022]
Abstract
In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.
Affiliation(s)
- Junwei Luo
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Yawei Wei
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Mengna Lyu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Zhengjiang Wu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Xiaoyan Liu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
10
11
Hu J, Fan J, Sun Z, Liu S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 2020; 36:2253-2255. [PMID: 31778144 DOI: 10.1093/bioinformatics/btz891] [Citation(s) in RCA: 566] [Impact Index Per Article: 141.5] [Received: 05/13/2019] [Revised: 10/07/2019] [Accepted: 11/26/2019] [Indexed: 01/20/2023]
Abstract
MOTIVATION: Although long-read sequencing technologies can produce genomes with long contiguity, they suffer from high error rates. Thus, we developed NextPolish, a tool that efficiently corrects sequence errors in genomes assembled with long reads. This new tool consists of two interlinked modules that are designed to score and count K-mers from high quality short reads, and to polish genome assemblies containing large numbers of base errors. RESULTS: When evaluated for speed and efficiency using human and plant (Arabidopsis thaliana) genomes, NextPolish outperformed Pilon by correcting sequence errors faster and with a higher correction accuracy. AVAILABILITY AND IMPLEMENTATION: NextPolish is implemented in C and Python. The source code is available from https://github.com/Nextomics/NextPolish. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
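The idea of counting K-mers from accurate short reads and using the counts to locate suspect bases in an assembly can be sketched as follows. This is a toy illustration under stated assumptions, not NextPolish's scoring or correction procedure: the `min_count` threshold, the function names `count_kmers` and `weakly_supported_positions`, and the example reads are all hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weakly_supported_positions(assembly, counts, k, min_count=2):
    """Start positions of assembly k-mers seen fewer than min_count times in the reads."""
    return [
        i
        for i in range(len(assembly) - k + 1)
        if counts[assembly[i:i + k]] < min_count  # Counter returns 0 for unseen k-mers
    ]


# Toy short reads (assumed accurate) and a draft assembly with one substituted base.
short_reads = ["ACGTAC", "CGTACG", "ACGTAC"]
draft = "ACGTTC"

kmer_counts = count_kmers(short_reads, k=4)
suspect = weakly_supported_positions(draft, kmer_counts, k=4, min_count=2)
```

Positions flagged this way would only be candidates for correction; choosing the replacement base requires a further scoring step that this sketch does not attempt.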
Affiliation(s)
- Jiang Hu
  - GrandOmics Biosciences, Beijing, 102200, China
- Junpeng Fan
  - GrandOmics Biosciences, Beijing, 102200, China
- Zongyi Sun
  - GrandOmics Biosciences, Beijing, 102200, China
- Shanlin Liu
  - GrandOmics Biosciences, Beijing, 102200, China
12
Yuan Y, Chung CYL, Chan TF. Advances in optical mapping for genomic research. Comput Struct Biotechnol J 2020; 18:2051-2062. [PMID: 32802277 PMCID: PMC7419273 DOI: 10.1016/j.csbj.2020.07.018] [Citation(s) in RCA: 59] [Impact Index Per Article: 14.8] [Received: 05/10/2020] [Revised: 07/08/2020] [Accepted: 07/24/2020] [Indexed: 12/28/2022]
Abstract
Recent advances in optical mapping have allowed the construction of improved genome assemblies with greater contiguity. Optical mapping also enables genome comparison and identification of large-scale structural variations. Association of these large-scale genomic features with biological functions is an important goal in plant and animal breeding and in medical research. Optical mapping has also been used in microbiology and still plays an important role in strain typing and epidemiological studies. Here, we review the development of optical mapping in recent decades to illustrate its importance in genomic research. We detail its applications and algorithms to show its specific advantages. Finally, we discuss the challenges required to facilitate the optimization of optical mapping and improve its future development and application.
Key Words
- 3D, three-dimensional
- DBG, de Bruijn graph
- DLS, direct label and stain
- DNA, deoxyribonucleic acid
- Genome assembly
- Hi-C, high-throughput chromosome conformation capture
- Mb, million base pair
- Next generation sequencing
- OLC, overlap-layout-consensus
- Optical mapping
- PCR, polymerase chain reaction
- PacBio, Pacific Biosciences
- SRS, short-read sequencing
- SV, structural variation
- Structural variation
- bp, base pair
- kb, kilobase pair
Affiliation(s)
- Yuxuan Yuan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
  - AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
- Claire Yik-Lok Chung
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
- Ting-Fung Chan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
  - AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
13
Bouwens A, Deen J, Vitale R, D’Huys L, Goyvaerts V, Descloux A, Borrenberghs D, Grussmayer K, Lukes T, Camacho R, Su J, Ruckebusch C, Lasser T, Van De Ville D, Hofkens J, Radenovic A, Frans Janssen KP. Identifying microbial species by single-molecule DNA optical mapping and resampling statistics. NAR Genom Bioinform 2020; 2:lqz007. [PMID: 33575560 PMCID: PMC7671359 DOI: 10.1093/nargab/lqz007] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 08/30/2019] [Accepted: 09/12/2019] [Indexed: 12/13/2022]
Abstract
Single-molecule DNA mapping has the potential to serve as a powerful complement to high-throughput sequencing in metagenomic analysis. Offering longer read lengths and forgoing the need for complex library preparation and amplification, mapping stands to provide an unbiased view into the composition of complex viromes and/or microbiomes. To fully enable mapping-based metagenomics, sensitivity and specificity of DNA map analysis and identification need to be improved. Using detailed simulations and experimental data, we first demonstrate how fluorescence imaging of surface stretched, sequence specifically labeled DNA fragments can yield highly sensitive identification of targets. Second, a new analysis technique is introduced to increase specificity of the analysis, allowing even closely related species to be resolved. Third, we show how an increase in resolution improves sensitivity. Finally, we demonstrate that these methods are capable of identifying species with long genomes such as bacteria with high sensitivity.
Affiliation(s)
- Arno Bouwens
- Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
- Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
| | - Jochem Deen
- Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
| | - Raffaele Vitale
- Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
- LASIR CNRS, Université de Lille, 59655 Villeneuve d’Ascq, France
| | - Laurens D’Huys
- Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
| | - Vince Goyvaerts
- Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
| | - Adrien Descloux
- Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
| | | | - Kristin Grussmayer
- Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
| | - Tomas Lukes
- School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
| | - Rafael Camacho
- Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
| | - Jia Su
- Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
| | - Cyril Ruckebusch
- LASIR CNRS, Université de Lille, 59655 Villeneuve d’Ascq, France
| | - Theo Lasser
- School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
| | - Dimitri Van De Ville
- Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Center for Neuroprosthetics, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Department of Radiology and Medical Informatics, Université de Genève, 1205 Genève, Switzerland
| | - Johan Hofkens
- Department of Chemistry, Katholieke Universiteit Leuven, 3000 Leuven, Belgium
| | - Aleksandra Radenovic
- Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- School of Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
|
14
|
Sousa TDJ, Parise D, Profeta R, Parise MTD, Gomide ACP, Kato RB, Pereira FL, Figueiredo HCP, Ramos R, Brenig B, Costa da Silva ALD, Ghosh P, Barh D, Góes-Neto A, Azevedo V. Re-sequencing and optical mapping reveals misassemblies and real inversions on Corynebacterium pseudotuberculosis genomes. Sci Rep 2019; 9:16387. [PMID: 31705053 PMCID: PMC6841979 DOI: 10.1038/s41598-019-52695-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/04/2019] [Accepted: 10/18/2019] [Indexed: 12/29/2022]
Abstract
The number of draft genomes deposited in GenBank at the National Center for Biotechnology Information (NCBI) exceeds the number of complete ones. Draft genomes are assemblies that contain fragments of misassembled regions (gaps). Such draft genomes hinder a complete understanding of the biology and evolution of the organism because they lack genomic information. To overcome this problem, strategies to improve the assembly process are developed continuously. The greatest challenge to assembly remains the presence of repetitive DNA regions. This article highlights the use of optical mapping to detect and correct assembly errors in Corynebacterium pseudotuberculosis. We also demonstrate that a reference genome should be chosen with caution to avoid assembly errors and loss of genetic information.
Affiliation(s)
- Thiago de Jesus Sousa
- Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Doglas Parise
- Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Rodrigo Profeta
- Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | | | - Anne Cybelle Pinto Gomide
- Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Rodrigo Bentos Kato
- Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Felipe Luiz Pereira
- National Reference Laboratory for Aquatic Animal Diseases (AQUACEN) of Ministry of Agriculture, Livestock and Food Supply, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Henrique Cesar Pereira Figueiredo
- National Reference Laboratory for Aquatic Animal Diseases (AQUACEN) of Ministry of Agriculture, Livestock and Food Supply, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Rommel Ramos
- Institute of Biological Sciences, Federal University of Pará, Belém, Pará, Brazil
| | - Bertram Brenig
- Institute of Veterinary Medicine, University Göttingen, Göttingen, Germany
| | | | - Preetam Ghosh
- Department of Computer Science, Virginia Commonwealth University, Richmond, United States
| | - Debmalya Barh
- Institute of Integrative Omics and Applied Biotechnology, Nonakuri West Bengal, India
| | - Aristóteles Góes-Neto
- Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Vasco Azevedo
- Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil.
|
15
|
Wu S, Turner KM, Nguyen N, Raviram R, Erb M, Santini J, Luebeck J, Rajkumar U, Diao Y, Li B, Zhang W, Jameson N, Corces MR, Granja JM, Chen X, Coruh C, Abnousi A, Houston J, Ye Z, Hu R, Yu M, Kim H, Law JA, Verhaak RGW, Hu M, Furnari FB, Chang HY, Ren B, Bafna V, Mischel PS. Circular ecDNA promotes accessible chromatin and high oncogene expression. Nature 2019; 575:699-703. [PMID: 31748743 PMCID: PMC7094777 DOI: 10.1038/s41586-019-1763-5] [Citation(s) in RCA: 312] [Impact Index Per Article: 62.4] [Received: 11/06/2018] [Accepted: 09/26/2019] [Indexed: 01/01/2023]
Abstract
Oncogenes are commonly amplified on particles of extrachromosomal DNA (ecDNA) in cancer, but our understanding of the structure of ecDNA and its effect on gene regulation is limited. Here, by integrating ultrastructural imaging, long-range optical mapping and computational analysis of whole-genome sequencing, we demonstrate the structure of circular ecDNA. Pan-cancer analyses reveal that oncogenes encoded on ecDNA are among the most highly expressed genes in the transcriptome of the tumours, linking increased copy number with high transcription levels. Quantitative assessment of the chromatin state reveals that although ecDNA is packaged into chromatin with intact domain structure, it lacks higher-order compaction that is typical of chromosomes and displays significantly enhanced chromatin accessibility. Furthermore, ecDNA is shown to have a significantly greater number of ultra-long-range interactions with active chromatin, which provides insight into how the structure of circular ecDNA affects oncogene function, and connects ecDNA biology with modern cancer genomics and epigenetics.
Affiliation(s)
- Sihan Wu
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
| | - Kristen M Turner
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
- Boundless Bio, Inc., La Jolla, CA, USA
| | - Nam Nguyen
- Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA
- Boundless Bio, Inc., La Jolla, CA, USA
| | - Ramya Raviram
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
| | - Marcella Erb
- UCSD Light Microscopy Core Facility, Department of Neurosciences, University of California at San Diego, La Jolla, CA, USA
| | - Jennifer Santini
- UCSD Light Microscopy Core Facility, Department of Neurosciences, University of California at San Diego, La Jolla, CA, USA
| | - Jens Luebeck
- Bioinformatics & Systems Biology Graduate Program, University of California at San Diego, La Jolla, CA, USA
| | - Utkrisht Rajkumar
- Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA
| | - Yarui Diao
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
- Department of Cell Biology, Regeneration Next Initiative, Duke University School of Medicine, Durham, NC, USA
- Department of Orthopaedic Surgery, Regeneration Next Initiative, Duke University School of Medicine, Durham, NC, USA
| | - Bin Li
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
| | - Wenjing Zhang
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
| | - Nathan Jameson
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
| | - M Ryan Corces
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
| | - Jeffrey M Granja
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
| | - Xingqi Chen
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
- Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
| | - Ceyda Coruh
- Plant Molecular and Cellular Biology Laboratory, Salk Institute for Biological Studies, La Jolla, CA, USA
| | - Armen Abnousi
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH, USA
| | - Jack Houston
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
| | - Zhen Ye
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
| | - Rong Hu
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
| | - Miao Yu
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
| | - Hoon Kim
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Julie A Law
- Plant Molecular and Cellular Biology Laboratory, Salk Institute for Biological Studies, La Jolla, CA, USA
| | - Roel G W Verhaak
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Ming Hu
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH, USA
| | - Frank B Furnari
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA
| | - Howard Y Chang
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA.
- Howard Hughes Medical Institute, Stanford University, Stanford, CA, USA.
| | - Bing Ren
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA.
- Department of Cellular and Molecular Medicine, Center for Epigenomics, University of California at San Diego, La Jolla, CA, USA.
- Institute of Genomic Medicine, Moores Cancer Center, University of California at San Diego, La Jolla, CA, USA.
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA.
| | - Paul S Mischel
- Ludwig Institute for Cancer Research, University of California at San Diego, La Jolla, CA, USA.
- Moores Cancer Center, University of California at San Diego, La Jolla, CA, USA.
- Department of Pathology, University of California at San Diego, La Jolla, CA, USA.
|
16
|
Sedlazeck FJ, Lee H, Darby CA, Schatz MC. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet 2018; 19:329-346. [PMID: 29599501 DOI: 10.1038/s41576-018-0003-4] [Citation(s) in RCA: 291] [Impact Index Per Article: 58.2] [Indexed: 01/11/2023]
Abstract
Several new genomics technologies have become available that offer long-read sequencing or long-range mapping with higher throughput and higher resolution analysis than ever before. These long-range technologies are rapidly advancing the field with improved reference genomes, more comprehensive variant identification and more complete views of transcriptomes and epigenomes. However, they also require new bioinformatics approaches to take full advantage of their unique characteristics while overcoming their complex errors and modalities. Here, we discuss several of the most important applications of the new technologies, focusing on both the currently available bioinformatics tools and opportunities for future research.
Affiliation(s)
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Hayan Lee
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Charlotte A Darby
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
|
17
|
Abstract
The computational reconstruction of genome sequences from shotgun sequencing data has been greatly simplified by the advent of sequencing technologies that generate long reads. In the case of relatively small genomes (e.g., bacterial or viral), complete genome sequences can frequently be reconstructed computationally without the need for further experiments. However, large and complex genomes, such as those of most animals and plants, continue to pose significant challenges. In such genomes, assembly software produces incomplete and fragmented reconstructions that require additional experimentally derived information and manual intervention in order to reconstruct individual chromosome arms. Recent technologies originally designed to capture chromatin structure have been shown to effectively complement sequencing data, leading to much more contiguous reconstructions of genomes than previously possible. Here, we survey these technologies and the algorithms used to assemble and analyze large eukaryotic genomes, placed within the historical context of genome scaffolding technologies that have been in existence since the dawn of the genomic era.
Affiliation(s)
- Jay Ghurye
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
| | - Mihai Pop
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
|
18
|
Leung AKY, Kwok TP, Wan R, Xiao M, Kwok PY, Yip KY, Chan TF. OMBlast: alignment tool for optical mapping using a seed-and-extend approach. Bioinformatics 2017; 33:311-319. [PMID: 28172448 PMCID: PMC5409310 DOI: 10.1093/bioinformatics/btw620] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 09/26/2015] [Revised: 08/31/2016] [Accepted: 09/26/2016] [Indexed: 11/15/2022]
Abstract
Motivation: Optical mapping is a technique for capturing fluorescent signal patterns of long DNA molecules (in the range of 0.1–1 Mbp). Recently, it has been complementing the widely used short-read sequencing technology by assisting with scaffolding and detecting large and complex structural variations (SVs). Here, we introduce a fast, robust and accurate tool called OMBlast for aligning optical maps, the set of signal locations on the molecules generated from optical mapping. Our method is based on the seed-and-extend approach from sequence alignment, with modifications specific to optical mapping. Results: Experiments with both synthetic and our real data demonstrate that OMBlast has higher accuracy and faster mapping speed than existing alignment methods. Our tool also shows significant improvement when aligning data with SVs. Availability and Implementation: OMBlast is implemented for Java 1.7 and is released under a GPL license. OMBlast can be downloaded from https://github.com/aldenleung/OMBlast and run directly on machines equipped with a Java virtual machine. Supplementary information: Supplementary data are available at Bioinformatics online.
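As an illustration of the general seed-and-extend idea named in this abstract, the following is a minimal sketch and not OMBlast's actual algorithm. It assumes each optical map is a plain list of fragment sizes (distances between neighbouring labels, in kb), uses a relative size tolerance, and ignores missing or extra labels, which a real aligner has to handle. The function names `close`, `find_seeds`, `extend` and `align` are hypothetical.

```python
def close(a, b, tol=0.1):
    """True if two fragment sizes agree within a relative tolerance."""
    return abs(a - b) <= tol * max(a, b)


def find_seeds(query, ref, k=3, tol=0.1):
    """Return (q_start, r_start) pairs where k consecutive fragments agree."""
    seeds = []
    for qi in range(len(query) - k + 1):
        for ri in range(len(ref) - k + 1):
            if all(close(query[qi + j], ref[ri + j], tol) for j in range(k)):
                seeds.append((qi, ri))
    return seeds


def extend(query, ref, qi, ri, k=3, tol=0.1):
    """Grow a seed in both directions; return (q_start, r_start, length)."""
    lo_q, lo_r = qi, ri
    while lo_q > 0 and lo_r > 0 and close(query[lo_q - 1], ref[lo_r - 1], tol):
        lo_q -= 1
        lo_r -= 1
    hi_q, hi_r = qi + k, ri + k
    while hi_q < len(query) and hi_r < len(ref) and close(query[hi_q], ref[hi_r], tol):
        hi_q += 1
        hi_r += 1
    return lo_q, lo_r, hi_q - lo_q


def align(query, ref, k=3, tol=0.1):
    """Return the longest extended hit, or None if no seed is found."""
    best = None
    for qi, ri in find_seeds(query, ref, k, tol):
        hit = extend(query, ref, qi, ri, k, tol)
        if best is None or hit[2] > best[2]:
            best = hit
    return best


# Toy data (made up for illustration): fragment sizes in kb.
reference = [12.0, 5.1, 33.0, 8.4, 20.2, 15.0, 7.7, 41.0]
molecule = [8.0, 21.0, 14.5, 7.9]
print(align(molecule, reference))
```

The seed step keeps the search cheap by only requiring a short run of agreeing fragments; the extension step then decides how far the match really reaches.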
Affiliation(s)
| | - Tsz-Piu Kwok
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
| | - Raymond Wan
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong, China
| | - Ming Xiao
- School of Biomedical Engineering, Science and Health System, Drexel University, Philadelphia, PA, USA
| | - Pui-Yan Kwok
- Institute for Human Genetics
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, CA, USA
| | - Kevin Y Yip
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
- Hong Kong Bioinformatics Centre
| | - Ting-Fung Chan
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong, China
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
- Centre for Soybean Research, State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong, China
- Hong Kong Bioinformatics Centre
|
19
|
Abstract
The output from whole genome sequencing is a set of contigs, i.e. short non-overlapping DNA sequences (sizes 1-100 kilobasepairs). Piecing the contigs together is an especially difficult task for previously unsequenced DNA, and may not be feasible due to factors such as the lack of sufficient coverage or larger repetitive regions which generate gaps in the final sequence. Here we propose a new method for scaffolding such contigs. The proposed method uses densely labeled optical DNA barcodes from competitive binding experiments as scaffolds. On these scaffolds we position theoretical barcodes which are calculated from the contig sequences. This allows us to construct longer DNA sequences from the contig sequences. This proof-of-principle study extends previous studies which use sparsely labeled DNA barcodes for scaffolding purposes. Our method applies a probabilistic approach that allows us to discard “foreign” contigs from mixed samples with contigs from different types of DNA. We satisfy the contig non-overlap constraint by formulating the contig placement challenge as a combinatorial auction problem. Our exact algorithm for solving this problem reduces computational costs compared to previous methods in the combinatorial auction field. We demonstrate the usefulness of the proposed scaffolding method both for synthetic contigs and for contigs obtained using Illumina sequencing for a mixed sample with plasmid and chromosomal DNA.
|
20
|
Chaney L, Sharp AR, Evans CR, Udall JA. Genome Mapping in Plant Comparative Genomics. TRENDS IN PLANT SCIENCE 2016; 21:770-780. [PMID: 27289181 DOI: 10.1016/j.tplants.2016.05.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 01/04/2016] [Revised: 04/27/2016] [Accepted: 05/12/2016] [Indexed: 05/10/2023]
Abstract
Genome mapping produces fingerprints of DNA sequences to construct a physical map of the whole genome. It provides contiguous, long-range information that complements and, in some cases, replaces sequencing data. Recent advances in genome-mapping technology will better allow researchers to detect large (>1 kbp) structural variations between plant genomes. Some molecular and informatics complications need to be overcome for this novel technology to achieve its full utility. This technology will be useful for understanding phenotype responses due to DNA rearrangements and will yield insights into genome evolution, particularly in polyploids. In this review, we outline recent advances in genome-mapping technology, including the processes required for data collection and analysis, and applications in plant comparative genomics.
Affiliation(s)
- Lindsay Chaney
- Plant and Wildlife Sciences Department, Brigham Young University, Provo, UT 84602, USA
| | - Aaron R Sharp
- Plant and Wildlife Sciences Department, Brigham Young University, Provo, UT 84602, USA
| | - Carrie R Evans
- Plant and Wildlife Sciences Department, Brigham Young University, Provo, UT 84602, USA
| | - Joshua A Udall
- Plant and Wildlife Sciences Department, Brigham Young University, Provo, UT 84602, USA.
|
21
|
Veeckman E, Ruttink T, Vandepoele K. Are We There Yet? Reliably Estimating the Completeness of Plant Genome Sequences. THE PLANT CELL 2016; 28:1759-68. [PMID: 27512012 PMCID: PMC5006709 DOI: 10.1105/tpc.16.00349] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 05/03/2016] [Revised: 07/13/2016] [Accepted: 08/09/2016] [Indexed: 05/18/2023]
Abstract
Genome sequencing is becoming cheaper and faster thanks to the introduction of next-generation sequencing techniques. Dozens of new plant genome sequences have been released in recent years, ranging from small to gigantic repeat-rich or polyploid genomes. Most genome projects have a dual purpose: delivering a contiguous, complete genome assembly and creating a full catalog of correctly predicted genes. Frequently, the completeness of a species' gene catalog is measured using a set of marker genes that are expected to be present. This expectation can be defined along an evolutionary gradient, ranging from highly conserved genes to species-specific genes. Large-scale population resequencing studies have revealed that gene space is fairly variable even between closely related individuals, which limits the definition of the expected gene space, and, consequently, the accuracy of estimates used to assess genome and gene space completeness. We argue that, based on the desired applications of a genome sequencing project, different completeness scores for the genome assembly and/or gene space should be determined. Using examples from several dicot and monocot genomes, we outline some pitfalls and recommendations regarding methods to estimate completeness during different steps of genome assembly and annotation.
Affiliation(s)
- Elisabeth Veeckman
- Institute for Agricultural and Fisheries Research, Plant Sciences Unit, Growth and Development, B-9090 Melle, Belgium
- Bioinformatics Institute Ghent, Ghent University, B-9052 Ghent, Belgium
| | - Tom Ruttink
- Institute for Agricultural and Fisheries Research, Plant Sciences Unit, Growth and Development, B-9090 Melle, Belgium
- Bioinformatics Institute Ghent, Ghent University, B-9052 Ghent, Belgium
| | - Klaas Vandepoele
- Bioinformatics Institute Ghent, Ghent University, B-9052 Ghent, Belgium
- Department of Plant Systems Biology, VIB, Technologiepark 927, B-9052 Ghent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, B-9052 Ghent, Belgium
|
22
|
Sharp AR, Udall JA. OMWare: a tool for efficient assembly of genome-wide physical maps. BMC Bioinformatics 2016; 17 Suppl 7:241. [PMID: 27454532 PMCID: PMC4965707 DOI: 10.1186/s12859-016-1099-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Physical mapping of DNA with restriction enzymes allows for the characterization and assembly of much longer molecules than is feasible with sequencing. However, assemblies of physical map data are sensitive to input parameters, which describe noise inherent in the data collection process. One possible way to determine the parameter values that best describe a dataset is by trial and error. RESULTS: Here we present OMWare, a tool that efficiently generated 405 de novo map assemblies of a single dataset collected from the cotton species Gossypium raimondii. The assemblies were generated using various input parameter values, and were completed more efficiently by re-using compatible intermediate results. These assemblies were assayed for contiguity, internal consistency, and accuracy. CONCLUSIONS: Resulting assemblies had variable qualities. Although highly accurate assemblies were found, contiguity and internal consistency metrics were poor predictors of accuracy.
Affiliation(s)
- Aaron R Sharp
- College of Life Sciences, Brigham Young University, Provo, UT, 84602-2400, USA.
| | - Joshua A Udall
- College of Life Sciences, Brigham Young University, Provo, UT, 84602-2400, USA
|
23
|
Olson ND, Zook JM, Samarov DV, Jackson SA, Salit ML. PEPR: pipelines for evaluating prokaryotic references. Anal Bioanal Chem 2016; 408:2975-83. [PMID: 26935931 PMCID: PMC4819933 DOI: 10.1007/s00216-015-9299-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 09/30/2015] [Revised: 12/21/2015] [Accepted: 12/23/2015] [Indexed: 11/17/2022]
Abstract
The rapid adoption of microbial whole genome sequencing in public health, clinical testing, and forensic laboratories requires the use of validated measurement processes. Well-characterized, homogeneous, and stable microbial genomic reference materials can be used to evaluate measurement processes, improving confidence in microbial whole genome sequencing results. We have developed a reproducible and transparent bioinformatics tool, PEPR, Pipelines for Evaluating Prokaryotic References, for characterizing the reference genome of prokaryotic genomic materials. PEPR evaluates the quality, purity, and homogeneity of the reference material genome, and purity of the genomic material. The quality of the genome is evaluated using high coverage paired-end sequence data; coverage, paired-end read size and direction, as well as soft-clipping rates, are used to identify mis-assemblies. The homogeneity and purity of the material relative to the reference genome are characterized by comparing base calls from replicate datasets generated using multiple sequencing technologies. Genomic purity of the material is assessed by checking for DNA contaminants. We demonstrate the tool and its output using sequencing data while developing a Staphylococcus aureus candidate genomic reference material. PEPR is open source and available at https://github.com/usnistgov/pepr.
Affiliation(s)
- Nathan D Olson
- Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.
| | - Justin M Zook
- Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Daniel V Samarov
- Statistical Engineering Division, Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Scott A Jackson
- Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Marc L Salit
- Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Department of Bioengineering, Stanford University, Stanford, CA, USA
|
24
|
Verzotto D, Teo ASM, Hillmer AM, Nagarajan N. OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis. Gigascience 2016; 5:2. [PMID: 26793302 PMCID: PMC4719737 DOI: 10.1186/s13742-016-0110-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/09/2015] [Accepted: 01/06/2016] [Indexed: 12/16/2022]
Abstract
BACKGROUND: Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilo base pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive map-alignment algorithms for robustly aligning error-prone maps to sequences. RESULTS: We introduce a novel seed-and-extend glocal (short for global-local) alignment method, OPTIMA (and a sliding-window extension for overlap alignment, OPTIMA-Overlap), which is the first to create indexes for continuous-valued mapping data while accounting for mapping errors. We also present a novel statistical model, agnostic with respect to technology-dependent error rates, for conservatively evaluating the significance of alignments without relying on expensive permutation-based tests. CONCLUSIONS: We show that OPTIMA and OPTIMA-Overlap outperform other state-of-the-art approaches (1.6-2 times more sensitive) and are more efficient (170-200%) and precise in their alignments (nearly 99% precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust, provide improved sensitivity and guarantee high precision.
Affiliation(s)
- Davide Verzotto
- Computational and Systems Biology, Genome Institute of Singapore, 60 Biopolis Street, Singapore, 138672 Singapore
| | - Audrey S. M. Teo
- Cancer Therapeutics and Stratified Oncology, Genome Institute of Singapore, 60 Biopolis Street, Singapore, 138672 Singapore
| | - Axel M. Hillmer
- Cancer Therapeutics and Stratified Oncology, Genome Institute of Singapore, 60 Biopolis Street, Singapore, 138672 Singapore
| | - Niranjan Nagarajan
- Computational and Systems Biology, Genome Institute of Singapore, 60 Biopolis Street, Singapore, 138672 Singapore
|
25
|
Abstract
Optical Mapping is an established single-molecule, whole-genome analysis system, which has been used to gain a comprehensive understanding of genomic structure and to study structural variation of complex genomes. A critical component of Optical Mapping system is the image processing module, which extracts single molecule restriction maps from image datasets of immobilized, restriction digested and fluorescently stained large DNA molecules. In this review, we describe robust and efficient image processing techniques to process these massive datasets and extract accurate restriction maps in the presence of noise, ambiguity and confounding artifacts. We also highlight a few applications of the Optical Mapping system.
Affiliation(s)
- Prabu Ravindran
- Laboratory of Molecular and Computational Genomics, Department of Chemistry, Laboratory of Genetics and Biotechnology Center, University of Wisconsin, 425 Henry Mall, Madison, USA
| | - Aditya Gupta
- Laboratory of Molecular and Computational Genomics, Department of Chemistry, Laboratory of Genetics and Biotechnology Center, University of Wisconsin, 425 Henry Mall, Madison, USA
|
26
|
Muggli MD, Puglisi SJ, Ronen R, Boucher C. Misassembly detection using paired-end sequence reads and optical mapping data. Bioinformatics 2015; 31:i80-8. [PMID: 26072512 PMCID: PMC4542784 DOI: 10.1093/bioinformatics/btv262] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Indexed: 11/22/2022]
Abstract
Motivation: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called misSEQuel that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open source computational methods for analyzing optical mapping data. We apply our method to various assemblies of the loblolly pine, Francisella tularensis, rice and budgerigar genomes. We generated and used simulated optical mapping data for loblolly pine and F. tularensis and used real optical mapping data for rice and budgerigar. Results: Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in assemblies of F. tularensis and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in assemblies of loblolly pine. Using the real optical mapping data, we correctly identified 75% of extensively misassembled contigs and 100% of locally misassembled contigs in rice, and 77% of extensively misassembled contigs and 80% of locally misassembled contigs in budgerigar. Availability and implementation: misSEQuel can be used as a post-processing step in combination with any genome assembler and is freely available at http://www.cs.colostate.edu/seq/. Contact: muggli@cs.colostate.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin D Muggli
- Department of Computer Science, Colorado State University, Fort Collins, CO 80526, USA, Department of Computer Science, University of Helsinki, Finland and Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
| | - Simon J Puglisi
- Department of Computer Science, Colorado State University, Fort Collins, CO 80526, USA, Department of Computer Science, University of Helsinki, Finland and Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
| | - Roy Ronen
- Department of Computer Science, Colorado State University, Fort Collins, CO 80526, USA, Department of Computer Science, University of Helsinki, Finland and Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
| | - Christina Boucher
- Department of Computer Science, Colorado State University, Fort Collins, CO 80526, USA, Department of Computer Science, University of Helsinki, Finland and Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
|
27
|
Shelton JM, Coleman MC, Herndon N, Lu N, Lam ET, Anantharaman T, Sheth P, Brown SJ. Tools and pipelines for BioNano data: molecule assembly pipeline and FASTA super scaffolding tool. BMC Genomics 2015; 16:734. [PMID: 26416786 PMCID: PMC4587741 DOI: 10.1186/s12864-015-1911-8] [Citation(s) in RCA: 69] [Impact Index Per Article: 7.7] [Received: 04/20/2015] [Accepted: 09/09/2015] [Indexed: 12/05/2022]
Abstract
Background: Genome assembly remains an unsolved problem. Assembly projects face a range of hurdles that confound assembly. Thus, a variety of tools and approaches are needed to improve draft genomes. Results: We used a custom assembly workflow to optimize consensus genome map assembly, resulting in an assembly equal to the estimated length of the Tribolium castaneum genome and with an N50 of more than 1 Mb. We used this map for super scaffolding the T. castaneum sequence assembly, more than tripling its N50 with the program Stitch. Conclusions: In this article we present software that leverages consensus genome maps assembled from extremely long single molecule maps to increase the contiguity of sequence assemblies. We report the results of applying these tools to validate and improve a 7x Sanger draft of the T. castaneum genome. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1911-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jennifer M Shelton
- KSU/K-INBRE Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS, USA.
| | - Michelle C Coleman
- KSU/K-INBRE Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS, USA.
| | - Nic Herndon
- KSU/K-INBRE Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS, USA.
| | - Nanyan Lu
- BioNano Genomics, San Diego, CA, USA.
| | - Palak Sheth
- KSU/K-INBRE Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS, USA.
| | - Susan J Brown
- KSU/K-INBRE Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS, USA.
|
28
|
Adams DJ, Doran AG, Lilue J, Keane TM. The Mouse Genomes Project: a repository of inbred laboratory mouse strain genomes. Mamm Genome 2015; 26:403-12. [PMID: 26123534 DOI: 10.1007/s00335-015-9579-6] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Received: 02/25/2015] [Accepted: 06/11/2015] [Indexed: 12/16/2022]
Abstract
The Mouse Genomes Project was initiated in 2009 with the goal of using next-generation sequencing technologies to catalogue molecular variation in the common laboratory mouse strains and a selected set of wild-derived inbred strains. The initial sequencing and survey of sequence variation in 17 inbred strains was completed in 2011 and included a comprehensive catalogue of single nucleotide polymorphisms, short insertion/deletions, larger structural variants including their fine scale architecture and landscape of transposable element variation, and genomic sites subject to post-transcriptional alteration of RNA. From this beginning, the resource has expanded significantly to include 36 fully sequenced inbred laboratory mouse strains, a refined and updated data processing pipeline, and new variation querying and data visualisation tools, which are available on the project's website (http://www.sanger.ac.uk/resources/mouse/genomes/). The focus of the project is now the completion of de novo assembled chromosome sequences and strain-specific gene structures for the core strains. We discuss how the assembled chromosomes will power comparative analysis, data access tools and future directions of mouse genetics.
Affiliation(s)
- David J Adams
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK.
| | - Anthony G Doran
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK.
| | - Jingtao Lilue
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK.
| | - Thomas M Keane
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK.
|