1
Luo J, Guan T, Chen G, Yu Z, Zhai H, Yan C, Luo H. SLHSD: hybrid scaffolding method based on short and long reads. Brief Bioinform 2023; 24:7152317. [PMID: 37141142 DOI: 10.1093/bib/bbad169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2022] [Revised: 01/08/2023] [Accepted: 04/12/2023] [Indexed: 05/05/2023]
Abstract
In genome assembly, scaffolding can obtain more complete and continuous scaffolds. Current scaffolding methods usually adopt one type of read to construct a scaffold graph and then orient and order contigs. However, scaffolding with the strengths of two or more types of reads seems to be a better solution to some tricky problems. Combining the advantages of different types of data is significant for scaffolding. Here, a hybrid scaffolding method (SLHSD) is presented that simultaneously leverages the precision of short reads and the length advantage of long reads. Building an optimal scaffold graph is an important foundation for getting scaffolds. SLHSD uses a new algorithm that combines long and short read alignment information to determine whether to add an edge and how to calculate the edge weight in a scaffold graph. In addition, SLHSD develops a strategy to ensure that edges with high confidence can be added to the graph with priority. Then, a linear programming model is used to detect and remove remaining false edges in the graph. We compared SLHSD with other scaffolding methods on five datasets. Experimental results show that SLHSD outperforms other methods. The open-source code of SLHSD is available at https://github.com/luojunwei/SLHSD.
Affiliation(s)
- Junwei Luo
  - School of Software, Henan Polytechnic University, Jiaozuo 454003, China
- Ting Guan
  - School of Software, Henan Polytechnic University, Jiaozuo 454003, China
- Guolin Chen
  - School of Software, Henan Polytechnic University, Jiaozuo 454003, China
- Zhonghua Yu
  - School of Software, Henan Polytechnic University, Jiaozuo 454003, China
- Haixia Zhai
  - School of Software, Henan Polytechnic University, Jiaozuo 454003, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng 475001, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng 475001, China
2
Walve R, Salmela L. HGGA: hierarchical guided genome assembler. BMC Bioinformatics 2022; 23:167. [PMID: 35525918 PMCID: PMC9077837 DOI: 10.1186/s12859-022-04701-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2021] [Accepted: 04/25/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND De novo genome assembly typically produces a set of contigs instead of the complete genome. Thus additional data such as genetic linkage maps, optical maps, or Hi-C data is needed to resolve the complete structure of the genome. Most of the previous work uses the additional data to order and orient contigs. RESULTS Here we introduce a framework to guide genome assembly with additional data. Our approach is based on clustering the reads, such that each read in each cluster originates from nearby positions in the genome according to the additional data. These sets are then assembled independently and the resulting contigs are further assembled in a hierarchical manner. We implemented our approach for genetic linkage maps in a tool called HGGA. CONCLUSIONS Our experiments on simulated and real Pacific Biosciences long reads and genetic linkage maps show that HGGA produces a more contiguous assembly with fewer contigs and from 1.2 to 9.8 times higher NGA50 or N50 than a plain assembly of the reads and 1.03 to 6.5 times higher NGA50 or N50 than a previous approach integrating genetic linkage maps with contig assembly. Furthermore, the correctness of the assembly remains similar or improves compared with an assembly using only the read data.
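The N50 statistic reported in the abstract above can be illustrated with a short sketch. This is a generic computation of N50 and the related NG50, not code from HGGA; the function names and the example lengths are made up for illustration. NGA50 additionally requires aligning contigs to a reference and splitting them at misassemblies, which this sketch does not attempt.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def ng50(lengths, genome_size):
    """Like N50, but relative to a known or estimated genome size.

    Returns None when the contigs cover less than half of the genome.
    """
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= genome_size:
            return length
    return None


# Hypothetical contig lengths (arbitrary units), for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)
example_ng50 = ng50(example_lengths, genome_size=400)
```

Because N50 depends only on the length distribution, it rewards long contigs even when they are misjoined, which is why abstracts such as this one report it alongside NGA50 and correctness measures.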
Affiliation(s)
- Riku Walve
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland
3
Mukherjee K, Dole-Muinos D, Ajayi A, Rossi M, Prosperi M, Boucher C. Finding Overlapping Rmaps via Clustering. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; PP:1-1. [PMID: 34890332 DOI: 10.1109/tcbb.2021.3132534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Optical mapping has been largely automated, and first produces single molecule restriction maps, called Rmaps, which are assembled to generate genome wide optical maps. Since the location and orientation of each Rmap is unknown, the first problem in the analysis of this data is finding related Rmaps, i.e., pairs of Rmaps that share the same orientation and have significant overlap in their genomic location. Although heuristics for identifying related Rmaps exist, they all require quantization of the data, which leads to a loss of precision. In this paper, we propose a clustering method based on Gaussian mixture modelling, which we refer to as OMclust, that finds overlapping Rmaps without quantization. Using both simulated and real datasets, we show that OMclust substantially improves the precision (from 48.3% to 73.3%) over the state-of-the-art methods while also reducing CPU time and memory consumption. Further, we integrated OMclust into the error correction methods (Elmeri and Comet) to demonstrate the increase in the performance of these methods. When OMclust was combined with Comet to error correct Rmap data generated from human DNA, it was able to error correct close to 3x more Rmaps, and reduced the CPU time by more than 35x.
4
Huang B, Wei G, Wang B, Ju F, Zhong Y, Shi Z, Sun S, Bu D. Filling gaps of genome scaffolds via probabilistic searching optical maps against assembly graph. BMC Bioinformatics 2021; 22:533. [PMID: 34717539 PMCID: PMC8557617 DOI: 10.1186/s12859-021-04448-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2021] [Accepted: 10/18/2021] [Indexed: 11/24/2022]
Abstract
BACKGROUND Optical maps record locations of specific enzyme recognition sites within long genome fragments. This long-distance information enables aligning genome assembly contigs onto optical maps and ordering contigs into scaffolds. The generated scaffolds, however, often contain a large number of gaps. To fill these gaps, a feasible way is to search the genome assembly graph for the best-matching contig paths that connect boundary contigs of gaps. The combination of searching and evaluation procedures might be "searching followed by evaluation", which is infeasible for long gaps, or "searching by evaluation", which heavily relies on heuristics and thus usually yields unreliable contig paths. RESULTS We here report an accurate and efficient approach to filling gaps of genome scaffolds with the aid of optical maps. Using simulated data from 12 species and real data from 3 species, we demonstrate the successful application of our approach in gap filling with improved accuracy and completeness of genome scaffolds. CONCLUSION Our approach applies a sequential Bayesian updating technique to measure the similarity between optical maps and candidate contig paths. Using this similarity to guide path searching, our approach achieves higher accuracy than the existing "searching by evaluation" strategy that relies on heuristics. Furthermore, unlike the "searching followed by evaluation" strategy enumerating all possible paths, our approach prunes the unlikely sub-paths and extends the highly-probable ones only, thus significantly increasing searching efficiency.
Affiliation(s)
- Bin Huang
  - Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190 China
  - Institute of Biology, University of Chinese Academy of Sciences, Beijing, 100049 China
- Guozheng Wei
  - Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190 China
  - Institute of Biology, University of Chinese Academy of Sciences, Beijing, 100049 China
- Bing Wang
  - Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190 China
  - Institute of Biology, University of Chinese Academy of Sciences, Beijing, 100049 China
- Fusong Ju
  - Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190 China
  - Institute of Biology, University of Chinese Academy of Sciences, Beijing, 100049 China
- Yi Zhong
  - School of Computer Science, University of Washington, Seattle, 98195 USA
- Zhuozheng Shi
  - Department of Computer Science and Engineering, University of California, San Diego, La Jolla, 92093 USA
- Shiwei Sun
  - Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190 China
  - Institute of Biology, University of Chinese Academy of Sciences, Beijing, 100049 China
  - Zhongke Big Data Academy, Zhengzhou, 450046 Henan China
- Dongbo Bu
  - Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190 China
  - Institute of Biology, University of Chinese Academy of Sciences, Beijing, 100049 China
  - Zhongke Big Data Academy, Zhengzhou, 450046 Henan China
5
Abstract
In genomics, optical mapping technology provides long-range contiguity information to improve genome sequence assemblies and detect structural variation. Originally a laborious manual process, Bionano Genomics platforms now offer high-throughput, automated optical mapping based on chips packed with nanochannels through which unwound DNA is guided and the fluorescent DNA backbone and specific restriction sites are recorded. Although the raw image data obtained is of high quality, the processing and assembly software accompanying the platforms is closed source and does not seem to make full use of data, labeling approximately half of the measured signals as unusable. Here we introduce two new software tools, independent of Bionano Genomics software, to extract and process molecules from raw images (OptiScan) and to perform molecule-to-molecule and molecule-to-reference alignments using a novel signal-based approach (OptiMap). We demonstrate that the molecules detected by OptiScan can yield better assemblies, and that the approach taken by OptiMap results in higher use of molecules from the raw data. These tools lay the foundation for a suite of open-source methods to process and analyze high-throughput optical mapping data. The Python implementations of the OptiTools are publicly available through http://www.bif.wur.nl/.
6
Luo J, Wei Y, Lyu M, Wu Z, Liu X, Luo H, Yan C. A comprehensive review of scaffolding methods in genome assembly. Brief Bioinform 2021; 22:6149347. [PMID: 33634311 DOI: 10.1093/bib/bbab033] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 09/14/2020] [Revised: 01/21/2021] [Accepted: 01/22/2021] [Indexed: 12/20/2022]
Abstract
In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.
Affiliation(s)
- Junwei Luo
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Yawei Wei
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Mengna Lyu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Zhengjiang Wu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Xiaoyan Liu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
7
Salmela L, Mukherjee K, Puglisi SJ, Muggli MD, Boucher C. Fast and accurate correction of optical mapping data via spaced seeds. Bioinformatics 2020; 36:682-689. [PMID: 31504206 PMCID: PMC7005598 DOI: 10.1093/bioinformatics/btz663] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/10/2019] [Revised: 07/25/2019] [Accepted: 08/30/2019] [Indexed: 11/24/2022]
Abstract
Motivation Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, makes their assembly and alignment challenging. Although there exists another method to error correct Rmap data, named cOMet, it is unable to scale to even moderately large genomes. The challenge faced in error correction is in determining pairs of Rmaps that originate from the same region of the same genome. Results We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaptation and application of spaced seeds in the context of optical mapping, which allows for spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods but in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet. Availability and implementation Elmeri is publicly available under GNU Affero General Public License at https://github.com/LeenaSalmela/Elmeri. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Helsinki 00100, Finland
- Kingshuk Mukherjee
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
- Simon J Puglisi
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Helsinki 00100, Finland
- Martin D Muggli
  - Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
8
Yuan Y, Chung CYL, Chan TF. Advances in optical mapping for genomic research. Comput Struct Biotechnol J 2020; 18:2051-2062. [PMID: 32802277 PMCID: PMC7419273 DOI: 10.1016/j.csbj.2020.07.018] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 05/10/2020] [Revised: 07/08/2020] [Accepted: 07/24/2020] [Indexed: 12/28/2022]
Abstract
Recent advances in optical mapping have allowed the construction of improved genome assemblies with greater contiguity. Optical mapping also enables genome comparison and identification of large-scale structural variations. Association of these large-scale genomic features with biological functions is an important goal in plant and animal breeding and in medical research. Optical mapping has also been used in microbiology and still plays an important role in strain typing and epidemiological studies. Here, we review the development of optical mapping in recent decades to illustrate its importance in genomic research. We detail its applications and algorithms to show its specific advantages. Finally, we discuss the challenges required to facilitate the optimization of optical mapping and improve its future development and application.
Key Words
- 3D, three-dimensional
- DBG, de Bruijn graph
- DLS, direct label and stain
- DNA, deoxyribonucleic acid
- Genome assembly
- Hi-C, high-throughput chromosome conformation capture
- Mb, million base pair
- Next generation sequencing
- OLC, overlap-layout-consensus
- Optical mapping
- PCR, polymerase chain reaction
- PacBio, Pacific Biosciences
- SRS, short-read sequencing
- SV, structural variation
- Structural variation
- bp, base pair
- kb, kilobase pair
Affiliation(s)
- Yuxuan Yuan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
  - AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
- Claire Yik-Lok Chung
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
- Ting-Fung Chan
  - School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
  - State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
  - AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
9
Abstract
BACKGROUND The long reads produced by third generation sequencing technologies have significantly boosted the results of genome assembly but still, genome-wide assemblies solely based on read data cannot be produced. Thus, for example, optical mapping data has been used to further improve genome assemblies but it has mostly been applied in a post-processing stage after contig assembly. RESULTS We propose OPTICALKERMIT which directly integrates genome wide optical maps into contig assembly. We show how genome wide optical maps can be used to localize reads on the genome and then we adapt the Kermit method, which originally incorporated genetic linkage maps to the miniasm assembler, to use this information in contig assembly. Our experimental results show that incorporating genome wide optical maps to the contig assembly of miniasm increases NGA50 while the number of misassemblies decreases or stays the same. Furthermore, when compared to the Canu assembler, OPTICALKERMIT produces an assembly with almost three times higher NGA50 with a lower number of misassemblies on real A. thaliana reads. CONCLUSIONS OPTICALKERMIT successfully incorporates optical mapping data directly to contig assembly of eukaryotic genomes. Our results show that this is a promising approach to improve the contiguity of genome assemblies.
Affiliation(s)
- Miika Leinonen
  - Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Pietari Kalmin katu 5, Helsinki, Finland
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Pietari Kalmin katu 5, Helsinki, Finland
10
Mukherjee K, Alipanahi B, Kahveci T, Salmela L, Boucher C. Aligning optical maps to de Bruijn graphs. Bioinformatics 2020; 35:3250-3256. [PMID: 30698651 DOI: 10.1093/bioinformatics/btz069] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/05/2018] [Revised: 12/31/2018] [Accepted: 01/25/2019] [Indexed: 11/14/2022]
Abstract
MOTIVATION Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have been mainly used in a post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single molecule optical maps, called Rmaps, or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data, optical maps and sequence data, which will facilitate the usage of optical maps in the sequence assembly step itself. RESULTS We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem, which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data. AVAILABILITY AND IMPLEMENTATION The software for aligning optical maps to de Bruijn graphs, omGraph, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/omGraph. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
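The "numeric representation" mentioned in the abstract above is a sequence of fragment lengths between enzyme recognition sites. As a rough illustration, the sketch below derives such a sequence from a DNA string by in-silico digestion. It is not taken from omGraph: the function name, the example sequence, and the recognition site are hypothetical, and the cut is placed at the start of each site for simplicity, whereas real enzymes cut or label at an enzyme-specific offset.

```python
def in_silico_digest(sequence, site):
    """Return the fragment lengths obtained by cutting `sequence`
    at the start of every occurrence of the recognition `site`."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start)
        # Search again from the next position so overlapping sites are found.
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Drop zero-length fragments (e.g. when a site sits at position 0).
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


# Toy example: two occurrences of the hypothetical site "GGATCC".
toy_sequence = "AAGGATCCTTTTGGATCCA"
toy_fragments = in_silico_digest(toy_sequence, "GGATCC")
```

Measured Rmaps differ from such idealized fragment lists by sizing error and by missing or spurious sites, which is what makes their alignment nontrivial.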
Affiliation(s)
- Kingshuk Mukherjee
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
- Bahar Alipanahi
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
- Tamer Kahveci
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland
- Christina Boucher
  - Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, USA
11
Rideau F, Le Roy C, Sagné E, Renaudin H, Pereyre S, Henrich B, Dordet-Frisoni E, Citti C, Lartigue C, Bébéar C. Random transposon insertion in the Mycoplasma hominis minimal genome. Sci Rep 2019; 9:13554. [PMID: 31537861 PMCID: PMC6753208 DOI: 10.1038/s41598-019-49919-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/02/2019] [Accepted: 08/30/2019] [Indexed: 11/09/2022]
Abstract
Mycoplasma hominis is an opportunistic human pathogen associated with genital and neonatal infections. Until this study, the lack of a reliable transformation method for the genetic manipulation of M. hominis hindered the investigation of the pathogenicity and the peculiar arginine-based metabolism of this bacterium. A genomic analysis of 20 different M. hominis strains revealed a number of putative restriction-modification systems in this species. Despite the presence of these systems, a reproducible polyethylene glycol (PEG)-mediated transformation protocol was successfully developed in this study for three different strains: two clinical isolates and the M132 reference strain. Transformants were generated by transposon mutagenesis with an efficiency of approximately 10⁻⁹ transformants/cell/µg plasmid and were shown to carry single or multiple mini-transposons randomly inserted within their genomes. One M132 mutant was observed to carry a single-copy transposon inserted within the gene encoding P75, a protein potentially involved in adhesion. However, no difference in adhesion was observed in cell assays between this mutant and the M132 parent strain. Whole genome sequencing of mutants carrying multiple copies of the transposon further revealed the occurrence of genomic rearrangements. Overall, this is the first time that genetically modified strains of M. hominis have been obtained by random mutagenesis using a mini-transposon conferring resistance to tetracycline.
Affiliation(s)
- Fabien Rideau
  - University of Bordeaux, USC-EA3671 Mycoplasmal and Chlamydial Infections in Humans, Bordeaux, France; INRA, USC-EA3671 Mycoplasmal and Chlamydial Infections in Humans, Bordeaux, France
- Chloé Le Roy
  - University of Bordeaux, USC-EA3671 Mycoplasmal and Chlamydial Infections in Humans, Bordeaux, France; INRA, USC-EA3671 Mycoplasmal and Chlamydial Infections in Humans, Bordeaux, France
- Eveline Sagné
  - IHAP, Université de Toulouse, INRA, ENVT, Toulouse, France
- Hélène Renaudin
  - University of Bordeaux, USC-EA3671 Mycoplasmal and Chlamydial Infections in Humans, Bordeaux, France; INRA, USC-EA3671 Mycoplasmal and Chlamydial Infections in Humans, Bordeaux, France
- Sabine Pereyre
  - University of Bordeaux, USC-EA3671 Mycoplasmal and Chlamydial Infections in Humans, Bordeaux, France; INRA, USC-EA3671 Mycoplasmal and Chlamydial Infections in Humans, Bordeaux, France
- Birgit Henrich
  - Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University, Düsseldorf, Germany
- Carole Lartigue
  - INRA, UMR 1332 de Biologie du Fruit et Pathologie, F-33140 Villenave d'Ornon, Gironde, France; University of Bordeaux, UMR 1332 de Biologie du Fruit et Pathologie, F-33140 Villenave d'Ornon, Gironde, France
- Cécile Bébéar
  - University of Bordeaux, USC-EA3671 Mycoplasmal and Chlamydial Infections in Humans, Bordeaux, France; INRA, USC-EA3671 Mycoplasmal and Chlamydial Infections in Humans, Bordeaux, France
12
Walve R, Rastas P, Salmela L. Kermit: linkage map guided long read assembly. Algorithms Mol Biol 2019; 14:8. [PMID: 30930956 PMCID: PMC6425630 DOI: 10.1186/s13015-019-0143-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/25/2018] [Accepted: 03/13/2019] [Indexed: 12/30/2022]
Abstract
BACKGROUND With long reads getting even longer and cheaper, large scale sequencing projects can be accomplished without short reads at an affordable cost. Due to the high error rates and less mature tools, de novo assembly of long reads is still challenging and often results in a large collection of contigs. Dense linkage maps are collections of markers whose location on the genome is approximately known. Therefore they provide long range information that has the potential to greatly aid in de novo assembly. Previously, linkage maps have been used to detect misassemblies and to manually order contigs. However, no fully automated tools exist to incorporate linkage maps in assembly; instead, a large amount of manual labour is needed to order the contigs into chromosomes. RESULTS We formulate the genome assembly problem in the presence of linkage maps and present the first method for guided genome assembly using linkage maps. Our method is based on an additional cleaning step added to the assembly. We show that it can simplify the underlying assembly graph, resulting in more contiguous assemblies and reducing the number of misassemblies when compared to de novo assembly. CONCLUSIONS We present the first method to integrate linkage maps directly into genome assembly. With a modest increase in runtime, our method improves contiguity and correctness of genome assembly.
Affiliation(s)
- Riku Walve
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland
- Pasi Rastas
  - Institute of Biotechnology, University of Helsinki, Helsinki, Finland
- Leena Salmela
  - Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland
13
Anderson TJC, LoVerde PT, Le Clec'h W, Chevalier FD. Genetic Crosses and Linkage Mapping in Schistosome Parasites. Trends Parasitol 2018; 34:982-996. [PMID: 30150002 DOI: 10.1016/j.pt.2018.08.001] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 06/20/2018] [Revised: 07/27/2018] [Accepted: 08/02/2018] [Indexed: 12/14/2022]
Abstract
Linkage mapping - utilizing experimental genetic crosses to examine cosegregation of phenotypic traits with genetic markers - is now 100 years old. Schistosome parasites are exquisitely well suited to linkage mapping approaches because genetic crosses can be conducted in the laboratory, thousands of progeny are produced, and elegant experimental work over the last 75 years has revealed heritable genetic variation in multiple biomedically important traits such as drug resistance, host specificity, and virulence. Application of this approach is timely because the improved genome assembly for Schistosoma mansoni and developing molecular toolkit for schistosomes increase our ability to link phenotype with genotype. We describe current progress and potential future directions of linkage mapping in schistosomes.
Affiliation(s)
- Winka Le Clec'h
  - Texas Biomedical Research Institute, San Antonio, Texas 78227, USA
14
Leung AKY, Kwok TP, Wan R, Xiao M, Kwok PY, Yip KY, Chan TF. OMBlast: alignment tool for optical mapping using a seed-and-extend approach. Bioinformatics 2018; 33:311-319. [PMID: 28172448 PMCID: PMC5409310 DOI: 10.1093/bioinformatics/btw620] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 09/26/2015] [Revised: 08/31/2016] [Accepted: 09/26/2016] [Indexed: 11/15/2022]
Abstract
Motivation: Optical mapping is a technique for capturing fluorescent signal patterns of long DNA molecules (in the range of 0.1–1 Mbp). Recently, it has been complementing the widely used short-read sequencing technology by assisting with scaffolding and detecting large and complex structural variations (SVs). Here, we introduce a fast, robust and accurate tool called OMBlast for aligning optical maps, the set of signal locations on the molecules generated from optical mapping. Our method is based on the seed-and-extend approach from sequence alignment, with modifications specific to optical mapping. Results: Experiments with both synthetic and our real data demonstrate that OMBlast has higher accuracy and faster mapping speed than existing alignment methods. Our tool also shows significant improvement when aligning data with SVs. Availability and Implementation: OMBlast is implemented for Java 1.7 and is released under a GPL license. OMBlast can be downloaded from https://github.com/aldenleung/OMBlast and run directly on machines equipped with a Java virtual machine. Supplementary information: Supplementary data are available at Bioinformatics online.
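To make the idea of an optical map as a set of signal locations concrete, the following minimal Python sketch derives such a map from a known sequence by in silico digestion. It is an illustration only and not OMBlast's code: the function names, the toy sequence and the choice of GAATTC as the recognition site are assumptions, and real optical maps carry sizing errors and missing or extra signals that this sketch ignores.

```python
def in_silico_map(sequence, site):
    """Return the 0-based start positions of every occurrence of `site`.

    These positions play the role of signal locations on a molecule.
    (Hypothetical helper for illustration; overlapping sites are reported.)
    """
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(positions, total_length):
    """Convert signal positions into the distances between consecutive signals."""
    bounds = [0] + list(positions) + [total_length]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Toy molecule (assumed data) containing two GAATTC sites.
molecule = "AAGAATTCTTTTGAATTCGG"
signals = in_silico_map(molecule, "GAATTC")
fragments = fragment_lengths(signals, len(molecule))
print(signals)    # [2, 12]
print(fragments)  # [2, 10, 8]
```

An aligner for optical maps typically compares fragment-length lists like `fragments` rather than nucleotide strings, which is why error-tolerant size comparison matters more than exact matching.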
Affiliation(s)
- Tsz-Piu Kwok
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
- Raymond Wan
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong, China
- Ming Xiao
- School of Biomedical Engineering, Science and Health System, Drexel University, Philadelphia, PA, USA
- Pui-Yan Kwok
- Institute for Human Genetics; Cardiovascular Research Institute, University of California San Francisco, San Francisco, CA, USA
- Kevin Y Yip
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China; Hong Kong Bioinformatics Centre
- Ting-Fung Chan
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong, China; Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China; Centre for Soybean Research, State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong, China; Hong Kong Bioinformatics Centre
15
Li L, Leung AKY, Kwok TP, Lai YYY, Pang IK, Chung GTY, Mak ACY, Poon A, Chu C, Li M, Wu JJK, Lam ET, Cao H, Lin C, Sibert J, Yiu SM, Xiao M, Lo KW, Kwok PY, Chan TF, Yip KY. OMSV enables accurate and comprehensive identification of large structural variations from nanochannel-based single-molecule optical maps. Genome Biol 2017; 18:230. [PMID: 29195502 PMCID: PMC5709945 DOI: 10.1186/s13059-017-1356-2] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 06/14/2017] [Accepted: 11/03/2017] [Indexed: 12/20/2022] Open
Abstract
We present a new method, OMSV, for accurately and comprehensively identifying structural variations (SVs) from optical maps. OMSV detects both homozygous and heterozygous SVs, SVs of various types and sizes, and SVs with or without creating or destroying restriction sites. We show that OMSV has high sensitivity and specificity, with clear performance gains over the latest method. Applying OMSV to a human cell line, we identified hundreds of SVs >2 kbp, with 68% of them missed by sequencing-based callers. Independent experimental validation confirmed the high accuracy of these SVs. The OMSV software is available at http://yiplab.cse.cuhk.edu.hk/omsv/.
Affiliation(s)
- Le Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Alden King-Yung Leung
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Tsz-Piu Kwok
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Yvonne Y Y Lai
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA
- Iris K Pang
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Grace Tin-Yun Chung
- Department of Anatomical and Cellular Pathology, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Angel C Y Mak
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA
- Annie Poon
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA
- Catherine Chu
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA
- Menglu Li
- Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong
- Jacob J K Wu
- Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong
- Han Cao
- BioNano Genomics, San Diego, California, USA
- Chin Lin
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA
- Justin Sibert
- School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, Pennsylvania, USA
- Siu-Ming Yiu
- Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong
- Ming Xiao
- School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, Pennsylvania, USA
- Kwok-Wai Lo
- Department of Anatomical and Cellular Pathology, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Pui-Yan Kwok
- Cardiovascular Research Institute, University of California San Francisco, San Francisco, California, USA; Institute for Human Genetics, University of California San Francisco, San Francisco, California, USA
- Ting-Fung Chan
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; Hong Kong Institute of Diabetes and Obesity, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
- Kevin Y Yip
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; Hong Kong Institute of Diabetes and Obesity, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
16
Jiao WB, Schneeberger K. The impact of third generation genomic technologies on plant genome assembly. Curr Opin Plant Biol 2017; 36:64-70. [PMID: 28231512 DOI: 10.1016/j.pbi.2017.02.002] [Citation(s) in RCA: 115] [Impact Index Per Article: 16.4] [Received: 11/17/2016] [Revised: 02/06/2017] [Accepted: 02/07/2017] [Indexed: 05/20/2023]
Abstract
Since the introduction of next generation sequencing, plant genome assembly projects no longer need to rely on dedicated research facilities or community-wide consortia; even individual research groups can sequence and assemble the genomes they are interested in. However, such assemblies are typically not based on the entire breadth of genomic technologies, including genetic and physical maps, and their contiguities tend to be low compared to the full-length gold standard reference sequences. Recently emerging third generation genomic technologies like long-read sequencing or optical mapping promise to bridge this quality gap and enable simple and cost-effective solutions for chromosomal-level assemblies.
Affiliation(s)
- Wen-Biao Jiao
- Max Planck Institute for Plant Breeding Research, Department of Plant Developmental Biology, Genome Plasticity and Computational Genomics, Cologne, Germany
- Korbinian Schneeberger
- Max Planck Institute for Plant Breeding Research, Department of Plant Developmental Biology, Genome Plasticity and Computational Genomics, Cologne, Germany.
17
The hidden perils of read mapping as a quality assessment tool in genome sequencing. Sci Rep 2017; 7:43149. [PMID: 28225089 PMCID: PMC5320493 DOI: 10.1038/srep43149] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 08/25/2016] [Accepted: 01/20/2017] [Indexed: 11/16/2022] Open
Abstract
This article provides a comparative analysis of the various methods of genome sequencing focusing on verification of the assembly quality. The results of a comparative assessment of various de novo assembly tools, as well as sequencing technologies, are presented using a recently completed sequence of the genome of Lactobacillus fermentum 3872. In particular, quality of assemblies is assessed by using CLC Genomics Workbench read mapping and Optical mapping developed by OpGen. Over-extension of contigs without prior knowledge of contig location can lead to misassembled contigs, even when commonly used quality indicators such as read mapping suggest that a contig is well assembled. Precautions must also be undertaken when using long read sequencing technology, which may also lead to misassembled contigs.
18
Chaney L, Sharp AR, Evans CR, Udall JA. Genome Mapping in Plant Comparative Genomics. Trends Plant Sci 2016; 21:770-780. [PMID: 27289181 DOI: 10.1016/j.tplants.2016.05.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 01/04/2016] [Revised: 04/27/2016] [Accepted: 05/12/2016] [Indexed: 05/10/2023]
Abstract
Genome mapping produces fingerprints of DNA sequences to construct a physical map of the whole genome. It provides contiguous, long-range information that complements and, in some cases, replaces sequencing data. Recent advances in genome-mapping technology will better allow researchers to detect large (>1 kbp) structural variations between plant genomes. Some molecular and informatics complications need to be overcome for this novel technology to achieve its full utility. This technology will be useful for understanding phenotype responses due to DNA rearrangements and will yield insights into genome evolution, particularly in polyploids. In this review, we outline recent advances in genome-mapping technology, including the processes required for data collection and analysis, and applications in plant comparative genomics.
Affiliation(s)
- Lindsay Chaney
- Plant and Wildlife Sciences Department, Brigham Young University, Provo, UT 84602, USA
- Aaron R Sharp
- Plant and Wildlife Sciences Department, Brigham Young University, Provo, UT 84602, USA
- Carrie R Evans
- Plant and Wildlife Sciences Department, Brigham Young University, Provo, UT 84602, USA
- Joshua A Udall
- Plant and Wildlife Sciences Department, Brigham Young University, Provo, UT 84602, USA.
19
Verzotto D, Teo ASM, Hillmer AM, Nagarajan N. OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis. Gigascience 2016; 5:2. [PMID: 26793302 PMCID: PMC4719737 DOI: 10.1186/s13742-016-0110-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/09/2015] [Accepted: 01/06/2016] [Indexed: 12/16/2022] Open
Abstract
Background: Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilo base pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive map-alignment algorithms for robustly aligning error-prone maps to sequences. Results: We introduce a novel seed-and-extend glocal (short for global-local) alignment method, OPTIMA (and a sliding-window extension for overlap alignment, OPTIMA-Overlap), which is the first to create indexes for continuous-valued mapping data while accounting for mapping errors. We also present a novel statistical model, agnostic with respect to technology-dependent error rates, for conservatively evaluating the significance of alignments without relying on expensive permutation-based tests. Conclusions: We show that OPTIMA and OPTIMA-Overlap outperform other state-of-the-art approaches (1.6-2 times more sensitive) and are more efficient (170-200%) and precise in their alignments (nearly 99% precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust, provide improved sensitivity and guarantee high precision.
Affiliation(s)
- Davide Verzotto
- Computational and Systems Biology, Genome Institute of Singapore, 60 Biopolis Street, Singapore, 138672 Singapore
- Audrey S. M. Teo
- Cancer Therapeutics and Stratified Oncology, Genome Institute of Singapore, 60 Biopolis Street, Singapore, 138672 Singapore
- Axel M. Hillmer
- Cancer Therapeutics and Stratified Oncology, Genome Institute of Singapore, 60 Biopolis Street, Singapore, 138672 Singapore
- Niranjan Nagarajan
- Computational and Systems Biology, Genome Institute of Singapore, 60 Biopolis Street, Singapore, 138672 Singapore
20
Towards a More Accurate Error Model for BioNano Optical Maps. Bioinformatics Research and Applications 2016. [DOI: 10.1007/978-3-319-38782-6_6] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 12/13/2022]
21
Yuan B, Liu P, Gupta A, Beck CR, Tejomurtula A, Campbell IM, Gambin T, Simmons AD, Withers MA, Harris RA, Rogers J, Schwartz DC, Lupski JR. Comparative Genomic Analyses of the Human NPHP1 Locus Reveal Complex Genomic Architecture and Its Regional Evolution in Primates. PLoS Genet 2015; 11:e1005686. [PMID: 26641089 PMCID: PMC4671654 DOI: 10.1371/journal.pgen.1005686] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 01/12/2015] [Accepted: 10/29/2015] [Indexed: 11/30/2022] Open
Abstract
Many loci in the human genome harbor complex genomic structures that can result in susceptibility to genomic rearrangements leading to various genomic disorders. Nephronophthisis 1 (NPHP1, MIM# 256100) is an autosomal recessive disorder that can be caused by defects of NPHP1; the gene maps within the human 2q13 region where low copy repeats (LCRs) are abundant. Loss of function of NPHP1 is responsible for approximately 85% of the NPHP1 cases; about 80% of such individuals carry a large recurrent homozygous NPHP1 deletion that occurs via nonallelic homologous recombination (NAHR) between two flanking directly oriented ~45 kb LCRs. Published data revealed a non-pathogenic inversion polymorphism involving the NPHP1 gene flanked by two inverted ~358 kb LCRs. Using optical mapping and array-comparative genomic hybridization, we identified three potential novel structural variant (SV) haplotypes at the NPHP1 locus that may protect a haploid genome from the NPHP1 deletion. Inter-species comparative genomic analyses among primate genomes revealed massive genomic changes during evolution. The aggregated data suggest that dynamic genomic rearrangements occurred historically within the NPHP1 locus and generated SV haplotypes observed in the human population today, which may confer differential susceptibility to genomic instability and the NPHP1 deletion within a personal genome. Our study documents diverse SV haplotypes at a complex LCR-laden human genomic region. Comparative analyses provide a model for how this complex region arose during primate evolution, and studies among humans suggest that intra-species polymorphism may potentially modulate an individual’s susceptibility to acquiring disease-associated alleles.
Genomic instability due to the intrinsic sequence architecture of the genome, such as low copy repeats (LCRs), is a major contributor to de novo mutations that can occur in the process of human genome evolution. LCRs can mediate genomic rearrangements associated with genomic disorders by acting as substrates for nonallelic homologous recombination. Juvenile-onset nephronophthisis 1 is the most frequent genetic cause of renal failure in children. An LCR-mediated, homozygous common recurrent deletion encompassing NPHP1 is found in the majority of affected subjects, while heterozygous deletion representing the nephronophthisis 1 recessive carrier state is frequently observed amongst world populations. Interestingly, the human NPHP1 locus is located proximal to the head-to-head fusion site of two ancestral chromosomes that occurred in the great apes, which resulted in a reduction of chromosome number from 48 in nonhuman primates to the current 46 in humans. In this study, we characterized and provided evidence for the diverse genomic architecture at the NPHP1 locus and potential structural variant haplotypes in the human population. Furthermore, our analyses of primate genomes shed light on the massive changes of genomic architecture at the human NPHP1 locus and delineated a model for the emergence of the LCRs during primate evolution.
Affiliation(s)
- Bo Yuan
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Pengfei Liu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Aditya Gupta
- Laboratory for Molecular and Computational Genomics, Department of Chemistry, Laboratory of Genetics and The UW-Biotechnology Center, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Christine R. Beck
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Anusha Tejomurtula
- Graduate Program in Diagnostic Genetics, School of Health Professions, University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Ian M. Campbell
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Tomasz Gambin
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Alexandra D. Simmons
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Marjorie A. Withers
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- R. Alan Harris
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
- Jeffrey Rogers
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
- David C. Schwartz
- Laboratory for Molecular and Computational Genomics, Department of Chemistry, Laboratory of Genetics and The UW-Biotechnology Center, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- James R. Lupski
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
- Department of Pediatrics, Baylor College of Medicine, Houston, Texas, United States of America
- Texas Children’s Hospital, Houston, Texas, United States of America
22
Muggli MD, Puglisi SJ, Ronen R, Boucher C. Misassembly detection using paired-end sequence reads and optical mapping data. Bioinformatics 2015; 31:i80-8. [PMID: 26072512 PMCID: PMC4542784 DOI: 10.1093/bioinformatics/btv262] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Indexed: 11/22/2022] Open
Abstract
Motivation: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called misSEQuel that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open source computational methods for analyzing optical mapping data. We apply our method to various assemblies of the loblolly pine, Francisella tularensis, rice and budgerigar genomes. We generated and used simulated optical mapping data for loblolly pine and F. tularensis and used real optical mapping data for rice and budgerigar. Results: Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in assemblies of F. tularensis and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in assemblies of loblolly pine. Using the real optical mapping data, we correctly identified 75% of extensively misassembled contigs and 100% of locally misassembled contigs in rice, and 77% of extensively misassembled contigs and 80% of locally misassembled contigs in budgerigar. Availability and implementation: misSEQuel can be used as a post-processing step in combination with any genome assembler and is freely available at http://www.cs.colostate.edu/seq/. Contact: muggli@cs.colostate.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin D Muggli
- Department of Computer Science, Colorado State University, Fort Collins, CO 80526, USA, Department of Computer Science, University of Helsinki, Finland and Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
- Simon J Puglisi
- Department of Computer Science, Colorado State University, Fort Collins, CO 80526, USA, Department of Computer Science, University of Helsinki, Finland and Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
- Roy Ronen
- Department of Computer Science, Colorado State University, Fort Collins, CO 80526, USA, Department of Computer Science, University of Helsinki, Finland and Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
- Christina Boucher
- Department of Computer Science, Colorado State University, Fort Collins, CO 80526, USA, Department of Computer Science, University of Helsinki, Finland and Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
23
Pendleton M, Sebra R, Pang AWC, Ummat A, Franzen O, Rausch T, Stütz AM, Stedman W, Anantharaman T, Hastie A, Dai H, Fritz MHY, Cao H, Cohain A, Deikus G, Durrett RE, Blanchard SC, Altman R, Chin CS, Guo Y, Paxinos EE, Korbel JO, Darnell RB, McCombie WR, Kwok PY, Mason CE, Schadt EE, Bashir A. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat Methods 2015; 12:780-6. [PMID: 26121404 PMCID: PMC4646949 DOI: 10.1038/nmeth.3454] [Citation(s) in RCA: 334] [Impact Index Per Article: 37.1] [Received: 12/05/2014] [Accepted: 05/28/2015] [Indexed: 12/30/2022]
Abstract
We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality.
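For readers unfamiliar with the scaffold N50 statistic quoted above, a minimal Python sketch of the usual definition follows: N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. The function name and the toy lengths are assumptions for illustration and are unrelated to the paper's data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    at which the running sum first reaches half of the total length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly (assumed lengths in arbitrary units): total is 20,
# and the two longest scaffolds (6 + 5 = 11) already cover half of it.
print(n50([2, 3, 4, 5, 6]))  # 5
```

Because N50 depends only on the length distribution, it measures contiguity and says nothing about whether the scaffolds are correctly assembled.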
Affiliation(s)
- Matthew Pendleton
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Robert Sebra
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Ajay Ummat
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Oscar Franzen
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Tobias Rausch
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Adrian M Stütz
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Alex Hastie
- BioNano Genomics, San Diego, California, USA
- Heng Dai
- BioNano Genomics, San Diego, California, USA
- Han Cao
- BioNano Genomics, San Diego, California, USA
- Ariella Cohain
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Gintaras Deikus
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Russell E Durrett
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York, USA
- Scott C Blanchard
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA
- Roger Altman
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York, USA
- Yan Guo
- Pacific Biosciences, Menlo Park, California, USA
- Jan O Korbel
- [1] Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany. [2] European Bioinformatics Institute, European Molecular Biology Laboratory, Hinxton, UK
- Robert B Darnell
- [1] Laboratory of Neuro-Oncology, The Rockefeller University, New York, New York, USA. [2] Howard Hughes Medical Institute, New York, New York, USA
- W Richard McCombie
- [1] The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA. [2] The Watson School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
- Pui-Yan Kwok
- Institute for Human Genetics, University of California-San Francisco, San Francisco, California, USA
- Christopher E Mason
- [1] The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York, USA. [2] Department of Medicine, Division of Hematology/Oncology, Weill Cornell Medical College, New York, New York, USA. [3] The Feil Family Brain and Mind Research Institute, Weill Cornell Medical College, New York, New York, USA
- Eric E Schadt
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Ali Bashir
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
24
Abstract
Optical mapping has been widely used to improve de novo plant genome assemblies, including rice, maize, Medicago, Amborella, tomato and wheat, with more genomes in the pipeline. Optical mapping provides long-range information of the genome and can more easily identify large structural variations. The ability of optical mapping to assay long single DNA molecules nicely complements short-read sequencing which is more suitable for the identification of small and short-range variants. Direct use of optical mapping to study population-level genetic diversity is currently limited to microbial strain typing and human diversity studies. Nonetheless, optical mapping shows great promise in the study of plant trait development, domestication and polyploid evolution. Here we review the current applications and future prospects of optical mapping in the field of plant comparative genomics.
Affiliation(s)
- Haibao Tang
- Center for Genomics and Biotechnology, Fujian Agriculture and Forestry University, Fuzhou, 350002, Fujian, People's Republic of China; School of Plant Sciences, iPlant Collaborative, University of Arizona, Tucson, AZ 85721, USA
- Eric Lyons
- School of Plant Sciences, iPlant Collaborative, University of Arizona, Tucson, AZ 85721, USA
25
Abstract
Optical mapping has been widely used to improve de novo plant genome assemblies, including rice, maize, Medicago, Amborella, tomato and wheat, with more genomes in the pipeline. Optical mapping provides long-range information of the genome and can more easily identify large structural variations. The ability of optical mapping to assay long single DNA molecules nicely complements short-read sequencing which is more suitable for the identification of small and short-range variants. Direct use of optical mapping to study population-level genetic diversity is currently limited to microbial strain typing and human diversity studies. Nonetheless, optical mapping shows great promise in the study of plant trait development, domestication and polyploid evolution. Here we review the current applications and future prospects of optical mapping in the field of plant comparative genomics.
26
Hormozdiari F, Eskin E. Memory efficient assembly of human genome. J Bioinform Comput Biol 2015; 13:1550008. [PMID: 25603998 DOI: 10.1142/s0219720015500080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The ability to detect the genetic variations between two individuals is an essential component of genetic studies. In these studies, obtaining the genome sequence of both individuals is the first step toward solving the variation detection problem. The emergence of high-throughput sequencing (HTS) technology has made DNA sequencing practical, and it is widely used by diagnosticians to increase their knowledge about the causal factors in genetic diseases. As HTS advances, more data are generated every day than scientists can process. Genome assembly is one of the existing methods to tackle the variation detection problem. The de Bruijn graph formulation of the assembly problem is widely used in the field. Furthermore, it is the only method which can assemble any genome in linear time. However, it requires an enormous amount of memory in order to assemble any mammalian-size genome. The high demand for sequencing more individuals and the urge to assemble them are the driving forces for a memory-efficient assembler. In this work, we propose a novel method which builds the de Bruijn graph while consuming less memory. Moreover, our proposed method can reduce the memory usage by 37% compared to the existing methods. In addition, we used a real data set (chromosome 17 of A/J strain) to illustrate the performance of our method.
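As background for the de Bruijn graph formulation mentioned above, here is a minimal Python sketch of the textbook construction, in which every k-mer in the reads contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. This is the plain in-memory version, not the memory-efficient method the authors propose; the function name, the toy reads and k = 3 are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph as adjacency lists of (k-1)-mers.

    Each k-mer adds one edge prefix -> suffix, so a k-mer seen in several
    reads appears as a repeated (multi-)edge. Reads shorter than k add nothing.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (assumed data).
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
# AC -> ['CG']
# CG -> ['GT', 'GT']
# GT -> ['TC', 'TC']
# TC -> ['CA']
```

Storing every node and edge explicitly like this is what makes the naive structure memory-hungry for mammalian-size genomes, which is the cost the abstract says the proposed method reduces.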
Affiliation(s)
- Farhad Hormozdiari
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA
|
27
|
Abstract
The Genome 10K Project was established in 2009 by a consortium of biologists and genome scientists determined to facilitate the sequencing and analysis of the complete genomes of 10,000 vertebrate species. Since then the number of selected and initiated species has risen from ∼26 to 277 sequenced or ongoing with funding, an approximately tenfold increase in five years. Here we summarize the advances and commitments that have occurred by mid-2014 and outline the achievements and present challenges of reaching the 10,000-species goal. We summarize the status of known vertebrate genome projects, recommend standards for pronouncing a genome as sequenced or completed, and provide our present and future vision of the landscape of Genome 10K. The endeavor is ambitious, bold, expensive, and uncertain, but together the Genome 10K Consortium of Scientists and the worldwide genomics community are moving toward their goal of delivering to the coming generation the gift of genome empowerment for many vertebrate species.
Affiliation(s)
- Klaus-Peter Koepfli
- Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, 199034 St. Petersburg, Russian Federation;
|
28
|
Mendelowitz L, Pop M. Computational methods for optical mapping. Gigascience 2014; 3:33. [PMID: 25671093 PMCID: PMC4323141 DOI: 10.1186/2047-217x-3-33] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 09/22/2014] [Accepted: 12/02/2014] [Indexed: 11/10/2022] Open
Abstract
Optical mapping and newer genome mapping technologies based on nicking enzymes provide low resolution but long-range genomic information. The optical mapping technique has been successfully used for assessing the quality of genome assemblies and for detecting large-scale structural variants and rearrangements that cannot be detected using current paired end sequencing protocols. Here, we review several algorithms and methods for building consensus optical maps and aligning restriction patterns to a reference map, as well as methods for using optical maps with sequence assemblies.
Affiliation(s)
- Lee Mendelowitz
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD USA ; Applied Math & Statistics, and Scientific Computation, University of Maryland, College Park, MD USA
| | - Mihai Pop
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD USA ; Department of Computer Science, University of Maryland, College Park, MD USA
|
29
|
Shariat B, Movahedi NS, Chitsaz H, Boucher C. HyDA-Vista: towards optimal guided selection of k-mer size for sequence assembly. BMC Genomics 2014; 15 Suppl 10:S9. [PMID: 25558875 PMCID: PMC4304221 DOI: 10.1186/1471-2164-15-s10-s9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022] Open
Abstract
Motivation: Intimately tied to assembly quality is the complexity of the de Bruijn graph built by the assembler. Thus, there have been many paradigms developed to decrease the complexity of the de Bruijn graph. One obvious combinatorial paradigm for this is to allow the value of k to vary; having a larger value of k where the graph is more complex and a smaller value of k where the graph would likely contain fewer spurious edges and vertices. One open problem that affects the practicality of this method is how to predict the value of k prior to building the de Bruijn graph. We show that optimal values of k can be predicted prior to assembly by using the information contained in a phylogenetically-close genome and therefore, help make the use of multiple values of k practical for genome assembly.
Results: We present HyDA-Vista, which is a genome assembler that uses homology information to choose a value of k for each read prior to the de Bruijn graph construction. The chosen k is optimal if there are no sequencing errors and the coverage is sufficient. Fundamental to our method is the construction of the maximal sequence landscape, which is a data structure that stores for each position in the input string, the largest repeated substring containing that position. In particular, we show the maximal sequence landscape can be constructed in O(n + n log n)-time and O(n)-space. HyDA-Vista first constructs the maximal sequence landscape for a homologous genome. The reads are then aligned to this reference genome, and values of k are assigned to each read using the maximal sequence landscape and the alignments. Eventually, all the reads are assembled by an iterative de Bruijn graph construction method. Our results and comparison to other assemblers demonstrate that HyDA-Vista achieves the best assembly of E. coli before repeat resolution or scaffolding.
Availability: HyDA-Vista is freely available [1]. The code for constructing the maximal sequence landscape and choosing the optimal value of k for each read is also separately available on the website and could be incorporated into any genome assembler.
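To make the definition of the maximal sequence landscape concrete, here is a naive sketch that records, for each position, the length of the longest repeated substring covering it. It is a brute-force illustration only and does not reproduce the efficient construction described in the abstract. Three points are assumptions made for the example: the function name `maximal_sequence_landscape`, storing lengths rather than the substrings themselves, and counting overlapping occurrences as repeats.

```python
def maximal_sequence_landscape(s):
    """For each position in s, return the length of the longest substring
    that covers the position and occurs at least twice in s.

    Brute force (roughly cubic time); overlapping occurrences count as repeats.
    A value of 0 means no repeated substring covers that position.
    """
    n = len(s)
    landscape = [0] * n
    for start in range(n):
        for end in range(start + 1, n + 1):
            sub = s[start:end]
            first = s.find(sub)
            if s.find(sub, first + 1) == -1:
                # sub occurs only once, so no longer extension from
                # this start position can be repeated either.
                break
            length = end - start
            for pos in range(start, end):
                landscape[pos] = max(landscape[pos], length)
    return landscape


print(maximal_sequence_landscape("ABCABD"))  # [2, 2, 0, 2, 2, 0]
```

In the toy string, "AB" is the only repeated substring longer than one character, so the positions it covers get the value 2, while "C" and "D" occur once and get 0.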
|
30
|
Appels R, Nystrom-Persson J, Keeble-Gagnere G. Advances in genome studies in plants and animals. Funct Integr Genomics 2014; 14:1-9. [PMID: 24626952 PMCID: PMC3968518 DOI: 10.1007/s10142-014-0364-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 02/18/2014] [Accepted: 02/19/2014] [Indexed: 01/30/2023]
Abstract
The area of plant and animal genomics covers the entire suite of issues in biology because it aims to determine the structure and function of genetic material. Although specific issues define research advances at an organism level, it is evident that many of the fundamental features of genome structure and the translation of encoded information to function share common ground. The Plant and Animal Genome (PAG) conference, held in San Diego (California) in January each year, provides an overview across all organisms at the genome level, and often it is evident that investments in the human area provide leadership, applications, and discoveries for researchers studying other organisms. This mini-review uses the plenary lectures as a basis for summarizing the trends in genome-level studies of organisms; the lectures include presentations by Ewan Birney (EBI, UK), Eric Green (NIH, USA), John Butler (NIST, USA), Elaine Mardis (Washington, USA), Caroline Dean (John Innes Centre, UK), Trudy Mackay (NC State University, USA), Sue Wessler (UC Riverside, USA), and Patrick Wincker (Genoscope, France). The work reviewed is based on published papers. Where unpublished information is cited, permission to include the information in this manuscript was obtained from the presenters.
Affiliation(s)
- R Appels
- Veterinary and Life Sciences, Murdoch University, 90 South Street, Murdoch, Perth, WA, 6150, Australia,
|
31
|
|
32
|
Abstract
MOTIVATIONS: Recent progress in ancient DNA sequencing technologies and protocols has led to the sequencing of whole ancient bacterial genomes, as illustrated by the recent sequence of the Yersinia pestis strain that caused the Black Death pandemic. However, sequencing ancient genomes raises specific problems because of, among other factors, the decay and fragmentation of ancient DNA, which make the scaffolding of ancient contigs challenging.
RESULTS: We show that computational paleogenomics methods aimed at reconstructing the organization of ancestral genomes from the comparison of extant genomes can be adapted to correct, order and orient ancient bacterial contigs. We describe the method FPSAC (fast phylogenetic scaffolding of ancient contigs) and apply it to a set of 2134 ancient contigs assembled from the recently sequenced Black Death agent genome. We obtain a unique scaffold for the whole chromosome of this ancient genome, which allows us to gain precise insights into the structural evolution of the Yersinia clade.
Affiliation(s)
- Ashok Rajaraman
- Department of Mathematics, Simon Fraser University, Burnaby (BC) V5A1S6, Canada, International Graduate Training Center in Mathematical Biology, Pacific Institute for the Mathematical Sciences, Vancouver (BC), Canada, INRIA Grenoble Rhône-Alpes, Montbonnot 38334, France, Université de Lyon 1, Laboratoire de Biométrie et Biologie Évolutive, CNRS UMR5558 F-69622 Villeurbanne, France and LaBRI, Université Bordeaux I, 33405 Talence, France
|
33
|
Hastie AR, Dong L, Smith A, Finklestein J, Lam ET, Huo N, Cao H, Kwok PY, Deal KR, Dvorak J, Luo MC, Gu Y, Xiao M. Rapid genome mapping in nanochannel arrays for highly complete and accurate de novo sequence assembly of the complex Aegilops tauschii genome. PLoS One 2013; 8:e55864. [PMID: 23405223 PMCID: PMC3566107 DOI: 10.1371/journal.pone.0055864] [Citation(s) in RCA: 123] [Impact Index Per Article: 11.2] [Received: 08/09/2012] [Accepted: 01/03/2013] [Indexed: 02/04/2023] Open
Abstract
Next-generation sequencing (NGS) technologies have enabled high-throughput and low-cost generation of sequence data; however, de novo genome assembly remains a great challenge, particularly for large genomes. NGS short reads are often insufficient to create large contigs that span repeat sequences and to facilitate unambiguous assembly. Plant genomes are notorious for containing high quantities of repetitive elements, which, combined with huge genome sizes, have made accurate assembly of these large and complex genomes intractable thus far. Using two-color genome mapping of tiling bacterial artificial chromosome (BAC) clones on nanochannel arrays, we completed high-confidence assembly of a 2.1-Mb, highly repetitive region in the large and complex genome of Aegilops tauschii, the D-genome donor of hexaploid wheat (Triticum aestivum). Genome mapping is based on direct visualization of sequence motifs on single DNA molecules hundreds of kilobases in length. With the genome map as a scaffold, we anchored unplaced sequence contigs, validated the initial draft assembly, and resolved instances of misassembly, some involving contigs <2 kb long, to dramatically improve the assembly from 75% to 95% complete.
Affiliation(s)
- Alex R. Hastie
- BioNano Genomics, San Diego, California, United States of America
| | - Lingli Dong
- Genomics and Gene Discovery Research Unit, United States Department of Agriculture - Agricultural Research Service, Albany, California, United States of America
- Department of Plant Sciences, University of California Davis, Davis, California, United States of America
| | - Alexis Smith
- BioNano Genomics, San Diego, California, United States of America
| | - Jeff Finklestein
- BioNano Genomics, San Diego, California, United States of America
| | - Ernest T. Lam
- BioNano Genomics, San Diego, California, United States of America
| | - Naxin Huo
- Genomics and Gene Discovery Research Unit, United States Department of Agriculture - Agricultural Research Service, Albany, California, United States of America
- Department of Plant Sciences, University of California Davis, Davis, California, United States of America
| | - Han Cao
- BioNano Genomics, San Diego, California, United States of America
| | - Pui-Yan Kwok
- Institute for Human Genetics, University of California San Francisco, San Francisco, California, United States of America
| | - Karin R. Deal
- Department of Plant Sciences, University of California Davis, Davis, California, United States of America
| | - Jan Dvorak
- Department of Plant Sciences, University of California Davis, Davis, California, United States of America
| | - Ming-Cheng Luo
- Department of Plant Sciences, University of California Davis, Davis, California, United States of America
| | - Yong Gu
- Genomics and Gene Discovery Research Unit, United States Department of Agriculture - Agricultural Research Service, Albany, California, United States of America
- Department of Plant Sciences, University of California Davis, Davis, California, United States of America
- * E-mail: (MX); (YG)
| | - Ming Xiao
- BioNano Genomics, San Diego, California, United States of America
- * E-mail: (MX); (YG)
|