1
Liu M, Xu N, Chen B, Zhang Z, Chen X, Zhu Y, Hong W, Wang T, Zhang Q, Ye Y, Lu T, Qian H. Effects of different assembly strategies on gene annotation in activated sludge. Environmental Research 2024; 252:119116. [PMID: 38734289 DOI: 10.1016/j.envres.2024.119116]
Abstract
Activated sludge comprises diverse bacteria, fungi, and other microorganisms, featuring a rich repertoire of genes involved in antibiotic resistance, pollutant degradation, and elemental cycling. In this regard, hybrid assembly technology can revolutionize metagenomics by detecting greater gene diversity in environmental samples. Nonetheless, the optimal utilization and comparability of genomic information between hybrid assembly and short- or long-read technology remain unclear. To address this gap, we compared the performance of the hybrid assembly, short- and long-read technologies, abundance and diversity of annotated genes, and taxonomic diversity by analysing 46, 161, and 45 activated sludge metagenomic datasets, respectively. The results revealed that hybrid assembly technology exhibited the best performance, generating the most contiguous and longest contigs but with a lower proportion of high-quality metagenome-assembled genomes than short-read technology. Compared with short- or long-read technologies, hybrid assembly technology can detect a greater diversity of microbiota and antibiotic resistance genes, as well as a wider range of potential hosts. However, this approach may yield lower gene abundance and pathogen detection. Our study revealed the specific advantages and disadvantages of hybrid assembly and short- and long-read applications in wastewater treatment plants, and our approach could serve as a blueprint to be extended to terrestrial environments.
Affiliation(s)
- Meng Liu
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Nuohan Xu
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Bingfeng Chen
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Zhenyan Zhang
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Xinyu Chen
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Yuke Zhu
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Wenjie Hong
- Key Laboratory of Microbial Technology and Bioinformatics of Zhejiang Province, Hangzhou, 310012, PR China
- Tingzhang Wang
- Key Laboratory of Microbial Technology and Bioinformatics of Zhejiang Province, Hangzhou, 310012, PR China
- Qi Zhang
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Yangqing Ye
- College of Mechanical Engineering, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Tao Lu
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Haifeng Qian
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
2
Díaz-Domínguez D, Leinonen M, Salmela L. Space-efficient computation of k-mer dictionaries for large values of k. Algorithms Mol Biol 2024; 19:14. [PMID: 38581000 PMCID: PMC10996146 DOI: 10.1186/s13015-024-00259-1]
Abstract
Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task, but they are often optimised for small k, as a hash table keeping keys explicitly (i.e., k-mer sequences) takes O(Nk/w) computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for large values of k. This space usage is an important limitation, as analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using O(N + uk/w) words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by k-1 symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining k-1 symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes O(σk) time in the worst case, σ being the size of the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while keeping competitive performance when retrieving the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.
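The overlap-pointer idea described in this abstract can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: a plain dict stands in for Kaarme's hash table, entry fields are invented names, and the dict keys on full k-mer strings only to make duplicate detection obvious (exactly the explicit storage a real implementation avoids):

```python
K = 3  # toy k-mer length

def add_read(read, index, entries, k=K):
    """Insert every k-mer of `read`, chaining each entry to its predecessor."""
    prev = None  # entry id of the overlapping previous k-mer, if any
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in index:
            index[kmer]["count"] += 1
        else:
            entry = {"id": len(entries), "sym": kmer[-1], "prev": prev, "count": 1}
            if prev is None:          # chain head: keep the first k-1 symbols
                entry["head"] = kmer[:-1]
            entries.append(entry)
            index[kmer] = entry
        prev = index[kmer]["id"]

def decode(entry, entries, k=K):
    """Recover a k-mer by walking back through at most k-1 pointers."""
    syms = []
    e = entry
    while True:
        syms.append(e["sym"])
        if len(syms) == k:
            return "".join(reversed(syms))
        if e["prev"] is None:         # reached a chain head early:
            need = k - len(syms)      # take what is missing from its prefix
            return e["head"][-need:] + "".join(reversed(syms))
        e = entries[e["prev"]]

index, entries = {}, []
add_read("ABCDE", index, entries)
add_read("XABCY", index, entries)
```

Each non-head entry stores only one symbol plus a pointer, which is where the space saving over explicit k-mer keys comes from.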
Affiliation(s)
- Diego Díaz-Domínguez
- Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland
- Miika Leinonen
- Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland
- Leena Salmela
- Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland
3
Lopes F, Rossini M, Losacco F, Montanaro G, Gunter N, Tarasov S. Metagenomics reveals that dung beetles (Coleoptera: Scarabaeinae) broadly feed on reptile dung. Did they also feed on that of dinosaurs? Front Ecol Evol 2023. [DOI: 10.3389/fevo.2023.1132729]
Abstract
The origin of the dung-feeding habits in dung beetles (Coleoptera: Scarabaeinae) is debatable. According to traditional views, the evolution of dung beetles (Coleoptera: Scarabaeinae) and their feeding habits are largely attributed to mammal dung. In this paper, we challenge this view and provide evidence that many dung beetle communities are actually attracted to the dung of reptiles and birds (= Sauropsida). In turn, this indicates that sauropsid dung may have played a crucial evolutionary role that was previously underestimated. We argue that it is physiologically realistic to consider that coprophagy in dung beetles could have evolved during the Cretaceous in response to the dung produced by dinosaurs. Furthermore, we demonstrate that sauropsid dung may be one of the major factors driving the emergence of insular dung beetle communities across the globe. We support our findings with amplicon-metagenomic analyses, field observations, and meta-analysis of the published literature.
4
Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016]
Abstract
Next-generation sequencing has produced enormous amounts of short-read sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both de novo assembly and error correction methods exploit the overlaps between reads, a natural concern is whether the sequencing errors that degrade genome assemblies also hurt the compression of NGS data. This work addresses two problems: whether current error correction algorithms can enable compression algorithms to make the sequence data much more compact, and whether the reads modified by error-correction algorithms lead to quality improvements in de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in a collection of multiple read sets and then compress them simultaneously, for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in reference-free compression and hence greatly improve compression performance. Extensive tests on practical collections of multiple short-read sets confirm that compression performance on the error-corrected data (with unchanged size) significantly outperforms that on the original data, and that the file-reordering idea contributes further gains. The error correction of the original reads also improved the quality of the genome assemblies, sometimes remarkably. However, how to combine appropriate error correction methods with an assembly algorithm so that assembly performance is always significantly improved remains an open question.
Affiliation(s)
- Tao Tang
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
- School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China
- Gyorgy Hutvagner
- School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
- Wenjian Wang
- School of Computer and Information Technology, Shanxi University, Shanxi Road, 030006, Shanxi, China
- Jinyan Li
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
5
Eschenbrenner CJ, Feurtey A, Stukenbrock EH. Population Genomics of Fungal Plant Pathogens and the Analyses of Rapidly Evolving Genome Compartments. Methods Mol Biol 2021; 2090:337-355. [PMID: 31975174 DOI: 10.1007/978-1-0716-0199-0_14]
Abstract
Genome sequencing of fungal pathogens has documented extensive variation in genome structure and composition between species and, in many cases, between individuals of the same species. This type of genomic variation can be adaptive, allowing pathogens to rapidly evolve new virulence phenotypes. Analyses of genome-wide variation in fungal pathogen genomes rely on high-quality assemblies and on methods to detect and quantify structural variation. Population genomic studies in fungi have addressed the underlying mechanisms whereby structural variation can be rapidly generated. Transposable elements, high mutation and recombination rates, as well as incorrect chromosome segregation during mitosis and meiosis, contribute to the extensive variation observed in many species. Here we summarize key findings in the field of fungal pathogen genomics and discuss methods to detect and characterize structural variants, including an alignment-based pipeline to study variation in population genomic data.
Affiliation(s)
- Christoph J Eschenbrenner
- Environmental Genomics, Christian-Albrechts University of Kiel, Kiel, Germany
- Max Planck Institute for Evolutionary Biology, Plön, Germany
- Alice Feurtey
- Environmental Genomics, Christian-Albrechts University of Kiel, Kiel, Germany
- Max Planck Institute for Evolutionary Biology, Plön, Germany
- Eva H Stukenbrock
- Environmental Genomics, Christian-Albrechts University of Kiel, Kiel, Germany
- Max Planck Institute for Evolutionary Biology, Plön, Germany
6
Firtina C, Kim JS, Alser M, Senol Cali D, Cicek AE, Alkan C, Mutlu O. Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm. Bioinformatics 2020; 36:3669-3679. [PMID: 32167530 DOI: 10.1093/bioinformatics/btaa179]
Abstract
MOTIVATION Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology dependency and assembly-size dependency force researchers to (i) run multiple polishing algorithms and (ii) split a large genome into small chunks, in order to use all available read sets and to polish large genomes. RESULTS We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignments to train the pHMM with the Forward-Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. AVAILABILITY AND IMPLEMENTATION Source code is available at https://github.com/CMU-SAFARI/Apollo. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
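Apollo's decoding step is the standard Viterbi algorithm. As a point of reference, a generic textbook Viterbi decoder looks like the sketch below; the two-state HMM in the demo is invented purely for illustration and is unrelated to Apollo's pHMM structure:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for an observation sequence."""
    # V[t][s] = (best probability of a path ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max((V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                             for p in states)
            V[t][s] = (prob, prev)
    last = max(states, key=lambda s: V[-1][s][0])   # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):            # backtrack predecessors
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Invented two-state demo: state "A" mostly emits "x", state "B" mostly "y".
states = ("A", "B")
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
path = viterbi("xxyy", states, start, trans, emit)  # -> ["A", "A", "B", "B"]
```

In Apollo's setting the hidden states are positions and edit operations of the pHMM rather than this toy's two labels, but the dynamic-programming recurrence is the same.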
Affiliation(s)
- Can Firtina
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- Jeremie S Kim
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Mohammed Alser
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- Damla Senol Cali
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- A Ercument Cicek
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Onur Mutlu
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
7
New insights on Pseudoalteromonas haloplanktis TAC125 genome organization and benchmarks of genome assembly applications using next and third generation sequencing technologies. Sci Rep 2019; 9:16444. [PMID: 31712730 PMCID: PMC6848147 DOI: 10.1038/s41598-019-52832-z]
Abstract
Pseudoalteromonas haloplanktis TAC125 is among the most commonly studied bacteria adapted to cold environments. Aside from its ecological relevance, P. haloplanktis has potential for biotechnological applications. Given its importance, we took advantage of next-generation sequencing (Illumina) and third-generation sequencing (PacBio and Oxford Nanopore) technologies to resequence its genome. The availability of a reference genome, obtained using whole genome shotgun sequencing, allowed us to study and compare the results obtained by the different technologies and draw useful conclusions for future de novo genome assembly projects. We found that assembly polishing using Illumina reads is needed to achieve a consensus accuracy over 99.9% with Oxford Nanopore sequencing, but not with PacBio sequencing. However, the dependency of consensus accuracy on coverage is lower for Oxford Nanopore than for PacBio, suggesting that a cost-effective solution might be the use of low-coverage Oxford Nanopore sequencing together with Illumina reads. Despite the differences in consensus accuracy, all sequencing technologies revealed the presence of a large plasmid, pMEGA, which had remained undiscovered until now. Among the most interesting features of pMEGA is the presence of a putative error-prone polymerase regulated through the SOS response. Aside from the characterization of the newly discovered plasmid, we confirmed the sequence of the small plasmid pMtBL and uncovered the presence of a potential partitioning system. Crucially, this study shows that the combination of next- and third-generation sequencing technologies gives us an unprecedented opportunity to characterize our bacterial model organisms at a very detailed level.
8
Firtina C, Bar-Joseph Z, Alkan C, Cicek AE. Hercules: a profile HMM-based hybrid error correction algorithm for long reads. Nucleic Acids Res 2019; 46:e125. [PMID: 30124947 PMCID: PMC6265270 DOI: 10.1093/nar/gky724]
Abstract
Choosing whether to use second- or third-generation sequencing platforms can lead to trade-offs between accuracy and read length. Several types of studies require long and accurate reads. In such cases, researchers often combine both technologies, and the erroneous long reads are corrected using the short reads. Current approaches rely on various graph- or alignment-based techniques and do not take the error profile of the underlying technology into account. Efficient machine learning algorithms that address these shortcomings have the potential to achieve more accurate integration of these two technologies. We propose Hercules, the first machine learning-based long read error correction algorithm. Hercules models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read to correct errors in these reads. We show on two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) that Hercules-corrected reads have the highest mapping rate among all competing algorithms and the highest accuracy when the breadth of coverage is high. On a large human CHM1 cell line WGS data set, Hercules is one of the few scalable algorithms, and among those it achieves the highest accuracy.
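Hercules trains its per-read pHMMs with the Forward-Backward procedure. The forward pass of that procedure, on a generic discrete HMM with toy parameters invented for this sketch, can be written as:

```python
from itertools import product

def forward(obs, states, start_p, trans_p, emit_p):
    """Total likelihood of an observation sequence under a discrete HMM."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        # alpha[s] = P(observations so far, current state = s)
        alpha = {s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# Toy parameters (invented). A proper HMM assigns total probability one to
# the set of all sequences of a fixed length, a handy sanity check.
states = ("M", "E")
start = {"M": 0.7, "E": 0.3}
trans = {"M": {"M": 0.9, "E": 0.1}, "E": {"M": 0.5, "E": 0.5}}
emit = {"M": {"x": 0.95, "y": 0.05}, "E": {"x": 0.2, "y": 0.8}}
total = sum(forward("".join(o), states, start, trans, emit)
            for o in product("xy", repeat=3))  # -> 1.0 (up to rounding)
```

The backward pass is symmetric, and combining the two gives the posterior transition/emission estimates that a Forward-Backward trainer updates.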
Affiliation(s)
- Can Firtina
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Ziv Bar-Joseph
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- A Ercument Cicek
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
9
GAAP: A Genome Assembly + Annotation Pipeline. BioMed Research International 2019; 2019:4767354. [PMID: 31346518 PMCID: PMC6617929 DOI: 10.1155/2019/4767354]
Abstract
Genomic analysis begins with de novo assembly of short-read fragments in order to reconstruct full-length base sequences without exploiting a reference genome sequence. Then, in the annotation step, gene locations are identified within the base sequences, and the structures and functions of these genes are determined. Recently, a wide range of powerful tools have been developed and published for whole-genome analysis, enabling even individual researchers in small laboratories to perform whole-genome analyses on their objects of interest. However, these analytical tools are generally complex and use diverse algorithms, parameter-setting methods, and input formats; thus, it remains difficult for individual researchers to select, utilize, and combine these tools to obtain their final results. To resolve these issues, we have developed a genome analysis pipeline (GAAP) for semiautomated, iterative, and high-throughput analysis of whole-genome data. This pipeline is designed to perform read correction, de novo genome (transcriptome) assembly, gene prediction, and functional annotation using a range of proven tools and databases. We aim to assist non-IT researchers by describing each stage of analysis in detail and discussing current approaches. We also provide practical advice on how to access and use the bioinformatics tools and databases, and how to implement the provided suggestions. Whole-genome analysis of Toxocara canis is used as a case study to show intermediate results at each stage, demonstrating the practicality of the proposed method.
10
Dal E, Alkan C. Evaluation of genome scaffolding tools using pooled clone sequencing. Turk J Biol 2019; 42:471-476. [PMID: 30983868 PMCID: PMC6451843 DOI: 10.3906/biy-1805-42]
Abstract
DNA sequencing technologies hold great promise in generating information that will guide scientists to understand how the genome affects human health and organismal evolution. The process of generating raw genome sequence data is becoming cheaper and faster, but also more error-prone. Assembly of such data into high-quality finished genome sequences remains challenging. Many genome assembly tools are available, but they differ in terms of their performance and their final output. More importantly, it remains largely unclear how best to assess the quality of assembled genome sequences. Here we evaluate the accuracies of several genome scaffolding algorithms using two different types of data generated from the genome of the same human individual: whole genome shotgun sequencing (WGS) and pooled clone sequencing (PCS). We observe that it is possible to obtain better assemblies if PCS data are used, compared to using only WGS data. However, current scaffolding algorithms were developed only for WGS, and PCS-aware scaffolding remains an open problem.
Affiliation(s)
- Elif Dal
- Department of Computer Engineering, Faculty of Engineering, Bilkent University, Ankara, Turkey
- Can Alkan
- Department of Computer Engineering, Faculty of Engineering, Bilkent University, Ankara, Turkey
11
Gias E, Brosnahan CL, Orr D, Binney B, Ha HJ, Preece MA, Jones B. In vivo growth and genomic characterization of rickettsia-like organisms isolated from farmed Chinook salmon (Oncorhynchus tshawytscha) in New Zealand. Journal of Fish Diseases 2018; 41:1235-1245. [PMID: 29806079 DOI: 10.1111/jfd.12817]
Abstract
A rickettsia-like organism, designated NZ-RLO2, was isolated from Chinook salmon (Oncorhynchus tshawytscha) farmed in the South Island, New Zealand. In vivo growth showed that NZ-RLO2 was able to grow in the CHSE-214, EPC, BHK-21, C6/36 and Sf21 cell lines, while Piscirickettsia salmonis LF-89T grew in all but BHK-21 and Sf21. NZ-RLO2 grew optimally in EPC at 15°C, and in CHSE-214 and EPC at 18°C. The growth of LF-89T was optimal at 15°C, 18°C and 22°C in CHSE-214, but appeared less efficient in EPC cells at all temperatures. Pan-genome comparison of predicted proteomes showed that the available Chilean strains of P. salmonis grouped into two clusters (p-value = 94%). NZ-RLO2 was genetically different from the previously described NZ-RLO1; both strains grouped separately from the Chilean strains in one of the two clusters (p-value = 88%) but were closely related to each other. TaqMan and SYBR Green real-time PCR assays targeting RNA polymerase (rpoB) and DNA primase (dnaG), respectively, were developed to detect NZ-RLO2. This study indicates that the New Zealand strains show a closer genetic relationship to one of the Chilean P. salmonis clusters; however, more Piscirickettsia genomes from wider geographical regions and diverse hosts are needed to better understand the classification within this genus.
Affiliation(s)
- E Gias
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- C L Brosnahan
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- D Orr
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- B Binney
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- H J Ha
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- M A Preece
- New Zealand King Salmon, Picton, New Zealand
- B Jones
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- Murdoch University School of Veterinary and Life Sciences, Perth, WA, Australia
12
Hughes JA, Houghten S, Ashlock D. Restarting and recentering genetic algorithm variations for DNA fragment assembly: The necessity of a multi-strategy approach. Biosystems 2016; 150:35-45. [PMID: 27521768 DOI: 10.1016/j.biosystems.2016.08.001]
Abstract
DNA fragment assembly, an NP-hard problem, is one of the major steps in DNA sequencing. Multiple strategies have been used for this problem, including greedy graph-based algorithms, de Bruijn graphs, and the overlap-layout-consensus approach. This study focuses on the overlap-layout-consensus approach. Heuristics and computational intelligence methods are combined to exploit their respective benefits. These algorithm combinations were able to produce high-quality results, surpassing the best results obtained by a number of competitive algorithms specially designed and tuned for this problem on thirteen of sixteen popular benchmarks. This work also reinforces the necessity of using multiple search strategies, as it is clearly observed that algorithm performance depends on the problem instance; without a deeper look into many searches, top solutions could be missed entirely.
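The overlap-layout-consensus idea the study builds on can be illustrated with a minimal greedy merge. This is a naive quadratic sketch with invented helper names; the paper's genetic-algorithm machinery, restarts and recentering are deliberately omitted:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(frags):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(frags)
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = frags[i] + frags[j][n:]  # layout: glue b after a
        frags = [f for t, f in enumerate(frags) if t not in (i, j)] + [merged]
    return frags[0]
```

Greedy merging can get trapped by repeats, which is exactly why the paper explores multiple search strategies instead of a single heuristic.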
Affiliation(s)
- James Alexander Hughes
- Computer Science Department, Brock University, 500 Glenridge Ave., St. Catharines, Ontario L2S 3A1, Canada
- Sheridan Houghten
- Computer Science Department, Brock University, 500 Glenridge Ave., St. Catharines, Ontario L2S 3A1, Canada
- Daniel Ashlock
- Department of Mathematics and Statistics, University of Guelph, 50 Stone Rd. E, Guelph, Ontario N1G 2W1, Canada
13
Butorac A, Mekić MS, Hozić A, Diminić J, Gamberger D, Nišavić M, Cindrić M. Benefits of selective peptide derivatization with sulfonating reagent at acidic pH for facile matrix-assisted laser desorption/ionization de novo sequencing. Rapid Communications in Mass Spectrometry 2016; 30:1687-1694. [PMID: 28328037 DOI: 10.1002/rcm.7594]
Affiliation(s)
- Meliha Solak Mekić
- University Hospital Center Sestre Milosrdnice, University Hospital for Tumors, Center for Malignant Disease, Zagreb, Croatia
- Amela Hozić
- Centre for Proteomics and Mass Spectrometry and Laboratory for Information Systems, Ruđer Bošković Institute, Zagreb, Croatia
- Janko Diminić
- Faculty of Food Technology and Biotechnology, University of Zagreb, Zagreb, Croatia
- Dragan Gamberger
- Centre for Proteomics and Mass Spectrometry and Laboratory for Information Systems, Ruđer Bošković Institute, Zagreb, Croatia
- Marija Nišavić
- Laboratory of Physical Chemistry, Vinča Institute of Nuclear Sciences, University of Belgrade, Belgrade, Serbia
- Mario Cindrić
- Centre for Proteomics and Mass Spectrometry and Laboratory for Information Systems, Ruđer Bošković Institute, Zagreb, Croatia
14
El-Metwally S, Zakaria M, Hamza T. LightAssembler: fast and memory-efficient assembly algorithm for high-throughput sequencing reads. Bioinformatics 2016; 32:3215-3223. [PMID: 27412092 DOI: 10.1093/bioinformatics/btw470]
Abstract
MOTIVATION The deluge of sequence data being generated has outpaced Moore's law, more than doubling every 2 years since next-generation sequencing (NGS) technologies were invented. Accordingly, we will be able to generate more and more data at high speed and fixed cost, but we lack the computational resources to store, process and analyze it. With error-prone high-throughput NGS reads and genomic repeats, the assembly graph contains a massive amount of redundant nodes and branching edges. Most assembly pipelines require this large graph to reside in memory to start their workflows, which is intractable for mammalian genomes. Resource-efficient genome assemblers combine both the power of advanced computing techniques and innovative data structures to encode the assembly graph efficiently in computer memory. RESULTS LightAssembler is a lightweight assembly algorithm designed to be executed on a desktop machine. It uses a pair of cache-oblivious Bloom filters, one holding a uniform sample of spaced sequenced k-mers and the other holding k-mers classified as likely correct, using a simple statistical test. LightAssembler contains a light implementation of the graph traversal and simplification modules that achieves assembly accuracy and contiguity comparable to other competing tools. Our method reduces the memory usage compared to current resource-efficient assemblers on benchmark datasets from the GAGE and Assemblathon projects. While LightAssembler can be considered a gap-based sequence assembler, different gap sizes result in an almost constant assembly size and genome coverage. AVAILABILITY AND IMPLEMENTATION https://github.com/SaraEl-Metwally/LightAssembler CONTACT sarah_almetwally4@mans.edu.eg SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
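The two-filter idea can be sketched with a toy Bloom filter. Everything here is invented for illustration (filter size, hash scheme, the spacing parameter `g`, and the absence of any real correctness test); LightAssembler's cache-oblivious filters are far more engineered:

```python
import hashlib

class Bloom:
    """Minimal Bloom filter: false positives possible, false negatives not."""
    def __init__(self, m=1024, hashes=3):
        self.m, self.h, self.bits = m, hashes, bytearray(m)

    def _positions(self, item):
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

# One filter holds a uniform sample of spaced k-mers from the reads; the
# other would hold k-mers accepted as likely correct by a statistical test.
sampled, trusted = Bloom(), Bloom()
g, k = 2, 4                        # spacing and k-mer length (arbitrary here)
read = "ACGTACGTTG"
for i in range(0, len(read) - k + 1, g):
    sampled.add(read[i:i + k])     # sample every g-th k-mer of the read
```

Because a Bloom filter stores only bit positions, both filters together stay small enough for a desktop machine, at the cost of a tunable false-positive rate.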
Affiliation(s)
- Sara El-Metwally
- Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
- Magdi Zakaria
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
- Taher Hamza
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt

15
Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novo stand-alone error correction methods for NGS data. WILEY INTERDISCIPLINARY REVIEWS: COMPUTATIONAL MOLECULAR SCIENCE 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València, Spain
- David Ruzafa
- Departamento de Química Física e Instituto de Biotecnología, Facultad de Ciencias; Universidad de Granada; Granada, Spain
- Joaquin Dopazo
- Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia, Spain
- CIBER de Enfermedades Raras (CIBERER); Valencia, Spain
- Functional Genomics Node (INB) at CIPF; Valencia, Spain
- Ignacio Blanquer
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València, Spain
- Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia, Spain

16
Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 179] [Impact Index Per Article: 22.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and distinguishing true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them, and the data structures and algorithms they use. We highlight the assumptions they make and the data types for which these hold, providing guidance on which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
17
Abstract
Background In highly parallel next-generation sequencing (NGS) techniques, millions to billions of short reads are produced from a genomic sequence in a single run. Owing to the limitations of NGS technologies, the reads may contain errors. The error rate of the reads can be reduced by trimming and by correcting the erroneous bases of the reads. Correcting the reads first yields high-quality data and greatly reduces the computational complexity of many downstream biological applications. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads. Results We have performed extensive and rigorous experiments that reveal that EC is indeed an effective, scalable, and efficient error correction tool. The real reads employed in our performance evaluation are Illumina-generated short reads of various lengths. The six experimental datasets we utilized are taken from the Sequence Read Archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes; to introduce errors, some bases of the simulated reads are changed to other bases with some probability. Conclusions Error correction is a vital problem in biology, especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate the possibility of employing the techniques introduced in this research paper to handle insertion and deletion errors as well. Software availability The implementation is freely available for non-commercial purposes and can be downloaded from: http://engr.uconn.edu/~rajasek/EC.zip.
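The substitution-correction idea summarized above, which several other tools in this list share, can be sketched in a few lines: count k-mers across all reads, call a k-mer trusted when it occurs at least t times, and repair an untrusted k-mer by trying single-base substitutions until a trusted one is found. This is a minimal illustration of the general k-mer-spectrum approach, not EC's actual algorithm; the function names and thresholds are invented for the sketch.

```python
from collections import Counter

def trusted_kmers(reads, k=15, t=3):
    """Count k-mers over all reads; keep those seen at least t times."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return {km for km, n in counts.items() if n >= t}

def correct_read(read, trusted, k=15):
    """Greedy single-substitution correction: for each untrusted k-mer,
    try every one-base change and accept the first that becomes trusted."""
    seq = list(read)
    for i in range(len(seq) - k + 1):
        if "".join(seq[i:i + k]) in trusted:
            continue
        fixed = False
        for j in range(k):
            orig = seq[i + j]
            for b in "ACGT":
                if b == orig:
                    continue
                seq[i + j] = b
                if "".join(seq[i:i + k]) in trusted:
                    fixed = True
                    break
                seq[i + j] = orig
            if fixed:
                break
    return "".join(seq)
```

In practice the threshold t must be chosen from the coverage, since rare true variants and repeated errors both sit near the trusted/untrusted boundary.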
18
Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015; 31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022] Open
19
Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol 2015; 15:509. [PMID: 25398208 PMCID: PMC4248469 DOI: 10.1186/s13059-014-0509-9] [Citation(s) in RCA: 144] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2014] [Indexed: 02/02/2023] Open
Abstract
Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
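The pair-of-Bloom-filters design described above can be illustrated with a toy implementation: a fixed-size Bloom filter gives constant memory, and deterministic hash-based subsampling keeps a fraction of k-mers roughly inverse to the sequencing depth, so that each distinct k-mer is either always or never sampled. The class below is a simplified sketch, not Lighter's actual code; the bit-array size and hash scheme are illustrative.

```python
import hashlib

class Bloom:
    """Toy Bloom filter: fixed bit array queried via h sha256-derived hashes."""
    def __init__(self, bits=1 << 16, h=3):
        self.m, self.h = bits, h
        self.a = bytearray(bits // 8)

    def _pos(self, item):
        for i in range(self.h):
            d = hashlib.sha256(f"{i}|{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item):
        for p in self._pos(item):
            self.a[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.a[p // 8] >> (p % 8) & 1 for p in self._pos(item))

def subsample(kmer, depth):
    """Keep a k-mer with probability ~1/depth, deterministically by hash,
    so the same k-mer is consistently sampled or skipped across all reads."""
    d = hashlib.sha256(kmer.encode()).digest()
    return int.from_bytes(d[:8], "big") % depth == 0
```

Because the filter size is fixed up front, memory stays constant as depth grows; only the sampling denominator changes, which is the property the abstract highlights.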
20
Abstract
BFC is a free, fast and easy-to-use sequencing error corrector designed for Illumina short reads. It uses a non-greedy algorithm but still maintains a speed comparable to implementations based on greedy methods. In evaluations on real data, BFC appears to correct more errors with fewer overcorrections in comparison to existing tools. It does particularly well in suppressing systematic sequencing errors, which helps to improve the base accuracy of de novo assemblies. AVAILABILITY AND IMPLEMENTATION https://github.com/lh3/bfc CONTACT hengli@broadinstitute.org SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Heng Li
- Medical Population Genetics Program, Broad Institute, Cambridge, MA 02142, USA

21
Ashton PM, Perry N, Ellis R, Petrovska L, Wain J, Grant KA, Jenkins C, Dallman TJ. Insight into Shiga toxin genes encoded by Escherichia coli O157 from whole genome sequencing. PeerJ 2015; 3:e739. [PMID: 25737808 PMCID: PMC4338798 DOI: 10.7717/peerj.739] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2014] [Accepted: 01/05/2015] [Indexed: 11/20/2022] Open
Abstract
The ability of Shiga toxin-producing Escherichia coli (STEC) to cause severe illness in humans is determined by multiple host factors and bacterial characteristics, including Shiga toxin (Stx) subtype. Given the link between Stx2a subtype and disease severity, we sought to identify the stx subtypes present in whole genome sequences (WGS) of 444 isolates of STEC O157. Difficulties in assembling the stx genes in some strains were overcome by using two complementary bioinformatics methods: mapping and de novo assembly. All strains of STEC O157 in this study had stx1a, stx2a or stx2c or a combination of these three genes. There was over 99% (442/444) concordance between PCR and WGS. When common source strains were excluded, 236/349 strains of STEC O157 had multiple copies of different Stx subtypes and 54 had multiple copies of the same Stx subtype. Of those strains harbouring multiple copies of the same Stx subtype, 33 had variants between the alleles while 21 had identical copies. Strains harbouring Stx2a only were most commonly found to have multiple alleles of the same subtype (42%). Both the PCR and WGS approaches to stx subtyping provided a good level of sensitivity and specificity. In addition, the WGS data showed that a significant proportion of strains harbouring multiple alleles of the same Stx subtype were associated with clinical disease in England.
Affiliation(s)
- Philip M Ashton
- Gastrointestinal Bacteria Reference Unit, Public Health England, London, UK
- Neil Perry
- Gastrointestinal Bacteria Reference Unit, Public Health England, London, UK
- Richard Ellis
- Animal & Plant Health Agency, New Haw, Addlestone, Surrey, UK
- John Wain
- University of East Anglia, Norwich Research Park, Norwich, UK
- Kathie A Grant
- Gastrointestinal Bacteria Reference Unit, Public Health England, London, UK
- Claire Jenkins
- Gastrointestinal Bacteria Reference Unit, Public Health England, London, UK
- Tim J Dallman
- Gastrointestinal Bacteria Reference Unit, Public Health England, London, UK

22
Black JS, Salto-Tellez M, Mills KI, Catherwood MA. The impact of next generation sequencing technologies on haematological research – A review. Pathogenesis 2015. [DOI: 10.1016/j.pathog.2015.05.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
23
Pu D, Qi Y, Cui L, Xiao P, Lu Z. A real-time decoding sequencing based on dual mononucleotide addition for cyclic synthesis. Anal Chim Acta 2014; 852:274-83. [PMID: 25441908 DOI: 10.1016/j.aca.2014.09.009] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2014] [Revised: 08/28/2014] [Accepted: 09/08/2014] [Indexed: 11/19/2022]
Abstract
We propose a real-time decoding sequencing strategy in which a template is determined not by directly measuring the base sequence but by decoding two sets of encodings obtained from two parallel sequencing runs. This strategy relies on adding a mixture of two different bases, A+G, C+T, A+C, G+T, A+T or C+G (abbreviated as AG, CT, AC, GT, AT, or CG), to the reaction each time. When a template is cyclically interrogated twice with any two of the dual mononucleotide additions (AG/CT, AC/GT, and AT/CG), two sets of encodings are obtained sequentially. These two sets of encodings allow the bases to be decoded sequentially, from first to last, in a deterministic manner. This strategy requires fewer cycles to obtain a longer read length compared with the traditional real-time sequencing strategy. A partial rnpB gene was used to verify the applicability of the decoding strategy via pyrosequencing. The results indicated that the sequence could be reconstructed by decoding the two sets of encodings. Moreover, streptococcal strains could be differentiated by comparing the signal intensity in each cycle and the encoding size of each template. This strategy is likely to be applicable to differentiating nucleic acid sequences, as encoding size and signal intensity in each cycle vary with base size and composition. Furthermore, it has the potential to serve as an alternative to conventional sequencing systems.
Affiliation(s)
- Dan Pu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Yuhua Qi
- Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, Jiangsu 210009, China
- Lunbiao Cui
- Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, Jiangsu 210009, China
- Pengfeng Xiao
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Zuhong Lu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China

24
|
Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform 2014; 16:588-99. [PMID: 25183248 DOI: 10.1093/bib/bbu029] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2014] [Accepted: 08/02/2014] [Indexed: 11/12/2022] Open
Abstract
Next-generation sequencing technologies have revolutionized the ways in which genetic information is obtained and have opened the door for many essential applications in biomedical sciences. Hundreds of gigabytes of data are being produced, and all applications are affected by errors in the data. Many programs have been designed to correct these errors, most of them targeting data produced by the dominant technology of Illumina. We present a thorough comparison of these programs. Both HiSeq and MiSeq types of Illumina data are analyzed, and correcting performance is evaluated as the gain in depth and breadth of coverage, as given by correct reads and k-mers. Time and memory requirements, scalability and parallelism are considered as well. Practical guidelines are provided for the effective use of these tools. We also evaluate the efficiency of the current state-of-the-art programs for correcting Illumina data and provide research directions for further improvement.
25
Abstract
Motivation: PacBio single-molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and a higher error rate. Errors include numerous indels and complicate downstream analyses such as mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping short reads onto long reads provides sufficient coverage to eliminate up to 99% of errors, however, at the expense of prohibitive running times and considerable amounts of disk and memory space. Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy. Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec. Contact: lordec@lirmm.fr. Supplementary information: Supplementary data are available at Bioinformatics online.
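The core operation LoRDEC performs, bridging two solid k-mers of a long read through the de Bruijn graph built from accurate short reads, can be sketched as a bounded breadth-first search over trusted k-mers. This is a simplified illustration of the idea, not LoRDEC's C++ implementation (which uses a succinct graph representation and more elaborate path scoring); the function name and parameters are invented.

```python
from collections import deque

def bridge(left, right, trusted, max_len):
    """BFS from the solid k-mer `left` to `right` through trusted k-mers,
    extending one base at a time; returns the spelled sequence or None.
    `trusted` is the set of k-mers observed in the accurate short reads."""
    queue = deque([(left, left)])   # (current k-mer, sequence spelled so far)
    seen = {left}
    while queue:
        kmer, path = queue.popleft()
        if kmer == right:
            return path             # corrective sequence between the anchors
        if len(path) >= max_len:
            continue                # bound the search around the gap size
        for base in "ACGT":
            nxt = kmer[1:] + base   # de Bruijn edge: shift window by one base
            if nxt in trusted and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + base))
    return None
```

In a real corrector the erroneous long-read region between two solid anchors is replaced by the spelled path, and `max_len` is derived from the observed gap length.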
Affiliation(s)
- Leena Salmela
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland, and LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France
- Eric Rivals
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland, and LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France

26
Wirawan A, Harris RS, Liu Y, Schmidt B, Schröder J. HECTOR: a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. BMC Bioinformatics 2014; 15:131. [PMID: 24885381 PMCID: PMC4023493 DOI: 10.1186/1471-2105-15-131] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2013] [Accepted: 04/24/2014] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Current-generation sequencing technologies are able to produce low-cost, high-throughput reads. However, the reads produced are imperfect and may contain various sequencing errors. Although many error correction methods have been developed in recent years, none explicitly targets homopolymer-length errors in 454 sequencing reads. RESULTS We present HECTOR, a parallel multistage homopolymer-spectrum-based error corrector for 454 sequencing data. In this algorithm, we have for the first time investigated a novel homopolymer-spectrum-based approach to handle homopolymer insertions or deletions, which are the dominant sequencing errors in 454 pyrosequencing reads. We have evaluated the performance of HECTOR, in terms of correction quality, runtime and parallel scalability, using both simulated and real pyrosequencing datasets. This performance has been further compared to that of Coral, a state-of-the-art error corrector based on multiple sequence alignment, and Acacia, a recently published error corrector for amplicon pyrosequences. Our evaluations reveal that HECTOR demonstrates comparable correction quality to Coral, but runs 3.7× faster on average. In addition, HECTOR performs well even when the coverage of the dataset is low. CONCLUSION Our homopolymer-spectrum-based approach is theoretically capable of processing homopolymer-length errors of arbitrary size, with linear time complexity. HECTOR employs a multi-threaded design based on a master-slave computing model. Our experimental results show that HECTOR is a practical 454 pyrosequencing read error corrector that is competitive in terms of both correction quality and speed. The source code and all simulated data are available at: http://hector454.sourceforge.net.
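A homopolymer spectrum rests on run-length encoding: a read is compressed into (base, length) pairs, so a homopolymer-length error changes only a count while the compressed base sequence is unchanged. A minimal sketch of that encoding (the function names are illustrative, not HECTOR's API):

```python
from itertools import groupby

def homopolymer_encode(read):
    """Run-length encode a read: 'AAATTC' -> [('A', 3), ('T', 2), ('C', 1)]."""
    return [(base, len(list(run))) for base, run in groupby(read)]

def homopolymer_sequence(read):
    """The compressed base sequence, invariant under homopolymer-length
    errors: 'AAATTC' and 'AATTTC' both compress to 'ATC'."""
    return "".join(base for base, _ in groupby(read))
```

A corrector in this space can then fix indels by adjusting run lengths toward the consensus, rather than editing individual bases.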
Affiliation(s)
- Adrianto Wirawan
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz, Germany

27
Lee H, Popodi E, Foster PL, Tang H. Detection of structural variants involving repetitive regions in the reference genome. J Comput Biol 2014; 21:219-33. [PMID: 24552580 DOI: 10.1089/cmb.2013.0129] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023] Open
Abstract
Next-generation sequencing techniques are now commonly used to characterize structural variations (SVs) in population genomics and elucidate their associations with phenotypes. Many of the computational tools developed for detecting structural variations work by mapping paired-end reads to a reference genome and identifying the discordant read-pairs whose mapped loci in the reference genome deviate from the expected insert size and orientation. However, repetitive regions in the reference genome represent a major challenge in SV detection, because the paired-end reads from these regions may be mapped to multiple loci in the reference genome, resulting in spuriously discordant read-pairs. To address this issue, we have developed an algorithmic approach for read mapping and SV detection based on the framework of A-Bruijn graphs. Instead of mapping reads to a linear sequence of the reference genome, we propose to map reads onto the A-Bruijn graph constructed from the reference genome in which all instances of the same repeat are collapsed into a single edge. As a result, any given read, either from repetitive regions or not, will be mapped to a unique location in the A-Bruijn graph, and each discordant read-pair in the A-Bruijn graph indicates a potentially true SV event. We also developed a simple clustering algorithm to derive valid clusters of these discordant read-pairs, each supporting a different SV event. Finally, we demonstrate the performance of this approach, compared to existing approaches, by identifying transposition events of insertion sequence (IS) elements, a class of simple mobile genetic elements (MGEs), in E. coli by using simulated and real paired-end sequence data acquired from E. coli mutation accumulation lines.
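The discordant read-pair test that such SV callers build on can be sketched directly: estimate the library's insert-size distribution from the mapped pairs, then flag pairs whose implied insert deviates by more than a few robust standard deviations. The sketch below uses a median/MAD estimate so that the discordant pairs themselves do not skew the baseline; the cutoff and tuple layout are illustrative, not the paper's A-Bruijn implementation.

```python
from statistics import median

def discordant_pairs(pairs, cutoff=3.0):
    """pairs: list of (left_pos, right_pos) mapped coordinates of read pairs.
    Estimate the expected insert size robustly (median / MAD), then return
    indices of pairs deviating by more than `cutoff` robust standard
    deviations: candidate structural-variant signals."""
    inserts = [r - l for l, r in pairs]
    med = median(inserts)
    mad = median(abs(x - med) for x in inserts)
    sigma = 1.4826 * mad or 1.0  # MAD -> sigma for a normal; avoid zero
    return [i for i, x in enumerate(inserts) if abs(x - med) > cutoff * sigma]
```

A full caller would also check read orientation and then cluster nearby discordant pairs, each cluster supporting one SV event, as the abstract describes.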
Affiliation(s)
- Heewook Lee
- School of Informatics and Computing, Indiana University, Bloomington, Indiana

28
El-Metwally S, Ouda OM, Helmy M. Approaches and Challenges of Next-Generation Sequence Assembly Stages. NEXT GENERATION SEQUENCING TECHNOLOGIES AND CHALLENGES IN SEQUENCE ASSEMBLY 2014. [DOI: 10.1007/978-1-4939-0715-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
29
Abstract
The genomic revolution promises great advances in the search for useful biocatalysts. Function-based metagenomic approaches have identified several enzymes with properties that make them useful candidates for a variety of bioprocesses. As DNA sequencing costs continue to decline, the volume of genomic data, along with their corresponding predicted protein sequences, will continue to increase dramatically, necessitating new approaches to leverage this information for gene-based bioprospecting efforts. Additionally, as new functions are discovered and correlated with this sequence information, the knowledge of the often complex relationship between a protein's sequence and function will improve. This in turn will lead to better gene-based bioprospecting approaches and facilitate the tailoring of desired properties through protein engineering projects. In this chapter, we discuss a number of recent advances in bioprospecting within the context of the genomic age.
Affiliation(s)
- Michael A Hicks
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Kristala L J Prather
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Synthetic Biology Engineering Research Center (SynBERC), Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

30
El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013; 9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
Abstract
Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers still remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, a graph construction process, a graph simplification process, and postprocessing filtering. We discuss these stages as a framework for data analysis and processing and survey a variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that current assemblers face in the next-generation environment in order to determine the current state of the art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
Affiliation(s)
- Sara El-Metwally
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Taher Hamza
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Magdi Zakaria
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Mohamed Helmy
- Botany Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
- Biotechnology Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt

31
Predominant Acidilobus-like populations from geothermal environments in Yellowstone National Park exhibit similar metabolic potential in different hypoxic microbial communities. Appl Environ Microbiol 2013; 80:294-305. [PMID: 24162572 DOI: 10.1128/aem.02860-13] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open
Abstract
High-temperature (>70°C) ecosystems in Yellowstone National Park (YNP) provide an unparalleled opportunity to study chemotrophic archaea and their role in microbial community structure and function under highly constrained geochemical conditions. Acidilobus spp. (order Desulfurococcales) comprise one of the dominant phylotypes in hypoxic geothermal sulfur sediment and Fe(III)-oxide environments along with members of the Thermoproteales and Sulfolobales. Consequently, the primary goals of the current study were to analyze and compare replicate de novo sequence assemblies of Acidilobus-like populations from four different mildly acidic (pH 3.3 to 6.1) high-temperature (72°C to 82°C) environments and to identify metabolic pathways and/or protein-encoding genes that provide a detailed foundation of the potential functional role of these populations in situ. De novo assemblies of the highly similar Acidilobus-like populations (>99% 16S rRNA gene identity) represent near-complete consensus genomes based on an inventory of single-copy genes, deduced metabolic potential, and assembly statistics generated across sites. Functional analysis of coding sequences and confirmation of gene transcription by Acidilobus-like populations provide evidence that they are primarily chemoorganoheterotrophs, generating acetyl coenzyme A (acetyl-CoA) via the degradation of carbohydrates, lipids, and proteins, and auxotrophic with respect to several external vitamins, cofactors, and metabolites. No obvious pathways or protein-encoding genes responsible for the dissimilatory reduction of sulfur were identified. The presence of a formate dehydrogenase (Fdh) and other protein-encoding genes involved in mixed-acid fermentation supports the hypothesis that Acidilobus spp. function as degraders of complex organic constituents in high-temperature, mildly acidic, hypoxic geothermal systems.
32
Rapid quantification of sequence repeats to resolve the size, structure and contents of bacterial genomes. BMC Genomics 2013; 14:537. [PMID: 23924250 PMCID: PMC3751351 DOI: 10.1186/1471-2164-14-537] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2013] [Accepted: 08/03/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The numerous classes of repeats often impede the assembly of genome sequences from the short reads provided by new sequencing technologies. We demonstrate a simple and rapid means to ascertain the repeat structure and total size of a bacterial or archaeal genome without the need for assembly by directly analyzing the abundances of distinct k-mers among reads. RESULTS The sensitivity of this procedure to resolve variation within a bacterial species is demonstrated: genome sizes and repeat structure of five environmental strains of E. coli from short Illumina reads were estimated by this method, and total genome sizes corresponded well with those obtained for the same strains by pulsed-field gel electrophoresis. In addition, this approach was applied to read-sets for completed genomes and shown to be accurate over a wide range of microbial genome sizes. CONCLUSIONS Application of these procedures, based solely on k-mer abundances in short read data sets, allows aspects of genome structure to be resolved that are not apparent from conventional short read assemblies. This knowledge of the repetitive content of genomes provides insights into genome evolution and diversity.
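The assembly-free size estimate described above reduces to a single identity: if the histogram of k-mer counts peaks at the per-base coverage c, then genome size ≈ (total k-mer occurrences) / c. A minimal sketch of that procedure follows; it is an illustration of the principle, not the paper's pipeline, and with tiling reads edge effects make the estimate a few percent low.

```python
from collections import Counter

def genome_size_estimate(reads, k=21):
    """Estimate genome size from k-mer abundances alone (no assembly):
    total k-mer occurrences divided by the modal k-mer count, which
    approximates the per-k-mer coverage depth."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    total = sum(counts.values())
    # The most frequent abundance value is the coverage peak of the histogram.
    coverage = Counter(counts.values()).most_common(1)[0][0]
    return total // coverage
```

Repeats show up in the same histogram as secondary peaks at multiples of the coverage, which is how the repeat structure itself is resolved in the paper.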
33
MacManes MD, Eisen MB. Improving transcriptome assembly through error correction of high-throughput sequence reads. PeerJ 2013; 1:e113. [PMID: 23904992 PMCID: PMC3728768 DOI: 10.7717/peerj.113] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2013] [Accepted: 07/03/2013] [Indexed: 01/20/2023] Open
Abstract
The study of functional genomics, particularly in non-model organisms, has been dramatically improved over the last few years by the use of transcriptomes and RNAseq. While these studies are potentially extremely powerful, a computationally intensive procedure, the de novo construction of a reference transcriptome, must be completed as a prerequisite to further analyses. An accurate reference is critically important, as all downstream steps, including estimating transcript abundance, depend on it. Though a substantial amount of research has been done on assembly, only recently have the pre-assembly procedures been studied in detail. Specifically, several stand-alone error correction modules have been reported on and, while they have been shown to be effective in reducing errors at the level of sequencing reads, how error correction impacts assembly accuracy is largely unknown. Here, we show, using simulated and empirical datasets, that applying error correction to sequencing reads has significant positive effects on assembly accuracy, and should be applied to all datasets. A complete collection of commands that will allow for the production of Reptile-corrected reads is available at https://github.com/macmanes/error_correction/tree/master/scripts and as File S1.
Affiliation(s)
- Matthew D. MacManes
- California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Michael B. Eisen
- California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Howard Hughes Medical Institute, USA
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA

34
Abstract
Motivation: High-throughput next-generation sequencing technologies enable increasingly fast and affordable sequencing of genomes and transcriptomes, with a broad range of applications. The quality of the sequencing data is crucial for all applications. A significant portion of the data produced contains errors, and ever more efficient error correction programs are needed. Results: We propose RACER (Rapid and Accurate Correction of Errors in Reads), a new software program for correcting errors in sequencing data. RACER has better error-correcting performance than existing programs, is faster and requires less memory. To support our claims, we performed extensive comparison with the existing leading programs on a variety of real datasets. Availability: RACER is freely available for non-commercial use at www.csd.uwo.ca/~ilie/RACER/.
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, ON, Canada
35
Forde BM, O'Toole PW. Next-generation sequencing technologies and their impact on microbial genomics. Brief Funct Genomics 2013; 12:440-53. [PMID: 23314033 DOI: 10.1093/bfgp/els062] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
Next-generation sequencing technologies have had a dramatic impact in the field of genomic research through the provision of a low cost, high-throughput alternative to traditional capillary sequencers. These new sequencing methods have surpassed their original scope and now provide a range of utility-based applications, which allow for a more comprehensive analysis of the structure and content of microbial genomes than was previously possible. With the commercialization of a third generation of sequencing technologies imminent, we discuss the applications of current next-generation sequencing methods and explore their impact on and contribution to microbial genome research.
Affiliation(s)
- Brian M Forde
- Department of Microbiology, University College Cork, Cork, Ireland.
36
Liu Y, Schröder J, Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 2012. [PMID: 23202746 DOI: 10.1093/bioinformatics/bts690] [Citation(s) in RCA: 175] [Impact Index Per Article: 14.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
Motivation: The imperfect sequence data produced by next-generation sequencing technologies have motivated the development of a number of short-read error correctors in recent years. The majority of methods focus on the correction of substitution errors, which are the dominant error source in data produced by Illumina sequencing technology. Existing tools score high in terms of either recall or precision, but not consistently high in terms of both measures. Results: In this article, we present Musket, an efficient multistage k-mer-based corrector for Illumina short-read data. We use the k-mer spectrum approach and introduce three correction techniques in a multistage workflow: two-sided conservative correction, one-sided aggressive correction and voting-based refinement. Our performance evaluation results, in terms of correction quality and de novo genome assembly measures, reveal that Musket is consistently one of the top performing correctors. In addition, Musket is multi-threaded using a master-slave model and demonstrates superior parallel scalability compared with all other evaluated correctors as well as a highly competitive overall execution time. Availability: Musket is available at http://musket.sourceforge.net.
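The two-sided conservative correction idea that Musket builds on can be illustrated with a small sketch: count the k-mer spectrum of the read set, then change a base only when a single substitution turns every k-mer covering that position into a trusted (frequent) k-mer. This is a toy of my own, not Musket's actual multistage workflow; the `cutoff` threshold and the greedy left-to-right scan are illustrative simplifications.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count all k-mers across the read set (the k-mer spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, cutoff=2):
    """Two-sided-style correction sketch: a base is changed only if one
    substitution makes every k-mer covering that position trusted
    (count >= cutoff); otherwise it is left alone."""
    read = list(read)
    for i, base in enumerate(read):
        # k-mer windows that cover position i
        lo, hi = max(0, i - k + 1), min(len(read) - k, i)
        windows = list(range(lo, hi + 1))
        if all(counts[''.join(read[j:j + k])] >= cutoff for j in windows):
            continue  # position already covered only by trusted k-mers
        for alt in 'ACGT':
            if alt == base:
                continue
            trial = read[:]
            trial[i] = alt
            if all(counts[''.join(trial[j:j + k])] >= cutoff for j in windows):
                read[i] = alt  # unique-looking fix: accept it
                break
    return ''.join(read)
```

With five clean copies of a read and one copy carrying a single substitution, the erroneous base is the only one whose correction makes all covering k-mers trusted, so it is repaired while correct reads pass through unchanged.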
Affiliation(s)
- Yongchao Liu
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz 55099, Germany.
37
Bashir A, Klammer A, Robins WP, Chin CS, Webster D, Paxinos E, Hsu D, Ashby M, Wang S, Peluso P, Sebra R, Sorenson J, Bullard J, Yen J, Valdovino M, Mollova E, Luong K, Lin S, LaMay B, Joshi A, Rowe L, Frace M, Tarr CL, Turnsek M, Davis BM, Kasarskis A, Mekalanos JJ, Waldor MK, Schadt EE. A hybrid approach for the automated finishing of bacterial genomes. Nat Biotechnol 2012; 30:701-707. [PMID: 22750883 DOI: 10.1038/nbt.2288] [Citation(s) in RCA: 158] [Impact Index Per Article: 13.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2012] [Accepted: 05/30/2012] [Indexed: 02/08/2023]
Abstract
Advances in DNA sequencing technology have improved our ability to characterize most genomic diversity. However, accurate resolution of large structural events is challenging because of the short read lengths of second-generation technologies. Third-generation sequencing technologies, which can yield longer multikilobase reads, have the potential to address limitations associated with genome assembly. Here we combine sequencing data from second- and third-generation DNA sequencing technologies to assemble the two-chromosome genome of a recent Haitian cholera outbreak strain into two nearly finished contigs at >99.9% accuracy. Complex regions with clinically relevant structure were completely resolved. In separate control assemblies on experimental and simulated data for the canonical N16961 cholera reference strain, we obtained 14 scaffolds of greater than 1 kb for the experimental data and 8 scaffolds of greater than 1 kb for the simulated data, which allowed us to correct several errors in contigs assembled from the short-read data alone. This work provides a blueprint for the next generation of rapid microbial identification and full-genome assembly.
Affiliation(s)
- Lori Rowe
- National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
- Michael Frace
- National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
- Cheryl L Tarr
- National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
- Maryann Turnsek
- National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
- Brigid M Davis
- Channing Laboratory, Brigham and Women's Hospital, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- Department of Microbiology and Molecular Genetics, Harvard Medical School, Boston, MA
- Howard Hughes Medical Institute, Boston, MA
- Matthew K Waldor
- Channing Laboratory, Brigham and Women's Hospital, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- Department of Microbiology and Molecular Genetics, Harvard Medical School, Boston, MA
- Howard Hughes Medical Institute, Boston, MA
- Eric E Schadt
- Pacific Biosciences, Menlo Park, CA
- Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York City
38
Skums P, Dimitrova Z, Campo DS, Vaughan G, Rossi L, Forbi JC, Yokosawa J, Zelikovsky A, Khudyakov Y. Efficient error correction for next-generation sequencing of viral amplicons. BMC Bioinformatics 2012; 13 Suppl 10:S6. [PMID: 22759430 PMCID: PMC3382444 DOI: 10.1186/1471-2105-13-s10-s6] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023] Open
Abstract
Background: Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing. Results: In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones. Conclusions: Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm
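The frequency-threshold idea behind an approach like ET lends itself to a very small sketch: collapse identical amplicon reads into candidate haplotypes and keep only those whose frequency clears a cutoff, treating rarer variants as presumed sequencing errors. The function name and the fixed `threshold` parameter are illustrative assumptions; the published method calibrates its threshold empirically from amplicon-specific error characteristics.

```python
from collections import Counter

def frequency_filter(amplicon_reads, threshold):
    """Collapse identical amplicon reads into candidate haplotypes and
    keep those at or above the given relative frequency; everything
    rarer is presumed to be sequencing error."""
    counts = Counter(amplicon_reads)
    total = len(amplicon_reads)
    return {seq: n / total for seq, n in counts.items()
            if n / total >= threshold}
```

For example, nine identical reads plus one read with a single substitution, filtered at a 20% threshold, retain only the majority haplotype.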
Affiliation(s)
- Pavel Skums
- Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, Atlanta, GA 30333, USA.
39
Abstract
Background: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments. Methods: In this paper, we demonstrate that constructing a sparse assembly graph, which stores only a small fraction of the observed k-mers as nodes and the links between these nodes, allows the de novo assembly of even moderately sized genomes (~500 Mb) on a typical laptop computer. Results: We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.
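The sparse-graph idea can be illustrated with a toy in which only every g-th k-mer of each read becomes a node, and each link stores the g bases bridging one sampled node to the next. The names and structure here are my own simplification, not SparseAssembler's actual data structure:

```python
from collections import defaultdict

def sparse_kmer_graph(reads, k, g):
    """Keep roughly every g-th k-mer of each read as a node; each link
    records the g bases that bridge a node to its successor, so the
    intervening k-mers never need to be stored."""
    links = defaultdict(set)
    for read in reads:
        for i in range(0, len(read) - k - g + 1, g):
            node = read[i:i + k]
            bridge = read[i + k:i + k + g]
            links[node].add(bridge)
    return links
```

On a repetitive toy read, the sparse graph stores two nodes where a full de Bruijn graph would store four distinct k-mers; the memory saving grows with g on real data.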
Affiliation(s)
- Chengxi Ye
- Ecology & Evolution of Plant-Animal Interaction Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China.
40
Yang X, Chockalingam SP, Aluru S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform 2012; 14:56-66. [DOI: 10.1093/bib/bbs015] [Citation(s) in RCA: 177] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
41
Pellin D, Miotto P, Ambrosi A, Cirillo DM, Di Serio C. A genome-wide identification analysis of small regulatory RNAs in Mycobacterium tuberculosis by RNA-Seq and conservation analysis. PLoS One 2012; 7:e32723. [PMID: 22470422 PMCID: PMC3314655 DOI: 10.1371/journal.pone.0032723] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2011] [Accepted: 02/03/2012] [Indexed: 12/29/2022] Open
Abstract
We propose a new method for small RNA (sRNA) identification. First, we build an effective target genome (ETG) by means of a strand-specific procedure. Then we propose a new bioinformatic pipeline based mainly on the combination of two types of information: the first provides an expression map based on RNA-seq data (Reads Map) and the second applies principles of comparative genomics, leading to a Conservation Map. By superimposing these two maps, a robust method for the search of sRNAs is obtained. We apply this methodology to investigate sRNAs in Mycobacterium tuberculosis H37Rv. This bioinformatic procedure leads to a total list of 1948 candidate sRNAs. The size of the candidate list is strictly related to the aim of the study and to the technology used during the verification process. We provide performance measures of the algorithm in identifying annotated sRNAs reported in three recent published studies.
Affiliation(s)
- Danilo Pellin
- University Centre for Statistics in the Biomedical Sciences, Università Vita-Salute San Raffaele, Milan, Italy
- Paolo Miotto
- Emerging Bacterial Pathogens Unit, San Raffaele Scientific Institute, Milan, Italy
- Alessandro Ambrosi
- University Centre for Statistics in the Biomedical Sciences, Università Vita-Salute San Raffaele, Milan, Italy
- Clelia Di Serio
- University Centre for Statistics in the Biomedical Sciences, Università Vita-Salute San Raffaele, Milan, Italy
42
Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genomics 2012; 13:14. [PMID: 22233127 PMCID: PMC3322347 DOI: 10.1186/1471-2164-13-14] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2011] [Accepted: 01/10/2012] [Indexed: 12/11/2022] Open
Abstract
Background: Ongoing technological advances in genome sequencing are allowing bacterial genomes to be sequenced at ever-lower cost. However, nearly all of these new techniques concomitantly decrease genome quality, primarily due to the inability of their relatively short read lengths to bridge certain genomic regions, e.g., those containing repeats. Fragmentation of predicted open reading frames (ORFs) is one possible consequence of this decreased quality. In this study we quantify ORF fragmentation in draft microbial genomes and its effect on annotation efficacy, and we propose a solution to ameliorate this problem. Results: A survey of draft-quality genomes in GenBank revealed that fragmented ORFs comprised >80% of the predicted ORFs in some genomes, and that increased fragmentation correlated with decreased genome assembly quality. In a more thorough analysis of 25 Streptomyces genomes, fragmentation was especially enriched in some protein classes with repeating, multi-modular structures such as polyketide synthases, non-ribosomal peptide synthetases and serine/threonine kinases. Overall, increased genome fragmentation correlated with increased false-negative Pfam and COG annotation rates and increased false-positive KEGG annotation rates. The false-positive KEGG annotation rate could be ameliorated by linking fragmented ORFs using their orthologs in related genomes. Whereas this strategy successfully linked up to 46% of the total ORF fragments in some genomes, its sensitivity appeared to depend heavily on the depth of sampling of a particular taxon's variable genome. Conclusions: Draft microbial genomes contain many ORF fragments. Where these correspond to the same gene they have particular potential to confound comparative gene content analyses. Given our findings, and the rapid increase in the number of microbial draft quality genomes, we suggest that accounting for gene fragmentation and its associated biases is important when designing comparative genomic projects.
43
Chapman JA, Ho I, Sunkara S, Luo S, Schroth GP, Rokhsar DS. Meraculous: de novo genome assembly with short paired-end reads. PLoS One 2011; 6:e23501. [PMID: 21876754 PMCID: PMC3158087 DOI: 10.1371/journal.pone.0023501] [Citation(s) in RCA: 125] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2010] [Accepted: 07/19/2011] [Indexed: 11/18/2022] Open
Abstract
We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
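The conservative traversal at the heart of this approach, extending a contig only through k-mers with a unique observed extension and stopping at any branch, can be sketched in a few lines. This is a toy of my own, without the quality filtering, both-strand handling, and memory-efficient hashing of the real assembler:

```python
from collections import defaultdict

def unique_extensions(reads, k):
    """For every k-mer seen in the reads, record the set of bases
    observed immediately after it."""
    ext = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            ext[read[i:i + k]].add(read[i + k])
    return ext

def conservative_walk(seed, ext, limit=10_000):
    """Extend the contig one base at a time, but only while the terminal
    k-mer has exactly one observed extension; stop at any branch or
    dead end rather than guessing."""
    k = len(seed)
    contig = seed
    while len(contig) < limit:
        nxt = ext.get(contig[-k:], set())
        if len(nxt) != 1:
            break  # ambiguous or unseen: stop, no error correction
        contig += next(iter(nxt))
    return contig
```

When the reads agree, the walk reproduces the underlying sequence; the moment two reads disagree on the next base, the walk halts, which is exactly the conservative behavior that lets such an assembler skip explicit error correction.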
Affiliation(s)
- Jarrod A Chapman
- U.S. Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America.
44
Genome sequence and analysis of the tuber crop potato. Nature 2011; 475:189-95. [PMID: 21743474 DOI: 10.1038/nature10158] [Citation(s) in RCA: 1198] [Impact Index Per Article: 92.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2011] [Accepted: 05/03/2011] [Indexed: 02/03/2023]
Abstract
Potato (Solanum tuberosum L.) is the world's most important non-grain food crop and is central to global food security. It is clonally propagated, highly heterozygous, autotetraploid, and suffers acute inbreeding depression. Here we use a homozygous doubled-monoploid potato clone to sequence and assemble 86% of the 844-megabase genome. We predict 39,031 protein-coding genes and present evidence for at least two genome duplication events indicative of a palaeopolyploid origin. As the first genome sequence of an asterid, the potato genome reveals 2,642 genes specific to this large angiosperm clade. We also sequenced a heterozygous diploid clone and show that gene presence/absence variants and other potentially deleterious mutations occur frequently and are a likely cause of inbreeding depression. Gene family expansion, tissue-specific expression and recruitment of genes to new pathways contributed to the evolution of tuber development. The potato genome sequence provides a platform for genetic improvement of this vital crop.
45
Kao WC, Chan AH, Song YS. ECHO: a reference-free short-read error correction algorithm. Genome Res 2011; 21:1181-92. [PMID: 21482625 PMCID: PMC3129260 DOI: 10.1101/gr.111351.110] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2010] [Accepted: 04/06/2011] [Indexed: 01/26/2023]
Abstract
Developing accurate, scalable algorithms to improve data quality is an important computational challenge associated with recent advances in high-throughput sequencing technology. In this study, a novel error-correction algorithm, called ECHO, is introduced for correcting base-call errors in short-reads, without the need of a reference genome. Unlike most previous methods, ECHO does not require the user to specify parameters of which optimal values are typically unknown a priori. ECHO automatically sets the parameters in the assumed model and estimates error characteristics specific to each sequencing run, while maintaining a running time that is within the range of practical use. ECHO is based on a probabilistic model and is able to assign a quality score to each corrected base. Furthermore, it explicitly models heterozygosity in diploid genomes and provides a reference-free method for detecting bases that originated from heterozygous sites. On both real and simulated data, ECHO is able to improve the accuracy of previous error-correction methods by several folds to an order of magnitude, depending on the sequence coverage depth and the position in the read. The improvement is most pronounced toward the end of the read, where previous methods become noticeably less effective. Using a whole-genome yeast data set, it is demonstrated here that ECHO is capable of coping with nonuniform coverage. Also, it is shown that using ECHO to perform error correction as a preprocessing step considerably facilitates de novo assembly, particularly in the case of low-to-moderate sequence coverage depth.
Affiliation(s)
- Wei-Chun Kao
- Computer Science Division, University of California, Berkeley, California 94721, USA
- Andrew H. Chan
- Computer Science Division, University of California, Berkeley, California 94721, USA
- Yun S. Song
- Computer Science Division, University of California, Berkeley, California 94721, USA
- Department of Statistics, University of California, Berkeley, California 94721, USA
46
Salmela L, Schröder J. Correcting errors in short reads by multiple alignments. Bioinformatics 2011; 27:1455-61. [DOI: 10.1093/bioinformatics/btr170] [Citation(s) in RCA: 123] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
47
Abstract
Background: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of k-mers in reads and validating those with frequencies exceeding a threshold. In case of genomes with high repeat content, an erroneous k-mer may be frequently observed if it has few nucleotide differences with valid k-mers with multiple occurrences in the genome. Error detection and correction were mostly applied to genomes with low repeat content, and this remains a challenging problem for genomes with high repeat content. Results: We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of k-mers from their observed frequencies by analyzing the misread relationships among observed k-mers. We also propose a method to estimate the threshold useful for validating k-mers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. Availability: The software is implemented in C++ and is freely available under the GNU GPL3 license and the Boost Software V1.0 license at http://aluru-sun.ece.iastate.edu/doku.php?id=redeem. Conclusions: We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
Affiliation(s)
- Xiao Yang
- Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50011, USA.
48
Zhao Z, Yin J, Zhan Y, Xiong W, Li Y, Liu F. PSAEC: An Improved Algorithm for Short Read Error Correction Using Partial Suffix Arrays. FRONTIERS IN ALGORITHMICS AND ALGORITHMIC ASPECTS IN INFORMATION AND MANAGEMENT 2011. [DOI: 10.1007/978-3-642-21204-8_25] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
49
Zhao Z, Yin J, Li Y, Xiong W, Zhan Y. An Efficient Hybrid Approach to Correcting Errors in Short Reads. LECTURE NOTES IN COMPUTER SCIENCE 2011. [DOI: 10.1007/978-3-642-22589-5_19] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
50
Koehler R, Issac H, Cloonan N, Grimmond SM. The uniqueome: a mappability resource for short-tag sequencing. Bioinformatics 2011; 27:272-4. [PMID: 21075741 PMCID: PMC3018812 DOI: 10.1093/bioinformatics/btq640] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]
Abstract
Summary: Quantification applications of short-tag sequencing data (such as CNVseq and RNAseq) depend on knowing the uniqueness of specific genomic regions at a given threshold of error. Here, we present the ‘uniqueome’, a genomic resource for understanding the uniquely mappable proportion of genomic sequences. Pre-computed data are available for human, mouse, fly and worm genomes in both color-space and nucleotide-space, and we demonstrate the utility of this resource as applied to the quantification of RNAseq data. Availability: Files, scripts and supplementary data are available from http://grimmond.imb.uq.edu.au/uniqueome/; the ISAS uniqueome aligner is freely available from http://www.imagenix.com/. Contact: n.cloonan@uq.edu.au. Supplementary information: Supplementary data are available at Bioinformatics online.
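Per-position mappability of the kind such a resource tabulates can be approximated naively: mark a position unique if the k-mer starting there occurs exactly once in the genome. This sketch (my own naming) ignores the reverse strand and the mismatch thresholds a real mappability resource accounts for:

```python
from collections import Counter

def mappability_track(genome, k):
    """Return a per-position track: 1 where the k-mer starting at that
    position occurs exactly once in the genome (uniquely mappable),
    0 elsewhere. Forward strand and exact matches only."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return [int(counts[genome[i:i + k]] == 1)
            for i in range(len(genome) - k + 1)]
```

Summing the track and dividing by its length gives the uniquely mappable proportion of the sequence at tag length k, the quantity needed to normalize coverage-based measurements over repetitive regions.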