1
Liu M, Xu N, Chen B, Zhang Z, Chen X, Zhu Y, Hong W, Wang T, Zhang Q, Ye Y, Lu T, Qian H. Effects of different assembly strategies on gene annotation in activated sludge. Environmental Research 2024; 252:119116. [PMID: 38734289 DOI: 10.1016/j.envres.2024.119116]
Abstract
Activated sludge comprises diverse bacteria, fungi, and other microorganisms, featuring a rich repertoire of genes involved in antibiotic resistance, pollutant degradation, and elemental cycling. In this regard, hybrid assembly technology can revolutionize metagenomics by detecting greater gene diversity in environmental samples. Nonetheless, the optimal utilization and comparability of genomic information between hybrid assembly and short- or long-read technology remain unclear. To address this gap, we compared the performance of the hybrid assembly, short- and long-read technologies, abundance and diversity of annotated genes, and taxonomic diversity by analysing 46, 161, and 45 activated sludge metagenomic datasets, respectively. The results revealed that hybrid assembly technology exhibited the best performance, generating the most contiguous and longest contigs but with a lower proportion of high-quality metagenome-assembled genomes than short-read technology. Compared with short- or long-read technologies, hybrid assembly technology can detect a greater diversity of microbiota and antibiotic resistance genes, as well as a wider range of potential hosts. However, this approach may yield lower gene abundance and pathogen detection. Our study revealed the specific advantages and disadvantages of hybrid assembly and short- and long-read applications in wastewater treatment plants, and our approach could serve as a blueprint to be extended to terrestrial environments.
Affiliation(s)
- Meng Liu
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Nuohan Xu
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Bingfeng Chen
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Zhenyan Zhang
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Xinyu Chen
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Yuke Zhu
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Wenjie Hong
- Key Laboratory of Microbial Technology and Bioinformatics of Zhejiang Province, Hangzhou, 310012, PR China
- Tingzhang Wang
- Key Laboratory of Microbial Technology and Bioinformatics of Zhejiang Province, Hangzhou, 310012, PR China
- Qi Zhang
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Yangqing Ye
- College of Mechanical Engineering, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Tao Lu
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
- Haifeng Qian
- College of Environment, Zhejiang University of Technology, Hangzhou, 310032, PR China
2
Díaz-Domínguez D, Leinonen M, Salmela L. Space-efficient computation of k-mer dictionaries for large values of k. Algorithms Mol Biol 2024; 19:14. [PMID: 38581000 PMCID: PMC10996146 DOI: 10.1186/s13015-024-00259-1]
Abstract
Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task, but they are often optimised for small k, as a hash table keeping keys explicitly (i.e., k-mer sequences) takes O(Nk/w) computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for large values of k. This space usage is an important limitation, as analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using O(N + uk/w) words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by k-1 symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining k-1 symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes O(σk) time in the worst case, σ being the size of the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while keeping competitive performance when retrieving the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.
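The overlap-pointer idea described in this abstract can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: a plain dict stands in for Kaarme's hash table, entry fields are invented names, and the dict keys on full k-mer strings only to make duplicate detection obvious (exactly the explicit storage a real implementation avoids):

```python
K = 3  # toy k-mer length

def add_read(read, index, entries, k=K):
    """Insert every k-mer of `read`, chaining each entry to its predecessor."""
    prev = None  # entry id of the overlapping previous k-mer, if any
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in index:
            index[kmer]["count"] += 1
        else:
            entry = {"id": len(entries), "sym": kmer[-1], "prev": prev, "count": 1}
            if prev is None:          # chain head: keep the first k-1 symbols
                entry["head"] = kmer[:-1]
            entries.append(entry)
            index[kmer] = entry
        prev = index[kmer]["id"]

def decode(entry, entries, k=K):
    """Recover a k-mer by walking back through at most k-1 pointers."""
    syms = []
    e = entry
    while True:
        syms.append(e["sym"])
        if len(syms) == k:
            return "".join(reversed(syms))
        if e["prev"] is None:         # reached a chain head early:
            need = k - len(syms)      # take what is missing from its prefix
            return e["head"][-need:] + "".join(reversed(syms))
        e = entries[e["prev"]]

index, entries = {}, []
add_read("ABCDE", index, entries)
add_read("XABCY", index, entries)
```

Each non-head entry stores only one symbol plus a pointer, which is where the space saving over explicit k-mer keys comes from.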
Affiliation(s)
- Diego Díaz-Domínguez
- Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland
- Miika Leinonen
- Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland
- Leena Salmela
- Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland
3
Lopes F, Rossini M, Losacco F, Montanaro G, Gunter N, Tarasov S. Metagenomics reveals that dung beetles (Coleoptera: Scarabaeinae) broadly feed on reptile dung. Did they also feed on that of dinosaurs? Front Ecol Evol 2023. [DOI: 10.3389/fevo.2023.1132729]
Abstract
The origin of the dung-feeding habits in dung beetles (Coleoptera: Scarabaeinae) is debatable. According to traditional views, the evolution of dung beetles (Coleoptera: Scarabaeinae) and their feeding habits are largely attributed to mammal dung. In this paper, we challenge this view and provide evidence that many dung beetle communities are actually attracted to the dung of reptiles and birds (= Sauropsida). In turn, this indicates that sauropsid dung may have played a crucial evolutionary role that was previously underestimated. We argue that it is physiologically realistic to consider that coprophagy in dung beetles could have evolved during the Cretaceous in response to the dung produced by dinosaurs. Furthermore, we demonstrate that sauropsid dung may be one of the major factors driving the emergence of insular dung beetle communities across the globe. We support our findings with amplicon-metagenomic analyses, field observations, and meta-analysis of the published literature.
4
Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016]
Abstract
Next-generation sequencing has produced enormous amounts of short-read sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both de novo assembly and error correction methods exploit the overlaps between reads, a natural concern is whether the sequencing errors that degrade genome assemblies also hurt the compression of NGS data. This work addresses two problems: whether current error correction algorithms can enable compression algorithms to make the sequence data much more compact, and whether the reads modified by error-correction algorithms lead to quality improvements in de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in a collection of multiple read sets and then compress them simultaneously, for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in reference-free compression and hence greatly improve compression performance. Extensive tests on practical collections of multiple short-read sets confirm that compression performance on the error-corrected data (with unchanged size) significantly outperforms that on the original data, and that the file-reordering idea contributes further gains. The error correction of the original reads also improved the quality of the genome assemblies, sometimes remarkably. However, how to combine appropriate error correction methods with an assembly algorithm so that assembly performance is always significantly improved remains an open question.
Affiliation(s)
- Tao Tang
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
- School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China
- Gyorgy Hutvagner
- School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
- Wenjian Wang
- School of Computer and Information Technology, Shanxi University, Shanxi Road, 030006, Shanxi, China
- Jinyan Li
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
5
Eschenbrenner CJ, Feurtey A, Stukenbrock EH. Population Genomics of Fungal Plant Pathogens and the Analyses of Rapidly Evolving Genome Compartments. Methods Mol Biol 2021; 2090:337-355. [PMID: 31975174 DOI: 10.1007/978-1-0716-0199-0_14]
Abstract
Genome sequencing of fungal pathogens has documented extensive variation in genome structure and composition between species and, in many cases, between individuals of the same species. This type of genomic variation can be adaptive, allowing pathogens to rapidly evolve new virulence phenotypes. Analyses of genome-wide variation in fungal pathogen genomes rely on high-quality assemblies and on methods to detect and quantify structural variation. Population genomic studies in fungi have addressed the underlying mechanisms whereby structural variation can be rapidly generated. Transposable elements, high mutation and recombination rates, as well as incorrect chromosome segregation during mitosis and meiosis, contribute to the extensive variation observed in many species. Here we summarize key findings in the field of fungal pathogen genomics and discuss methods to detect and characterize structural variants, including an alignment-based pipeline to study variation in population genomic data.
Affiliation(s)
- Christoph J Eschenbrenner
- Environmental Genomics, Christian-Albrechts University of Kiel, Kiel, Germany
- Max Planck Institute for Evolutionary Biology, Plön, Germany
- Alice Feurtey
- Environmental Genomics, Christian-Albrechts University of Kiel, Kiel, Germany
- Max Planck Institute for Evolutionary Biology, Plön, Germany
- Eva H Stukenbrock
- Environmental Genomics, Christian-Albrechts University of Kiel, Kiel, Germany
- Max Planck Institute for Evolutionary Biology, Plön, Germany
6
Firtina C, Kim JS, Alser M, Senol Cali D, Cicek AE, Alkan C, Mutlu O. Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm. Bioinformatics 2020; 36:3669-3679. [PMID: 32167530 DOI: 10.1093/bioinformatics/btaa179]
Abstract
MOTIVATION Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology dependency and assembly-size dependency force researchers to (i) run multiple polishing algorithms and (ii) split a large genome into small chunks, in order to use all available read sets and to polish large genomes. RESULTS We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignments to train the pHMM with the Forward-Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. AVAILABILITY AND IMPLEMENTATION Source code is available at https://github.com/CMU-SAFARI/Apollo. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
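Apollo's decoding step is the standard Viterbi algorithm. As a point of reference, a generic textbook Viterbi decoder looks like the sketch below; the two-state HMM in the demo is invented purely for illustration and is unrelated to Apollo's pHMM structure:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for an observation sequence."""
    # V[t][s] = (best probability of a path ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max((V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                             for p in states)
            V[t][s] = (prob, prev)
    last = max(states, key=lambda s: V[-1][s][0])   # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):            # backtrack predecessors
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Invented two-state demo: state "A" mostly emits "x", state "B" mostly "y".
states = ("A", "B")
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
path = viterbi("xxyy", states, start, trans, emit)  # -> ["A", "A", "B", "B"]
```

In Apollo's setting the hidden states are positions and edit operations of the pHMM rather than this toy's two labels, but the dynamic-programming recurrence is the same.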
Affiliation(s)
- Can Firtina
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- Jeremie S Kim
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Mohammed Alser
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- Damla Senol Cali
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- A Ercument Cicek
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Onur Mutlu
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
7
New insights on Pseudoalteromonas haloplanktis TAC125 genome organization and benchmarks of genome assembly applications using next and third generation sequencing technologies. Sci Rep 2019; 9:16444. [PMID: 31712730 PMCID: PMC6848147 DOI: 10.1038/s41598-019-52832-z]
Abstract
Pseudoalteromonas haloplanktis TAC125 is among the most commonly studied bacteria adapted to cold environments. Aside from its ecological relevance, P. haloplanktis has potential for biotechnological applications. Given its importance, we took advantage of next-generation sequencing (Illumina) and third-generation sequencing (PacBio and Oxford Nanopore) technologies to resequence its genome. The availability of a reference genome, obtained using whole genome shotgun sequencing, allowed us to study and compare the results obtained by the different technologies and draw useful conclusions for future de novo genome assembly projects. We found that assembly polishing using Illumina reads is needed to achieve a consensus accuracy over 99.9% with Oxford Nanopore sequencing, but not with PacBio sequencing. However, the dependency of consensus accuracy on coverage is lower for Oxford Nanopore than for PacBio, suggesting that a cost-effective solution might be the use of low-coverage Oxford Nanopore sequencing together with Illumina reads. Despite the differences in consensus accuracy, all sequencing technologies revealed the presence of a large plasmid, pMEGA, which had remained undiscovered until now. Among the most interesting features of pMEGA is the presence of a putative error-prone polymerase regulated through the SOS response. Aside from the characterization of the newly discovered plasmid, we confirmed the sequence of the small plasmid pMtBL and uncovered the presence of a potential partitioning system. Crucially, this study shows that the combination of next- and third-generation sequencing technologies gives us an unprecedented opportunity to characterize our bacterial model organisms at a very detailed level.
8
Firtina C, Bar-Joseph Z, Alkan C, Cicek AE. Hercules: a profile HMM-based hybrid error correction algorithm for long reads. Nucleic Acids Res 2019; 46:e125. [PMID: 30124947 PMCID: PMC6265270 DOI: 10.1093/nar/gky724]
Abstract
Choosing whether to use second- or third-generation sequencing platforms can lead to trade-offs between accuracy and read length. Several types of studies require long and accurate reads. In such cases, researchers often combine both technologies, and the erroneous long reads are corrected using the short reads. Current approaches rely on various graph- or alignment-based techniques and do not take the error profile of the underlying technology into account. Efficient machine learning algorithms that address these shortcomings have the potential to achieve more accurate integration of these two technologies. We propose Hercules, the first machine learning-based long read error correction algorithm. Hercules models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read to correct errors in these reads. We show on two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) that Hercules-corrected reads have the highest mapping rate among all competing algorithms and the highest accuracy when the breadth of coverage is high. On a large human CHM1 cell line WGS data set, Hercules is one of the few scalable algorithms, and among those it achieves the highest accuracy.
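Hercules trains its per-read pHMMs with the Forward-Backward procedure. The forward pass of that procedure, on a generic discrete HMM with toy parameters invented for this sketch, can be written as:

```python
from itertools import product

def forward(obs, states, start_p, trans_p, emit_p):
    """Total likelihood of an observation sequence under a discrete HMM."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        # alpha[s] = P(observations so far, current state = s)
        alpha = {s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# Toy parameters (invented). A proper HMM assigns total probability one to
# the set of all sequences of a fixed length, a handy sanity check.
states = ("M", "E")
start = {"M": 0.7, "E": 0.3}
trans = {"M": {"M": 0.9, "E": 0.1}, "E": {"M": 0.5, "E": 0.5}}
emit = {"M": {"x": 0.95, "y": 0.05}, "E": {"x": 0.2, "y": 0.8}}
total = sum(forward("".join(o), states, start, trans, emit)
            for o in product("xy", repeat=3))  # -> 1.0 (up to rounding)
```

The backward pass is symmetric, and combining the two gives the posterior transition/emission estimates that a Forward-Backward trainer updates.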
Affiliation(s)
- Can Firtina
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Ziv Bar-Joseph
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- A Ercument Cicek
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
9
GAAP: A Genome Assembly + Annotation Pipeline. BioMed Research International 2019; 2019:4767354. [PMID: 31346518 PMCID: PMC6617929 DOI: 10.1155/2019/4767354]
Abstract
Genomic analysis begins with de novo assembly of short-read fragments in order to reconstruct full-length base sequences without exploiting a reference genome sequence. Then, in the annotation step, gene locations are identified within the base sequences, and the structures and functions of these genes are determined. Recently, a wide range of powerful tools have been developed and published for whole-genome analysis, enabling even individual researchers in small laboratories to perform whole-genome analyses on their objects of interest. However, these analytical tools are generally complex and use diverse algorithms, parameter-setting methods, and input formats; thus, it remains difficult for individual researchers to select, utilize, and combine these tools to obtain their final results. To resolve these issues, we have developed a genome analysis pipeline (GAAP) for semiautomated, iterative, and high-throughput analysis of whole-genome data. This pipeline is designed to perform read correction, de novo genome (transcriptome) assembly, gene prediction, and functional annotation using a range of proven tools and databases. We aim to assist non-IT researchers by describing each stage of analysis in detail and discussing current approaches. We also provide practical advice on how to access and use the bioinformatics tools and databases, and how to implement the provided suggestions. Whole-genome analysis of Toxocara canis is used as a case study to show intermediate results at each stage, demonstrating the practicality of the proposed method.
10
Dal E, Alkan C. Evaluation of genome scaffolding tools using pooled clone sequencing. Turk J Biol 2019; 42:471-476. [PMID: 30983868 PMCID: PMC6451843 DOI: 10.3906/biy-1805-42]
Abstract
DNA sequencing technologies hold great promise in generating information that will guide scientists to understand how the genome affects human health and organismal evolution. The process of generating raw genome sequence data is becoming cheaper and faster, but also more error-prone. Assembly of such data into high-quality finished genome sequences remains challenging. Many genome assembly tools are available, but they differ in terms of their performance and their final output. More importantly, it remains largely unclear how best to assess the quality of assembled genome sequences. Here we evaluate the accuracies of several genome scaffolding algorithms using two different types of data generated from the genome of the same human individual: whole genome shotgun sequencing (WGS) and pooled clone sequencing (PCS). We observe that it is possible to obtain better assemblies if PCS data are used, compared to using only WGS data. However, current scaffolding algorithms were developed only for WGS, and PCS-aware scaffolding remains an open problem.
Affiliation(s)
- Elif Dal
- Department of Computer Engineering, Faculty of Engineering, Bilkent University, Ankara, Turkey
- Can Alkan
- Department of Computer Engineering, Faculty of Engineering, Bilkent University, Ankara, Turkey
11
Gias E, Brosnahan CL, Orr D, Binney B, Ha HJ, Preece MA, Jones B. In vivo growth and genomic characterization of rickettsia-like organisms isolated from farmed Chinook salmon (Oncorhynchus tshawytscha) in New Zealand. Journal of Fish Diseases 2018; 41:1235-1245. [PMID: 29806079 DOI: 10.1111/jfd.12817]
Abstract
A rickettsia-like organism, designated NZ-RLO2, was isolated from Chinook salmon (Oncorhynchus tshawytscha) farmed in the South Island, New Zealand. In vivo growth showed that NZ-RLO2 was able to grow in the CHSE-214, EPC, BHK-21, C6/36 and Sf21 cell lines, while Piscirickettsia salmonis LF-89T grew in all but BHK-21 and Sf21. NZ-RLO2 grew optimally in EPC at 15°C, and in CHSE-214 and EPC at 18°C. The growth of LF-89T was optimal at 15°C, 18°C and 22°C in CHSE-214, but appeared less efficient in EPC cells at all temperatures. Pan-genome comparison of predicted proteomes showed that the available Chilean strains of P. salmonis grouped into two clusters (p-value = 94%). NZ-RLO2 was genetically different from the previously described NZ-RLO1; both strains grouped separately from the Chilean strains in one of the two clusters (p-value = 88%) but were closely related to each other. TaqMan and SYBR Green real-time PCR assays targeting RNA polymerase (rpoB) and DNA primase (dnaG), respectively, were developed to detect NZ-RLO2. This study indicates that the New Zealand strains show a closer genetic relationship to one of the Chilean P. salmonis clusters; however, more Piscirickettsia genomes from wider geographical regions and diverse hosts are needed to better understand the classification within this genus.
Affiliation(s)
- E Gias
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- C L Brosnahan
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- D Orr
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- B Binney
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- H J Ha
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- M A Preece
- New Zealand King Salmon, Picton, New Zealand
- B Jones
- Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt, New Zealand
- Murdoch University School of Veterinary and Life Sciences, Perth, WA, Australia
12
Hughes JA, Houghten S, Ashlock D. Restarting and recentering genetic algorithm variations for DNA fragment assembly: The necessity of a multi-strategy approach. Biosystems 2016; 150:35-45. [PMID: 27521768 DOI: 10.1016/j.biosystems.2016.08.001]
Abstract
DNA fragment assembly, an NP-hard problem, is one of the major steps in DNA sequencing. Multiple strategies have been used for this problem, including greedy graph-based algorithms, de Bruijn graphs, and the overlap-layout-consensus approach. This study focuses on the overlap-layout-consensus approach. Heuristics and computational intelligence methods are combined to exploit their respective benefits. These algorithm combinations were able to produce high-quality results, surpassing the best results obtained by a number of competitive algorithms specially designed and tuned for this problem on thirteen of sixteen popular benchmarks. This work also reinforces the necessity of using multiple search strategies, as it is clearly observed that algorithm performance depends on the problem instance; without a deeper look into many searches, top solutions could be missed entirely.
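The overlap-layout-consensus idea the study builds on can be illustrated with a minimal greedy merge. This is a naive quadratic sketch with invented helper names; the paper's genetic-algorithm machinery, restarts and recentering are deliberately omitted:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(frags):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(frags)
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = frags[i] + frags[j][n:]  # layout: glue b after a
        frags = [f for t, f in enumerate(frags) if t not in (i, j)] + [merged]
    return frags[0]
```

Greedy merging can get trapped by repeats, which is exactly why the paper explores multiple search strategies instead of a single heuristic.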
Affiliation(s)
- James Alexander Hughes
- Computer Science Department, Brock University, 500 Glenridge Ave., St. Catharines, Ontario L2S 3A1, Canada
- Sheridan Houghten
- Computer Science Department, Brock University, 500 Glenridge Ave., St. Catharines, Ontario L2S 3A1, Canada
- Daniel Ashlock
- Department of Mathematics and Statistics, University of Guelph, 50 Stone Rd. E, Guelph, Ontario N1G 2W1, Canada
13
Butorac A, Mekić MS, Hozić A, Diminić J, Gamberger D, Nišavić M, Cindrić M. Benefits of selective peptide derivatization with sulfonating reagent at acidic pH for facile matrix-assisted laser desorption/ionization de novo sequencing. Rapid Communications in Mass Spectrometry 2016; 30:1687-1694. [PMID: 28328037 DOI: 10.1002/rcm.7594]
Affiliation(s)
- Meliha Solak Mekić
- University Hospital Center Sestre Milosrdnice, University Hospital for Tumors, Center for Malignant Disease, Zagreb, Croatia
- Amela Hozić
- Centre for Proteomics and Mass Spectrometry and Laboratory for Information Systems, Ruđer Bošković Institute, Zagreb, Croatia
- Janko Diminić
- Faculty of Food Technology and Biotechnology, University of Zagreb, Zagreb, Croatia
- Dragan Gamberger
- Centre for Proteomics and Mass Spectrometry and Laboratory for Information Systems, Ruđer Bošković Institute, Zagreb, Croatia
- Marija Nišavić
- Laboratory of Physical Chemistry, Vinča Institute of Nuclear Sciences, University of Belgrade, Belgrade, Serbia
- Mario Cindrić
- Centre for Proteomics and Mass Spectrometry and Laboratory for Information Systems, Ruđer Bošković Institute, Zagreb, Croatia
14
El-Metwally S, Zakaria M, Hamza T. LightAssembler: fast and memory-efficient assembly algorithm for high-throughput sequencing reads. Bioinformatics 2016; 32:3215-3223. [PMID: 27412092 DOI: 10.1093/bioinformatics/btw470]
Abstract
MOTIVATION The deluge of sequence data being generated has outpaced Moore's law, more than doubling every 2 years since next-generation sequencing (NGS) technologies were invented. Accordingly, we will be able to generate more and more data at high speed and fixed cost, but we lack the computational resources to store, process and analyze it. With error-prone high-throughput NGS reads and genomic repeats, the assembly graph contains a massive amount of redundant nodes and branching edges. Most assembly pipelines require this large graph to reside in memory to start their workflows, which is intractable for mammalian genomes. Resource-efficient genome assemblers combine both the power of advanced computing techniques and innovative data structures to encode the assembly graph efficiently in computer memory. RESULTS LightAssembler is a lightweight assembly algorithm designed to be executed on a desktop machine. It uses a pair of cache-oblivious Bloom filters, one holding a uniform sample of spaced sequenced k-mers and the other holding k-mers classified as likely correct, using a simple statistical test. LightAssembler contains a light implementation of the graph traversal and simplification modules that achieves assembly accuracy and contiguity comparable to other competing tools. Our method reduces the memory usage compared to current resource-efficient assemblers on benchmark datasets from the GAGE and Assemblathon projects. While LightAssembler can be considered a gap-based sequence assembler, different gap sizes result in an almost constant assembly size and genome coverage. AVAILABILITY AND IMPLEMENTATION https://github.com/SaraEl-Metwally/LightAssembler CONTACT sarah_almetwally4@mans.edu.eg SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
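The two-filter idea can be sketched with a toy Bloom filter. Everything here is invented for illustration (filter size, hash scheme, the spacing parameter `g`, and the absence of any real correctness test); LightAssembler's cache-oblivious filters are far more engineered:

```python
import hashlib

class Bloom:
    """Minimal Bloom filter: false positives possible, false negatives not."""
    def __init__(self, m=1024, hashes=3):
        self.m, self.h, self.bits = m, hashes, bytearray(m)

    def _positions(self, item):
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

# One filter holds a uniform sample of spaced k-mers from the reads; the
# other would hold k-mers accepted as likely correct by a statistical test.
sampled, trusted = Bloom(), Bloom()
g, k = 2, 4                        # spacing and k-mer length (arbitrary here)
read = "ACGTACGTTG"
for i in range(0, len(read) - k + 1, g):
    sampled.add(read[i:i + k])     # sample every g-th k-mer of the read
```

Because a Bloom filter stores only bit positions, both filters together stay small enough for a desktop machine, at the cost of a tunable false-positive rate.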
Affiliation(s)
- Sara El-Metwally
- Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
- Magdi Zakaria
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
- Taher Hamza
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt

15
Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novo stand-alone error correction methods for NGS data. WILEY INTERDISCIPLINARY REVIEWS: COMPUTATIONAL MOLECULAR SCIENCE 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València, Spain
- David Ruzafa
- Departamento de Química Física e Instituto de Biotecnología, Facultad de Ciencias; Universidad de Granada; Granada, Spain
- Joaquin Dopazo
- Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia, Spain
- CIBER de Enfermedades Raras (CIBERER); Valencia, Spain
- Functional Genomics Node (INB) at CIPF; Valencia, Spain
- Ignacio Blanquer
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València, Spain
- Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia, Spain

16
Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 179] [Impact Index Per Article: 22.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and distinguishing true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them, and the data structures and algorithms they use. We highlight the assumptions they make and the data types for which these hold, providing guidance on which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
17
Abstract
Background In highly parallel next-generation sequencing (NGS) techniques, millions to billions of short reads are produced from a genomic sequence in a single run. Owing to the limitations of NGS technologies, the reads may contain errors. The error rate of the reads can be reduced by trimming and by correcting the erroneous bases of the reads. Correcting the reads first yields high-quality data and greatly reduces the computational complexity of many downstream biological applications. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads. Results We have performed extensive and rigorous experiments that reveal that EC is indeed an effective, scalable, and efficient error correction tool. The real reads employed in our performance evaluation are Illumina-generated short reads of various lengths. The six experimental datasets we utilized are taken from the Sequence Read Archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes; to introduce errors, some bases of the simulated reads are changed to other bases with some probability. Conclusions Error correction is a vital problem in biology, especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate the possibility of employing the techniques introduced in this research paper to handle insertion and deletion errors as well. Software availability The implementation is freely available for non-commercial purposes and can be downloaded from: http://engr.uconn.edu/~rajasek/EC.zip.
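The substitution-correction idea summarized above, which several other tools in this list share, can be sketched in a few lines: count k-mers across all reads, call a k-mer trusted when it occurs at least t times, and repair an untrusted k-mer by trying single-base substitutions until a trusted one is found. This is a minimal illustration of the general k-mer-spectrum approach, not EC's actual algorithm; the function names and thresholds are invented for the sketch.

```python
from collections import Counter

def trusted_kmers(reads, k=15, t=3):
    """Count k-mers over all reads; keep those seen at least t times."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return {km for km, n in counts.items() if n >= t}

def correct_read(read, trusted, k=15):
    """Greedy single-substitution correction: for each untrusted k-mer,
    try every one-base change and accept the first that becomes trusted."""
    seq = list(read)
    for i in range(len(seq) - k + 1):
        if "".join(seq[i:i + k]) in trusted:
            continue
        fixed = False
        for j in range(k):
            orig = seq[i + j]
            for b in "ACGT":
                if b == orig:
                    continue
                seq[i + j] = b
                if "".join(seq[i:i + k]) in trusted:
                    fixed = True
                    break
                seq[i + j] = orig
            if fixed:
                break
    return "".join(seq)
```

In practice the threshold t must be chosen from the coverage, since rare true variants and repeated errors both sit near the trusted/untrusted boundary.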
18
Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015; 31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022] Open
19
Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol 2015; 15:509. [PMID: 25398208 PMCID: PMC4248469 DOI: 10.1186/s13059-014-0509-9] [Citation(s) in RCA: 144] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2014] [Indexed: 02/02/2023] Open
Abstract
Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
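The pair-of-Bloom-filters design described above can be illustrated with a toy implementation: a fixed-size Bloom filter gives constant memory, and deterministic hash-based subsampling keeps a fraction of k-mers roughly inverse to the sequencing depth, so that each distinct k-mer is either always or never sampled. The class below is a simplified sketch, not Lighter's actual code; the bit-array size and hash scheme are illustrative.

```python
import hashlib

class Bloom:
    """Toy Bloom filter: fixed bit array queried via h sha256-derived hashes."""
    def __init__(self, bits=1 << 16, h=3):
        self.m, self.h = bits, h
        self.a = bytearray(bits // 8)

    def _pos(self, item):
        for i in range(self.h):
            d = hashlib.sha256(f"{i}|{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item):
        for p in self._pos(item):
            self.a[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.a[p // 8] >> (p % 8) & 1 for p in self._pos(item))

def subsample(kmer, depth):
    """Keep a k-mer with probability ~1/depth, deterministically by hash,
    so the same k-mer is consistently sampled or skipped across all reads."""
    d = hashlib.sha256(kmer.encode()).digest()
    return int.from_bytes(d[:8], "big") % depth == 0
```

Because the filter size is fixed up front, memory stays constant as depth grows; only the sampling denominator changes, which is the property the abstract highlights.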
20
Abstract
BFC is a free, fast and easy-to-use sequencing error corrector designed for Illumina short reads. It uses a non-greedy algorithm but still maintains a speed comparable to implementations based on greedy methods. In evaluations on real data, BFC appears to correct more errors with fewer overcorrections in comparison to existing tools. It does particularly well in suppressing systematic sequencing errors, which helps to improve the base accuracy of de novo assemblies. AVAILABILITY AND IMPLEMENTATION https://github.com/lh3/bfc CONTACT hengli@broadinstitute.org SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Heng Li
- Medical Population Genetics Program, Broad Institute, Cambridge, MA 02142, USA

21
Ashton PM, Perry N, Ellis R, Petrovska L, Wain J, Grant KA, Jenkins C, Dallman TJ. Insight into Shiga toxin genes encoded by Escherichia coli O157 from whole genome sequencing. PeerJ 2015; 3:e739. [PMID: 25737808 PMCID: PMC4338798 DOI: 10.7717/peerj.739] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2014] [Accepted: 01/05/2015] [Indexed: 11/20/2022] Open
Abstract
The ability of Shiga toxin-producing Escherichia coli (STEC) to cause severe illness in humans is determined by multiple host factors and bacterial characteristics, including Shiga toxin (Stx) subtype. Given the link between Stx2a subtype and disease severity, we sought to identify the stx subtypes present in whole genome sequences (WGS) of 444 isolates of STEC O157. Difficulties in assembling the stx genes in some strains were overcome by using two complementary bioinformatics methods: mapping and de novo assembly. All strains of STEC O157 in this study had stx1a, stx2a or stx2c or a combination of these three genes. There was over 99% (442/444) concordance between PCR and WGS. When common source strains were excluded, 236/349 strains of STEC O157 had multiple copies of different Stx subtypes and 54 had multiple copies of the same Stx subtype. Of those strains harbouring multiple copies of the same Stx subtype, 33 had variants between the alleles while 21 had identical copies. Strains harbouring Stx2a only were most commonly found to have multiple alleles of the same subtype (42%). Both the PCR and WGS approaches to stx subtyping provided a good level of sensitivity and specificity. In addition, the WGS data showed that a significant proportion of strains harbouring multiple alleles of the same Stx subtype were associated with clinical disease in England.
Affiliation(s)
- Philip M Ashton
- Gastrointestinal Bacteria Reference Unit, Public Health England, London, UK
- Neil Perry
- Gastrointestinal Bacteria Reference Unit, Public Health England, London, UK
- Richard Ellis
- Animal & Plant Health Agency, New Haw, Addlestone, Surrey, UK
- John Wain
- University of East Anglia, Norwich Research Park, Norwich, UK
- Kathie A Grant
- Gastrointestinal Bacteria Reference Unit, Public Health England, London, UK
- Claire Jenkins
- Gastrointestinal Bacteria Reference Unit, Public Health England, London, UK
- Tim J Dallman
- Gastrointestinal Bacteria Reference Unit, Public Health England, London, UK

22
Black JS, Salto-Tellez M, Mills KI, Catherwood MA. The impact of next generation sequencing technologies on haematological research – A review. Pathogenesis 2015. [DOI: 10.1016/j.pathog.2015.05.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
23
Pu D, Qi Y, Cui L, Xiao P, Lu Z. A real-time decoding sequencing based on dual mononucleotide addition for cyclic synthesis. Anal Chim Acta 2014; 852:274-83. [PMID: 25441908 DOI: 10.1016/j.aca.2014.09.009] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2014] [Revised: 08/28/2014] [Accepted: 09/08/2014] [Indexed: 11/19/2022]
Abstract
We propose a real-time decoding sequencing strategy in which a template is determined not by directly measuring the base sequence but by decoding two sets of encodings obtained from two parallel sequencing runs. This strategy relies on adding a mixture of two different bases, A+G, C+T, A+C, G+T, A+T or C+G (abbreviated as AG, CT, AC, GT, AT, or CG), to the reaction each time. When a template is cyclically interrogated twice with any two of the dual mononucleotide additions (AG/CT, AC/GT, and AT/CG), two sets of encodings are obtained sequentially. These two sets of encodings allow the bases to be decoded sequentially, from first to last, in a deterministic manner. This strategy requires fewer cycles to obtain a longer read length compared with the traditional real-time sequencing strategy. A partial rnpB gene was used to verify the applicability of the decoding strategy via pyrosequencing. The results indicated that the sequence could be reconstructed by decoding the two sets of encodings. Moreover, streptococcal strains could be differentiated by comparing the signal intensity in each cycle and the encoding size of each template. This strategy is likely to be applicable to differentiating nucleic acid sequences, as encoding size and signal intensity in each cycle vary with base size and composition. Furthermore, it has the potential to serve as an alternative to conventional sequencing systems.
Affiliation(s)
- Dan Pu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Yuhua Qi
- Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, Jiangsu 210009, China
- Lunbiao Cui
- Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, Jiangsu 210009, China
- Pengfeng Xiao
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Zuhong Lu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China

24
|
Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform 2014; 16:588-99. [PMID: 25183248 DOI: 10.1093/bib/bbu029] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2014] [Accepted: 08/02/2014] [Indexed: 11/12/2022] Open
Abstract
Next-generation sequencing technologies have revolutionized the ways in which genetic information is obtained and have opened the door for many essential applications in biomedical sciences. Hundreds of gigabytes of data are being produced, and all applications are affected by errors in the data. Many programs have been designed to correct these errors, most of them targeting data produced by the dominant technology of Illumina. We present a thorough comparison of these programs. Both HiSeq and MiSeq types of Illumina data are analyzed, and correcting performance is evaluated as the gain in depth and breadth of coverage, as given by correct reads and k-mers. Time and memory requirements, scalability and parallelism are considered as well. Practical guidelines are provided for the effective use of these tools. We also evaluate the efficiency of the current state-of-the-art programs for correcting Illumina data and provide research directions for further improvement.
25
Abstract
Motivation: PacBio single-molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and a higher error rate. Errors include numerous indels and complicate downstream analyses such as mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping short reads onto long reads provides sufficient coverage to eliminate up to 99% of errors, however, at the expense of prohibitive running times and considerable amounts of disk and memory space. Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy. Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec. Contact: lordec@lirmm.fr. Supplementary information: Supplementary data are available at Bioinformatics online.
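The core operation LoRDEC performs, bridging two solid k-mers of a long read through the de Bruijn graph built from accurate short reads, can be sketched as a bounded breadth-first search over trusted k-mers. This is a simplified illustration of the idea, not LoRDEC's C++ implementation (which uses a succinct graph representation and more elaborate path scoring); the function name and parameters are invented.

```python
from collections import deque

def bridge(left, right, trusted, max_len):
    """BFS from the solid k-mer `left` to `right` through trusted k-mers,
    extending one base at a time; returns the spelled sequence or None.
    `trusted` is the set of k-mers observed in the accurate short reads."""
    queue = deque([(left, left)])   # (current k-mer, sequence spelled so far)
    seen = {left}
    while queue:
        kmer, path = queue.popleft()
        if kmer == right:
            return path             # corrective sequence between the anchors
        if len(path) >= max_len:
            continue                # bound the search around the gap size
        for base in "ACGT":
            nxt = kmer[1:] + base   # de Bruijn edge: shift window by one base
            if nxt in trusted and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + base))
    return None
```

In a real corrector the erroneous long-read region between two solid anchors is replaced by the spelled path, and `max_len` is derived from the observed gap length.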
Affiliation(s)
- Leena Salmela
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland, and LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France
- Eric Rivals
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland, and LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France

26
Wirawan A, Harris RS, Liu Y, Schmidt B, Schröder J. HECTOR: a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. BMC Bioinformatics 2014; 15:131. [PMID: 24885381 PMCID: PMC4023493 DOI: 10.1186/1471-2105-15-131] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2013] [Accepted: 04/24/2014] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Current-generation sequencing technologies are able to produce low-cost, high-throughput reads. However, the reads produced are imperfect and may contain various sequencing errors. Although many error correction methods have been developed in recent years, none explicitly targets homopolymer-length errors in 454 sequencing reads. RESULTS We present HECTOR, a parallel multistage homopolymer-spectrum-based error corrector for 454 sequencing data. In this algorithm, we have for the first time investigated a novel homopolymer-spectrum-based approach to handle homopolymer insertions or deletions, which are the dominant sequencing errors in 454 pyrosequencing reads. We have evaluated the performance of HECTOR, in terms of correction quality, runtime and parallel scalability, using both simulated and real pyrosequencing datasets. This performance has been further compared to that of Coral, a state-of-the-art error corrector based on multiple sequence alignment, and Acacia, a recently published error corrector for amplicon pyrosequences. Our evaluations reveal that HECTOR demonstrates comparable correction quality to Coral, but runs 3.7× faster on average. In addition, HECTOR performs well even when the coverage of the dataset is low. CONCLUSION Our homopolymer-spectrum-based approach is theoretically capable of processing homopolymer-length errors of arbitrary size, with linear time complexity. HECTOR employs a multi-threaded design based on a master-slave computing model. Our experimental results show that HECTOR is a practical 454 pyrosequencing read error corrector that is competitive in terms of both correction quality and speed. The source code and all simulated data are available at: http://hector454.sourceforge.net.
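A homopolymer spectrum rests on run-length encoding: a read is compressed into (base, length) pairs, so a homopolymer-length error changes only a count while the compressed base sequence is unchanged. A minimal sketch of that encoding (the function names are illustrative, not HECTOR's API):

```python
from itertools import groupby

def homopolymer_encode(read):
    """Run-length encode a read: 'AAATTC' -> [('A', 3), ('T', 2), ('C', 1)]."""
    return [(base, len(list(run))) for base, run in groupby(read)]

def homopolymer_sequence(read):
    """The compressed base sequence, invariant under homopolymer-length
    errors: 'AAATTC' and 'AATTTC' both compress to 'ATC'."""
    return "".join(base for base, _ in groupby(read))
```

A corrector in this space can then fix indels by adjusting run lengths toward the consensus, rather than editing individual bases.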
Affiliation(s)
- Adrianto Wirawan
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz, Germany

27
Lee H, Popodi E, Foster PL, Tang H. Detection of structural variants involving repetitive regions in the reference genome. J Comput Biol 2014; 21:219-33. [PMID: 24552580 DOI: 10.1089/cmb.2013.0129] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023] Open
Abstract
Next-generation sequencing techniques are now commonly used to characterize structural variations (SVs) in population genomics and elucidate their associations with phenotypes. Many of the computational tools developed for detecting structural variations work by mapping paired-end reads to a reference genome and identifying the discordant read-pairs whose mapped loci in the reference genome deviate from the expected insert size and orientation. However, repetitive regions in the reference genome represent a major challenge in SV detection, because the paired-end reads from these regions may be mapped to multiple loci in the reference genome, resulting in spuriously discordant read-pairs. To address this issue, we have developed an algorithmic approach for read mapping and SV detection based on the framework of A-Bruijn graphs. Instead of mapping reads to a linear sequence of the reference genome, we propose to map reads onto the A-Bruijn graph constructed from the reference genome in which all instances of the same repeat are collapsed into a single edge. As a result, any given read, either from repetitive regions or not, will be mapped to a unique location in the A-Bruijn graph, and each discordant read-pair in the A-Bruijn graph indicates a potentially true SV event. We also developed a simple clustering algorithm to derive valid clusters of these discordant read-pairs, each supporting a different SV event. Finally, we demonstrate the performance of this approach, compared to existing approaches, by identifying transposition events of insertion sequence (IS) elements, a class of simple mobile genetic elements (MGEs), in E. coli by using simulated and real paired-end sequence data acquired from E. coli mutation accumulation lines.
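The discordant read-pair test that such SV callers build on can be sketched directly: estimate the library's insert-size distribution from the mapped pairs, then flag pairs whose implied insert deviates by more than a few robust standard deviations. The sketch below uses a median/MAD estimate so that the discordant pairs themselves do not skew the baseline; the cutoff and tuple layout are illustrative, not the paper's A-Bruijn implementation.

```python
from statistics import median

def discordant_pairs(pairs, cutoff=3.0):
    """pairs: list of (left_pos, right_pos) mapped coordinates of read pairs.
    Estimate the expected insert size robustly (median / MAD), then return
    indices of pairs deviating by more than `cutoff` robust standard
    deviations: candidate structural-variant signals."""
    inserts = [r - l for l, r in pairs]
    med = median(inserts)
    mad = median(abs(x - med) for x in inserts)
    sigma = 1.4826 * mad or 1.0  # MAD -> sigma for a normal; avoid zero
    return [i for i, x in enumerate(inserts) if abs(x - med) > cutoff * sigma]
```

A full caller would also check read orientation and then cluster nearby discordant pairs, each cluster supporting one SV event, as the abstract describes.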
Affiliation(s)
- Heewook Lee
- School of Informatics and Computing, Indiana University, Bloomington, Indiana

28
El-Metwally S, Ouda OM, Helmy M. Approaches and Challenges of Next-Generation Sequence Assembly Stages. NEXT GENERATION SEQUENCING TECHNOLOGIES AND CHALLENGES IN SEQUENCE ASSEMBLY 2014. [DOI: 10.1007/978-1-4939-0715-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
29
Abstract
The genomic revolution promises great advances in the search for useful biocatalysts. Function-based metagenomic approaches have identified several enzymes with properties that make them useful candidates for a variety of bioprocesses. As DNA sequencing costs continue to decline, the volume of genomic data, along with their corresponding predicted protein sequences, will continue to increase dramatically, necessitating new approaches to leverage this information for gene-based bioprospecting efforts. Additionally, as new functions are discovered and correlated with this sequence information, the knowledge of the often complex relationship between a protein's sequence and function will improve. This in turn will lead to better gene-based bioprospecting approaches and facilitate the tailoring of desired properties through protein engineering projects. In this chapter, we discuss a number of recent advances in bioprospecting within the context of the genomic age.
Affiliation(s)
- Michael A Hicks
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Kristala L J Prather
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Synthetic Biology Engineering Research Center (SynBERC), Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

30
El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013; 9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
Abstract
Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers still remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, a graph construction process, a graph simplification process, and postprocessing filtering. We discuss these stages as a framework for data analysis and processing and survey a variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that current assemblers face in the next-generation environment in order to determine the current state of the art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
Affiliation(s)
- Sara El-Metwally
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Taher Hamza
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Magdi Zakaria
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Mohamed Helmy
- Botany Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
- Biotechnology Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt

31
Predominant Acidilobus-like populations from geothermal environments in Yellowstone National Park exhibit similar metabolic potential in different hypoxic microbial communities. Appl Environ Microbiol 2013; 80:294-305. [PMID: 24162572 DOI: 10.1128/aem.02860-13] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open
Abstract
High-temperature (>70°C) ecosystems in Yellowstone National Park (YNP) provide an unparalleled opportunity to study chemotrophic archaea and their role in microbial community structure and function under highly constrained geochemical conditions. Acidilobus spp. (order Desulfurococcales) comprise one of the dominant phylotypes in hypoxic geothermal sulfur sediment and Fe(III)-oxide environments along with members of the Thermoproteales and Sulfolobales. Consequently, the primary goals of the current study were to analyze and compare replicate de novo sequence assemblies of Acidilobus-like populations from four different mildly acidic (pH 3.3 to 6.1) high-temperature (72°C to 82°C) environments and to identify metabolic pathways and/or protein-encoding genes that provide a detailed foundation of the potential functional role of these populations in situ. De novo assemblies of the highly similar Acidilobus-like populations (>99% 16S rRNA gene identity) represent near-complete consensus genomes based on an inventory of single-copy genes, deduced metabolic potential, and assembly statistics generated across sites. Functional analysis of coding sequences and confirmation of gene transcription by Acidilobus-like populations provide evidence that they are primarily chemoorganoheterotrophs, generating acetyl coenzyme A (acetyl-CoA) via the degradation of carbohydrates, lipids, and proteins, and auxotrophic with respect to several external vitamins, cofactors, and metabolites. No obvious pathways or protein-encoding genes responsible for the dissimilatory reduction of sulfur were identified. The presence of a formate dehydrogenase (Fdh) and other protein-encoding genes involved in mixed-acid fermentation supports the hypothesis that Acidilobus spp. function as degraders of complex organic constituents in high-temperature, mildly acidic, hypoxic geothermal systems.
32
Rapid quantification of sequence repeats to resolve the size, structure and contents of bacterial genomes. BMC Genomics 2013; 14:537. [PMID: 23924250 PMCID: PMC3751351 DOI: 10.1186/1471-2164-14-537] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2013] [Accepted: 08/03/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The numerous classes of repeats often impede the assembly of genome sequences from the short reads provided by new sequencing technologies. We demonstrate a simple and rapid means to ascertain the repeat structure and total size of a bacterial or archaeal genome without the need for assembly by directly analyzing the abundances of distinct k-mers among reads. RESULTS The sensitivity of this procedure to resolve variation within a bacterial species is demonstrated: genome sizes and repeat structure of five environmental strains of E. coli from short Illumina reads were estimated by this method, and total genome sizes corresponded well with those obtained for the same strains by pulsed-field gel electrophoresis. In addition, this approach was applied to read-sets for completed genomes and shown to be accurate over a wide range of microbial genome sizes. CONCLUSIONS Application of these procedures, based solely on k-mer abundances in short read data sets, allows aspects of genome structure to be resolved that are not apparent from conventional short read assemblies. This knowledge of the repetitive content of genomes provides insights into genome evolution and diversity.
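The assembly-free size estimate described above reduces to a single identity: if the histogram of k-mer counts peaks at the per-base coverage c, then genome size ≈ (total k-mer occurrences) / c. A minimal sketch of that procedure follows; it is an illustration of the principle, not the paper's pipeline, and with tiling reads edge effects make the estimate a few percent low.

```python
from collections import Counter

def genome_size_estimate(reads, k=21):
    """Estimate genome size from k-mer abundances alone (no assembly):
    total k-mer occurrences divided by the modal k-mer count, which
    approximates the per-k-mer coverage depth."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    total = sum(counts.values())
    # The most frequent abundance value is the coverage peak of the histogram.
    coverage = Counter(counts.values()).most_common(1)[0][0]
    return total // coverage
```

Repeats show up in the same histogram as secondary peaks at multiples of the coverage, which is how the repeat structure itself is resolved in the paper.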
33
MacManes MD, Eisen MB. Improving transcriptome assembly through error correction of high-throughput sequence reads. PeerJ 2013; 1:e113. [PMID: 23904992 PMCID: PMC3728768 DOI: 10.7717/peerj.113] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2013] [Accepted: 07/03/2013] [Indexed: 01/20/2023] Open
Abstract
The study of functional genomics, particularly in non-model organisms, has been dramatically improved over the last few years by the use of transcriptomes and RNAseq. While these studies are potentially extremely powerful, a computationally intensive procedure, the de novo construction of a reference transcriptome, must be completed as a prerequisite to further analyses. An accurate reference is critically important, as all downstream steps, including estimating transcript abundance, depend on it. Though a substantial amount of research has been done on assembly, only recently have the pre-assembly procedures been studied in detail. Specifically, several stand-alone error correction modules have been reported on and, while they have been shown to be effective in reducing errors at the level of sequencing reads, how error correction impacts assembly accuracy is largely unknown. Here, we show, using simulated and empirical datasets, that applying error correction to sequencing reads has significant positive effects on assembly accuracy, and should be applied to all datasets. A complete collection of commands that will allow for the production of Reptile-corrected reads is available at https://github.com/macmanes/error_correction/tree/master/scripts and as File S1.
Affiliation(s)
- Matthew D. MacManes
- California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Michael B. Eisen
- California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Howard Hughes Medical Institute, USA
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA

34
Abstract
Motivation: High-throughput next-generation sequencing technologies enable increasingly fast and affordable sequencing of genomes and transcriptomes, with a broad range of applications. The quality of the sequencing data is crucial for all applications. A significant portion of the data produced contains errors, and ever more efficient error correction programs are needed. Results: We propose RACER (Rapid and Accurate Correction of Errors in Reads), a new software program for correcting errors in sequencing data. RACER has better error-correcting performance than existing programs, is faster and requires less memory. To support our claims, we performed extensive comparison with the existing leading programs on a variety of real datasets. Availability: RACER is freely available for non-commercial use at www.csd.uwo.ca/~ilie/RACER/.
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, ON, Canada
35
Forde BM, O'Toole PW. Next-generation sequencing technologies and their impact on microbial genomics. Brief Funct Genomics 2013; 12:440-53. [PMID: 23314033 DOI: 10.1093/bfgp/els062] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
Next-generation sequencing technologies have had a dramatic impact in the field of genomic research through the provision of a low cost, high-throughput alternative to traditional capillary sequencers. These new sequencing methods have surpassed their original scope and now provide a range of utility-based applications, which allow for a more comprehensive analysis of the structure and content of microbial genomes than was previously possible. With the commercialization of a third generation of sequencing technologies imminent, we discuss the applications of current next-generation sequencing methods and explore their impact on and contribution to microbial genome research.
Affiliation(s)
- Brian M Forde
- Department of Microbiology, University College Cork, Cork, Ireland.
36
Liu Y, Schröder J, Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 2012. [PMID: 23202746 DOI: 10.1093/bioinformatics/bts690] [Citation(s) in RCA: 175] [Impact Index Per Article: 14.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
Motivation: The imperfect sequence data produced by next-generation sequencing technologies have motivated the development of a number of short-read error correctors in recent years. The majority of methods focus on the correction of substitution errors, which are the dominant error source in data produced by Illumina sequencing technology. Existing tools score high in terms of either recall or precision, but not consistently high in terms of both measures. Results: In this article, we present Musket, an efficient multistage k-mer-based corrector for Illumina short-read data. We use the k-mer spectrum approach and introduce three correction techniques in a multistage workflow: two-sided conservative correction, one-sided aggressive correction and voting-based refinement. Our performance evaluation results, in terms of correction quality and de novo genome assembly measures, reveal that Musket is consistently one of the top performing correctors. In addition, Musket is multi-threaded using a master-slave model and demonstrates superior parallel scalability compared with all other evaluated correctors as well as a highly competitive overall execution time. Availability: Musket is available at http://musket.sourceforge.net.
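The two-sided conservative correction idea that Musket builds on can be illustrated with a small sketch: count the k-mer spectrum of the read set, then change a base only when a single substitution turns every k-mer covering that position into a trusted (frequent) k-mer. This is a toy of my own, not Musket's actual multistage workflow; the `cutoff` threshold and the greedy left-to-right scan are illustrative simplifications.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count all k-mers across the read set (the k-mer spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, cutoff=2):
    """Two-sided-style correction sketch: a base is changed only if one
    substitution makes every k-mer covering that position trusted
    (count >= cutoff); otherwise it is left alone."""
    read = list(read)
    for i, base in enumerate(read):
        # k-mer windows that cover position i
        lo, hi = max(0, i - k + 1), min(len(read) - k, i)
        windows = list(range(lo, hi + 1))
        if all(counts[''.join(read[j:j + k])] >= cutoff for j in windows):
            continue  # position already covered only by trusted k-mers
        for alt in 'ACGT':
            if alt == base:
                continue
            trial = read[:]
            trial[i] = alt
            if all(counts[''.join(trial[j:j + k])] >= cutoff for j in windows):
                read[i] = alt  # unique-looking fix: accept it
                break
    return ''.join(read)
```

With five clean copies of a read and one copy carrying a single substitution, the erroneous base is the only one whose correction makes all covering k-mers trusted, so it is repaired while correct reads pass through unchanged.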
Affiliation(s)
- Yongchao Liu
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz 55099, Germany.
37
Bashir A, Klammer A, Robins WP, Chin CS, Webster D, Paxinos E, Hsu D, Ashby M, Wang S, Peluso P, Sebra R, Sorenson J, Bullard J, Yen J, Valdovino M, Mollova E, Luong K, Lin S, LaMay B, Joshi A, Rowe L, Frace M, Tarr CL, Turnsek M, Davis BM, Kasarskis A, Mekalanos JJ, Waldor MK, Schadt EE. A hybrid approach for the automated finishing of bacterial genomes. Nat Biotechnol 2012; 30:701-707. [PMID: 22750883 DOI: 10.1038/nbt.2288] [Citation(s) in RCA: 158] [Impact Index Per Article: 13.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2012] [Accepted: 05/30/2012] [Indexed: 02/08/2023]
Abstract
Advances in DNA sequencing technology have improved our ability to characterize most genomic diversity. However, accurate resolution of large structural events is challenging because of the short read lengths of second-generation technologies. Third-generation sequencing technologies, which can yield longer multikilobase reads, have the potential to address limitations associated with genome assembly. Here we combine sequencing data from second- and third-generation DNA sequencing technologies to assemble the two-chromosome genome of a recent Haitian cholera outbreak strain into two nearly finished contigs at >99.9% accuracy. Complex regions with clinically relevant structure were completely resolved. In separate control assemblies on experimental and simulated data for the canonical N16961 cholera reference strain, we obtained 14 scaffolds of greater than 1 kb for the experimental data and 8 scaffolds of greater than 1 kb for the simulated data, which allowed us to correct several errors in contigs assembled from the short-read data alone. This work provides a blueprint for the next generation of rapid microbial identification and full-genome assembly.
Affiliation(s)
- Lori Rowe
- National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
- Michael Frace
- National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
- Cheryl L Tarr
- National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
- Maryann Turnsek
- National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta GA 30333
- Brigid M Davis
- Channing Laboratory, Brigham and Women's Hospital, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- Department of Microbiology and Molecular Genetics, Harvard Medical School, Boston, MA
- Howard Hughes Medical Institute, Boston, MA
- Matthew K Waldor
- Channing Laboratory, Brigham and Women's Hospital, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- Department of Microbiology and Molecular Genetics, Harvard Medical School, Boston, MA
- Howard Hughes Medical Institute, Boston, MA
- Eric E Schadt
- Pacific Biosciences, Menlo Park, CA
- Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York City
38
Skums P, Dimitrova Z, Campo DS, Vaughan G, Rossi L, Forbi JC, Yokosawa J, Zelikovsky A, Khudyakov Y. Efficient error correction for next-generation sequencing of viral amplicons. BMC Bioinformatics 2012; 13 Suppl 10:S6. [PMID: 22759430 PMCID: PMC3382444 DOI: 10.1186/1471-2105-13-s10-s6] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023] Open
Abstract
Background: Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing. Results: In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones. Conclusions: Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm
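The frequency-threshold idea behind an approach like ET lends itself to a very small sketch: collapse identical amplicon reads into candidate haplotypes and keep only those whose frequency clears a cutoff, treating rarer variants as presumed sequencing errors. The function name and the fixed `threshold` parameter are illustrative assumptions; the published method calibrates its threshold empirically from amplicon-specific error characteristics.

```python
from collections import Counter

def frequency_filter(amplicon_reads, threshold):
    """Collapse identical amplicon reads into candidate haplotypes and
    keep those at or above the given relative frequency; everything
    rarer is presumed to be sequencing error."""
    counts = Counter(amplicon_reads)
    total = len(amplicon_reads)
    return {seq: n / total for seq, n in counts.items()
            if n / total >= threshold}
```

For example, nine identical reads plus one read with a single substitution, filtered at a 20% threshold, retain only the majority haplotype.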
Affiliation(s)
- Pavel Skums
- Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, Atlanta, GA 30333, USA.
39
Abstract
Background: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments. Methods: In this paper, we demonstrate that constructing a sparse assembly graph, which stores only a small fraction of the observed k-mers as nodes and the links between these nodes, allows the de novo assembly of even moderately sized genomes (~500 Mb) on a typical laptop computer. Results: We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.
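The sparse-graph idea can be illustrated with a toy in which only every g-th k-mer of each read becomes a node, and each link stores the g bases bridging one sampled node to the next. The names and structure here are my own simplification, not SparseAssembler's actual data structure:

```python
from collections import defaultdict

def sparse_kmer_graph(reads, k, g):
    """Keep roughly every g-th k-mer of each read as a node; each link
    records the g bases that bridge a node to its successor, so the
    intervening k-mers never need to be stored."""
    links = defaultdict(set)
    for read in reads:
        for i in range(0, len(read) - k - g + 1, g):
            node = read[i:i + k]
            bridge = read[i + k:i + k + g]
            links[node].add(bridge)
    return links
```

On a repetitive toy read, the sparse graph stores two nodes where a full de Bruijn graph would store four distinct k-mers; the memory saving grows with g on real data.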
Affiliation(s)
- Chengxi Ye
- Ecology & Evolution of Plant-Animal Interaction Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China.
40
Yang X, Chockalingam SP, Aluru S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform 2012; 14:56-66. [DOI: 10.1093/bib/bbs015] [Citation(s) in RCA: 177] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
41
Pellin D, Miotto P, Ambrosi A, Cirillo DM, Di Serio C. A genome-wide identification analysis of small regulatory RNAs in Mycobacterium tuberculosis by RNA-Seq and conservation analysis. PLoS One 2012; 7:e32723. [PMID: 22470422 PMCID: PMC3314655 DOI: 10.1371/journal.pone.0032723] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2011] [Accepted: 02/03/2012] [Indexed: 12/29/2022] Open
Abstract
We propose a new method for small RNA (sRNA) identification. First, we build an effective target genome (ETG) by means of a strand-specific procedure. Then we propose a new bioinformatic pipeline based mainly on the combination of two types of information: the first provides an expression map based on RNA-seq data (Reads Map) and the second applies principles of comparative genomics, leading to a Conservation Map. By superimposing these two maps, a robust method for the search of sRNAs is obtained. We apply this methodology to investigate sRNAs in Mycobacterium tuberculosis H37Rv. This bioinformatic procedure leads to a total list of 1948 candidate sRNAs. The size of the candidate list is strictly related to the aim of the study and to the technology used during the verification process. We provide performance measures of the algorithm in identifying annotated sRNAs reported in three recent published studies.
Affiliation(s)
- Danilo Pellin
- University Centre for Statistics in the Biomedical Sciences, Università Vita-Salute San Raffaele, Milan, Italy
- Paolo Miotto
- Emerging Bacterial Pathogens Unit, San Raffaele Scientific Institute, Milan, Italy
- Alessandro Ambrosi
- University Centre for Statistics in the Biomedical Sciences, Università Vita-Salute San Raffaele, Milan, Italy
- Clelia Di Serio
- University Centre for Statistics in the Biomedical Sciences, Università Vita-Salute San Raffaele, Milan, Italy
42
Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genomics 2012; 13:14. [PMID: 22233127 PMCID: PMC3322347 DOI: 10.1186/1471-2164-13-14] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2011] [Accepted: 01/10/2012] [Indexed: 12/11/2022] Open
Abstract
Background: Ongoing technological advances in genome sequencing are allowing bacterial genomes to be sequenced at ever-lower cost. However, nearly all of these new techniques concomitantly decrease genome quality, primarily due to the inability of their relatively short read lengths to bridge certain genomic regions, e.g., those containing repeats. Fragmentation of predicted open reading frames (ORFs) is one possible consequence of this decreased quality. In this study we quantify ORF fragmentation in draft microbial genomes and its effect on annotation efficacy, and we propose a solution to ameliorate this problem. Results: A survey of draft-quality genomes in GenBank revealed that fragmented ORFs comprised >80% of the predicted ORFs in some genomes, and that increased fragmentation correlated with decreased genome assembly quality. In a more thorough analysis of 25 Streptomyces genomes, fragmentation was especially enriched in some protein classes with repeating, multi-modular structures such as polyketide synthases, non-ribosomal peptide synthetases and serine/threonine kinases. Overall, increased genome fragmentation correlated with increased false-negative Pfam and COG annotation rates and increased false-positive KEGG annotation rates. The false-positive KEGG annotation rate could be ameliorated by linking fragmented ORFs using their orthologs in related genomes. Whereas this strategy successfully linked up to 46% of the total ORF fragments in some genomes, its sensitivity appeared to depend heavily on the depth of sampling of a particular taxon's variable genome. Conclusions: Draft microbial genomes contain many ORF fragments. Where these correspond to the same gene they have particular potential to confound comparative gene content analyses. Given our findings, and the rapid increase in the number of microbial draft quality genomes, we suggest that accounting for gene fragmentation and its associated biases is important when designing comparative genomic projects.
43
Chapman JA, Ho I, Sunkara S, Luo S, Schroth GP, Rokhsar DS. Meraculous: de novo genome assembly with short paired-end reads. PLoS One 2011; 6:e23501. [PMID: 21876754 PMCID: PMC3158087 DOI: 10.1371/journal.pone.0023501] [Citation(s) in RCA: 125] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2010] [Accepted: 07/19/2011] [Indexed: 11/18/2022] Open
Abstract
We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
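The conservative traversal at the heart of this approach, extending a contig only through k-mers with a unique observed extension and stopping at any branch, can be sketched in a few lines. This is a toy of my own, without the quality filtering, both-strand handling, and memory-efficient hashing of the real assembler:

```python
from collections import defaultdict

def unique_extensions(reads, k):
    """For every k-mer seen in the reads, record the set of bases
    observed immediately after it."""
    ext = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            ext[read[i:i + k]].add(read[i + k])
    return ext

def conservative_walk(seed, ext, limit=10_000):
    """Extend the contig one base at a time, but only while the terminal
    k-mer has exactly one observed extension; stop at any branch or
    dead end rather than guessing."""
    k = len(seed)
    contig = seed
    while len(contig) < limit:
        nxt = ext.get(contig[-k:], set())
        if len(nxt) != 1:
            break  # ambiguous or unseen: stop, no error correction
        contig += next(iter(nxt))
    return contig
```

When the reads agree, the walk reproduces the underlying sequence; the moment two reads disagree on the next base, the walk halts, which is exactly the conservative behavior that lets such an assembler skip explicit error correction.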
Affiliation(s)
- Jarrod A Chapman
- U.S. Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America.
44
Genome sequence and analysis of the tuber crop potato. Nature 2011; 475:189-95. [PMID: 21743474 DOI: 10.1038/nature10158] [Citation(s) in RCA: 1198] [Impact Index Per Article: 92.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2011] [Accepted: 05/03/2011] [Indexed: 02/03/2023]
Abstract
Potato (Solanum tuberosum L.) is the world's most important non-grain food crop and is central to global food security. It is clonally propagated, highly heterozygous, autotetraploid, and suffers acute inbreeding depression. Here we use a homozygous doubled-monoploid potato clone to sequence and assemble 86% of the 844-megabase genome. We predict 39,031 protein-coding genes and present evidence for at least two genome duplication events indicative of a palaeopolyploid origin. As the first genome sequence of an asterid, the potato genome reveals 2,642 genes specific to this large angiosperm clade. We also sequenced a heterozygous diploid clone and show that gene presence/absence variants and other potentially deleterious mutations occur frequently and are a likely cause of inbreeding depression. Gene family expansion, tissue-specific expression and recruitment of genes to new pathways contributed to the evolution of tuber development. The potato genome sequence provides a platform for genetic improvement of this vital crop.
45
Kao WC, Chan AH, Song YS. ECHO: a reference-free short-read error correction algorithm. Genome Res 2011; 21:1181-92. [PMID: 21482625 PMCID: PMC3129260 DOI: 10.1101/gr.111351.110] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2010] [Accepted: 04/06/2011] [Indexed: 01/26/2023]
Abstract
Developing accurate, scalable algorithms to improve data quality is an important computational challenge associated with recent advances in high-throughput sequencing technology. In this study, a novel error-correction algorithm, called ECHO, is introduced for correcting base-call errors in short-reads, without the need of a reference genome. Unlike most previous methods, ECHO does not require the user to specify parameters of which optimal values are typically unknown a priori. ECHO automatically sets the parameters in the assumed model and estimates error characteristics specific to each sequencing run, while maintaining a running time that is within the range of practical use. ECHO is based on a probabilistic model and is able to assign a quality score to each corrected base. Furthermore, it explicitly models heterozygosity in diploid genomes and provides a reference-free method for detecting bases that originated from heterozygous sites. On both real and simulated data, ECHO is able to improve the accuracy of previous error-correction methods by several folds to an order of magnitude, depending on the sequence coverage depth and the position in the read. The improvement is most pronounced toward the end of the read, where previous methods become noticeably less effective. Using a whole-genome yeast data set, it is demonstrated here that ECHO is capable of coping with nonuniform coverage. Also, it is shown that using ECHO to perform error correction as a preprocessing step considerably facilitates de novo assembly, particularly in the case of low-to-moderate sequence coverage depth.
Affiliation(s)
- Wei-Chun Kao
- Computer Science Division, University of California, Berkeley, California 94721, USA
- Andrew H. Chan
- Computer Science Division, University of California, Berkeley, California 94721, USA
- Yun S. Song
- Computer Science Division, University of California, Berkeley, California 94721, USA
- Department of Statistics, University of California, Berkeley, California 94721, USA
46
Salmela L, Schröder J. Correcting errors in short reads by multiple alignments. Bioinformatics 2011; 27:1455-61. [DOI: 10.1093/bioinformatics/btr170] [Citation(s) in RCA: 123] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
47
Abstract
Background: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of k-mers in reads and validating those with frequencies exceeding a threshold. In case of genomes with high repeat content, an erroneous k-mer may be frequently observed if it has few nucleotide differences with valid k-mers with multiple occurrences in the genome. Error detection and correction were mostly applied to genomes with low repeat content, and this remains a challenging problem for genomes with high repeat content. Results: We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of k-mers from their observed frequencies by analyzing the misread relationships among observed k-mers. We also propose a method to estimate the threshold useful for validating k-mers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. Availability: The software is implemented in C++ and is freely available under the GNU GPL3 license and the Boost Software V1.0 license at http://aluru-sun.ece.iastate.edu/doku.php?id=redeem. Conclusions: We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
Affiliation(s)
- Xiao Yang
- Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50011, USA.
48
Zhao Z, Yin J, Zhan Y, Xiong W, Li Y, Liu F. PSAEC: An Improved Algorithm for Short Read Error Correction Using Partial Suffix Arrays. FRONTIERS IN ALGORITHMICS AND ALGORITHMIC ASPECTS IN INFORMATION AND MANAGEMENT 2011. [DOI: 10.1007/978-3-642-21204-8_25] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
49
Zhao Z, Yin J, Li Y, Xiong W, Zhan Y. An Efficient Hybrid Approach to Correcting Errors in Short Reads. LECTURE NOTES IN COMPUTER SCIENCE 2011. [DOI: 10.1007/978-3-642-22589-5_19] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
50
Koehler R, Issac H, Cloonan N, Grimmond SM. The uniqueome: a mappability resource for short-tag sequencing. Bioinformatics 2011; 27:272-4. [PMID: 21075741 PMCID: PMC3018812 DOI: 10.1093/bioinformatics/btq640] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]
Abstract
Summary: Quantification applications of short-tag sequencing data (such as CNVseq and RNAseq) depend on knowing the uniqueness of specific genomic regions at a given threshold of error. Here, we present the ‘uniqueome’, a genomic resource for understanding the uniquely mappable proportion of genomic sequences. Pre-computed data are available for human, mouse, fly and worm genomes in both color-space and nucleotide-space, and we demonstrate the utility of this resource as applied to the quantification of RNAseq data. Availability: Files, scripts and supplementary data are available from http://grimmond.imb.uq.edu.au/uniqueome/; the ISAS uniqueome aligner is freely available from http://www.imagenix.com/. Contact: n.cloonan@uq.edu.au. Supplementary information: Supplementary data are available at Bioinformatics online.
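Per-position mappability of the kind such a resource tabulates can be approximated naively: mark a position unique if the k-mer starting there occurs exactly once in the genome. This sketch (my own naming) ignores the reverse strand and the mismatch thresholds a real mappability resource accounts for:

```python
from collections import Counter

def mappability_track(genome, k):
    """Return a per-position track: 1 where the k-mer starting at that
    position occurs exactly once in the genome (uniquely mappable),
    0 elsewhere. Forward strand and exact matches only."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return [int(counts[genome[i:i + k]] == 1)
            for i in range(len(genome) - k + 1)]
```

Summing the track and dividing by its length gives the uniquely mappable proportion of the sequence at tag length k, the quantity needed to normalize coverage-based measurements over repetitive regions.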