51
Zhao X, Palmer LE, Bolanos R, Mircean C, Fasulo D, Wittenberg GM. EDAR: an efficient error detection and removal algorithm for next generation sequencing data. J Comput Biol 2010; 17:1549-60. [PMID: 20973743] [DOI: 10.1089/cmb.2010.0127]
Abstract
Genomic sequencing techniques introduce experimental errors into reads, which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length-k substrings (k-mers) it contains is calculated. The read is evaluated based on the frequency with which each k-mer occurs in the complete data set (k-count). For each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm. Based on the k-count of the cluster center, clusters are classified as error regions or non-error regions. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to locate the errors within each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After error removal, the error rate decreased for all data sets tested (∼35-fold reduction, on average). EDAR has accuracy comparable to methods that correct rather than remove errors, and it performs better when the error rate is greater than 3% for simulated data sets. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic; following error detection with error correction, rather than removal, may improve the assembly results.
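The detect-and-split idea lends itself to a compact sketch. The Python fragment below is a minimal illustration only: a fixed k-count threshold stands in for EDAR's variable-bandwidth mean-shift clustering, and all names and parameters are invented for the example. It counts k-mers over the whole data set, marks read positions covered by no trusted k-mer as a putative error region, and splits the read around that region.

```python
from collections import Counter

K = 5           # k-mer length; kept small for the toy example below
MIN_COUNT = 3   # k-count below which a k-mer is treated as error-supporting

def kmer_counts(reads, k=K):
    """Count every k-mer over the whole data set (the 'k-count' table)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def split_on_error_regions(read, counts, k=K, min_count=MIN_COUNT):
    """Remove putative error regions and return the remaining fragments."""
    n = len(read)
    if n < k:
        return [read]
    # Mark every base covered by at least one trusted (high-count) k-mer.
    trusted = [False] * n
    for i in range(n - k + 1):
        if counts[read[i:i + k]] >= min_count:
            for j in range(i, i + k):
                trusted[j] = True
    # Emit maximal runs of trusted bases as separate, error-free fragments.
    fragments, start = [], None
    for pos, ok in enumerate(trusted + [False]):   # sentinel closes the last run
        if ok and start is None:
            start = pos
        elif not ok and start is not None:
            fragments.append(read[start:pos])
            start = None
    return [f for f in fragments if len(f) >= k]

reads = ["ACGTACGTACGTACGTACGT"] * 5 + ["ACGTACGTTCGTACGTACGT"]  # last read has one error
table = kmer_counts(reads)
print(split_on_error_regions(reads[-1], table))   # -> ['ACGTACGT', 'CGTACGTACGT']
```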
Affiliation(s)
- Xiaohong Zhao
- Siemens Corporate Research, Princeton, New Jersey, USA
52
Boisvert S, Laviolette F, Corbeil J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol 2010; 17:1519-33. [PMID: 20958248] [DOI: 10.1089/cmb.2009.0238]
Abstract
An accurate genome sequence of a desired species is now a prerequisite for genome research. An important step in obtaining a high-quality genome sequence is to correctly assemble short reads into longer sequences accurately representing contiguous genomic regions. Current sequencing technologies continue to offer increases in throughput, and corresponding reductions in cost and time. Unfortunately, the benefit of obtaining a large number of reads is complicated by sequencing errors, with different biases being observed with each platform. Although software is available to assemble reads for each individual system, no procedure has been proposed for high-quality simultaneous assembly based on reads from a mix of different technologies. In this paper, we describe a parallel short-read assembler, called Ray, which has been developed to assemble reads obtained from a combination of sequencing platforms. We compared its performance to other assemblers on simulated and real datasets. We used a combination of Roche/454 and Illumina reads to assemble three different genomes. We showed that mixing sequencing technologies systematically reduces the number of contigs and the number of errors. Because of its open nature, this new tool will hopefully serve as a basis to develop an assembler that can be of universal utilization (availability: http://deNovoAssembler.sf.Net/). For online Supplementary Material, see www.liebertonline.com.
53
Shi H, Schmidt B, Liu W, Müller-Wittig W. A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware. J Comput Biol 2010; 17:603-15. [PMID: 20426693] [DOI: 10.1089/cmb.2009.0062]
Abstract
Emerging DNA sequencing technologies open up exciting new opportunities for genome sequencing by generating read data with a massive throughput. However, produced reads are significantly shorter and more error-prone compared to the traditional Sanger shotgun sequencing method. This poses challenges for de novo DNA fragment assembly algorithms in terms of both accuracy (to deal with short, error-prone reads) and scalability (to deal with very large input data sets). In this article, we present a scalable parallel algorithm for correcting sequencing errors in high-throughput short-read data so that error-free reads can be available before DNA fragment assembly, which is of high importance to many graph-based short-read assembly tools. The algorithm is based on spectral alignment and uses the Compute Unified Device Architecture (CUDA) programming model. To gain efficiency, we take advantage of CUDA texture memory, using a space-efficient Bloom filter data structure for spectrum membership queries. We have tested the runtime and accuracy of our algorithm using real and simulated Illumina data for different read lengths, error rates, input sizes, and algorithmic parameters. Using a CUDA-enabled mass-produced GPU (available for less than US$400 at any local computer outlet), we obtain speedups of 12-84 times for the parallelized error correction, and speedups of 3-63 times for both sequential preprocessing and parallelized error correction, compared to the publicly available Euler-SR program. Our implementation is freely available for download from http://cuda-ec.sourceforge.net.
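As a concrete illustration of the spectrum-membership step, here is a minimal pure-Python Bloom filter for k-mers. This is a sketch under invented parameters; the published tool stores the filter in CUDA texture memory and runs the queries on the GPU, none of which is reproduced here.

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership with a small false-positive rate."""

    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.n_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Build the k-mer spectrum of the reads, then query candidate k-mers against it.
K = 21
reads = ["ACGTTGCAACGTTGCAACGTTGCA", "ACGTTGCAACGTTGCAACGTTGCA"]
spectrum = BloomFilter()
for read in reads:
    for i in range(len(read) - K + 1):
        spectrum.add(read[i:i + K])

print("ACGTTGCAACGTTGCAACGTT" in spectrum)   # True: this k-mer occurs in the reads
print("A" * K in spectrum)                   # almost surely False
```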
Affiliation(s)
- Haixiang Shi
- School of Computer Engineering, Nanyang Technological University, Singapore
54
Paszkiewicz K, Studholme DJ. De novo assembly of short sequence reads. Brief Bioinform 2010; 11:457-72. [DOI: 10.1093/bib/bbq020]
55
Yang X, Dorman KS, Aluru S. Reptile: representative tiling for short read error correction. Bioinformatics 2010; 26:2526-33. [PMID: 20834037] [DOI: 10.1093/bioinformatics/btq468]
Abstract
MOTIVATION Error correction is critical to the success of next-generation sequencing applications, such as resequencing and de novo genome sequencing. It is especially important for high-throughput short-read sequencing, where reads are much shorter and more abundant, and errors more frequent than in traditional Sanger sequencing. Processing massive numbers of short reads with existing error correction methods is both compute and memory intensive, yet the results are far from satisfactory when applied to real datasets. RESULTS We present a novel approach, termed Reptile, for error correction in short-read data from next-generation sequencing. Reptile works with the spectrum of k-mers from the input reads, and corrects errors by simultaneously examining: (i) Hamming distance-based correction possibilities for potentially erroneous k-mers; and (ii) neighboring k-mers from the same read for correct contextual information. By not needing to store input data, Reptile has the favorable property that it can handle data that does not fit in main memory. In addition to sequence data, Reptile can make use of available quality score information. Our experiments show that Reptile outperforms previous methods in the percentage of errors removed from the data and the accuracy in true base assignment. In addition, a significant reduction in run time and memory usage has been achieved compared with previous methods, making it more practical for short-read error correction when sampling larger genomes. AVAILABILITY Reptile is implemented in C++ and is available through the link: http://aluru-sun.ece.iastate.edu/doku.php?id=software CONTACT aluru@iastate.edu.
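A stripped-down version of the Hamming-distance correction step is easy to sketch. The code below is illustrative only: it ignores Reptile's tiling of neighboring k-mers, its use of quality scores, and its memory-frugal layout, and simply accepts a substitution when exactly one distance-1 neighbor of an untrusted k-mer is trusted.

```python
from collections import Counter

BASES = "ACGT"

def trusted_kmers(reads, k, min_count=3):
    """The k-mer spectrum: every k-mer seen at least min_count times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, c in counts.items() if c >= min_count}

def correct_kmer(kmer, trusted):
    """Return the unique Hamming-distance-1 trusted neighbor, or None."""
    if kmer in trusted:
        return kmer
    candidates = set()
    for pos, base in enumerate(kmer):
        for alt in BASES:
            if alt != base:
                neighbor = kmer[:pos] + alt + kmer[pos + 1:]
                if neighbor in trusted:
                    candidates.add(neighbor)
    # Accept a correction only when it is unambiguous.
    return candidates.pop() if len(candidates) == 1 else None

k = 5
reads = ["GATTACAGATTACA"] * 4 + ["GATTACAGATCACA"]   # last read carries one error
trusted = trusted_kmers(reads, k)
print(correct_kmer("GATCA", trusted))   # -> 'GATTA', the single unambiguous fix
```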
Affiliation(s)
- Xiao Yang
- Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA
56
McComish BJ, Hills SFK, Biggs PJ, Penny D. Index-free de novo assembly and deconvolution of mixed mitochondrial genomes. Genome Biol Evol 2010; 2:410-24. [PMID: 20624744] [PMCID: PMC2997550] [DOI: 10.1093/gbe/evq029]
Abstract
Second-generation sequencing technology has allowed a very large increase in sequencing throughput. In order to make use of this high throughput, we have developed a pipeline for sequencing and de novo assembly of multiple mitochondrial genomes without the costs of indexing. Simulation studies on a mixture of diverse animal mitochondrial genomes showed that mitochondrial genomes could be reassembled from a high coverage of short (35 nt) reads, such as those generated by a second-generation Illumina Genome Analyzer. We then assessed this experimentally with long-range polymerase chain reaction products from mitochondria of a human, a rat, a bird, a frog, an insect, and a mollusc. Comparison with reference genomes was used for deconvolution of the assembled contigs rather than for mapping of sequence reads. As proof of concept, we report the complete mollusc mitochondrial genome of an olive shell (Amalda northlandica). It has a very unusual putative control region, which contains a structure that would probably only be detectable by next-generation sequencing. The general approach has considerable potential, especially when combined with indexed sequencing of different groups of genomes.
Affiliation(s)
- Bennet J McComish
- Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand.
57
Abstract
MOTIVATION High-throughput sequencing technologies produce large sets of short reads that may contain errors. These sequencing errors make de novo assembly challenging. Error correction aims to reduce the error rate prior to assembly. Many de novo sequencing projects use reads from several sequencing technologies to get the benefits of all the technologies used and to alleviate their shortcomings. However, combining such a mixed set of reads is problematic, as many tools are specific to one sequencing platform. The SOLiD sequencing platform is especially problematic in this regard because of the two-base color coding of the reads. Therefore, new tools for working with mixed read sets are needed. RESULTS We present an error correction tool for correcting substitutions, insertions and deletions in a mixed set of reads produced by various sequencing platforms. We first develop a method for correcting reads from any sequencing technology producing base space reads, such as the SOLEXA/Illumina and Roche/454 Life Sciences sequencing platforms. We then further refine the algorithm to correct the color space reads from the Applied Biosystems SOLiD sequencing platform together with normal base space reads. Our new tool is based on the SHREC program that is aimed at correcting SOLEXA/Illumina reads. Our experiments show that we can detect errors with 99% sensitivity and >98% specificity if the combined sequencing coverage of the sets is at least 12. We also show that the error rate of the reads is greatly reduced. AVAILABILITY The JAVA source code is freely available at http://www.cs.helsinki.fi/u/lmsalmel/hybrid-shrec/ CONTACT leena.salmela@cs.helsinki.fi
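For readers unfamiliar with why SOLiD reads are awkward to mix with base-space reads, the sketch below shows the standard two-base (dibase) color code and how a color read can be decoded back to bases only once the leading primer base is known; the decoding loop also makes clear why a single miscalled color corrupts every downstream base.

```python
# Standard SOLiD dibase color code: each color encodes the transition between
# two adjacent bases, not a base itself.
COLOR = {}
for a, b, c in [("A", "A", 0), ("C", "C", 0), ("G", "G", 0), ("T", "T", 0),
                ("A", "C", 1), ("C", "A", 1), ("G", "T", 1), ("T", "G", 1),
                ("A", "G", 2), ("G", "A", 2), ("C", "T", 2), ("T", "C", 2),
                ("A", "T", 3), ("T", "A", 3), ("C", "G", 3), ("G", "C", 3)]:
    COLOR[a, b] = c
# Inverse map: (previous base, color) -> next base.
DECODE = {(a, c): b for (a, b), c in COLOR.items()}

def to_color_space(seq):
    """Encode a base sequence as (first base, list of colors)."""
    colors = [COLOR[seq[i], seq[i + 1]] for i in range(len(seq) - 1)]
    return seq[0], colors

def to_base_space(first_base, colors):
    """Decode colors back to bases; each base depends on all previous colors."""
    bases = [first_base]
    for c in colors:
        bases.append(DECODE[bases[-1], c])
    return "".join(bases)

first, colors = to_color_space("ATGGCA")
print(colors)                          # [3, 1, 0, 3, 1]
print(to_base_space(first, colors))    # 'ATGGCA'
```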
Affiliation(s)
- Leena Salmela
- Department of Computer Science, PO Box 68 (Gustaf Hällströmin katu 2b), FI-00014 University of Helsinki, Finland.
58
Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics 2010; 95:315-27. [PMID: 20211242] [DOI: 10.1016/j.ygeno.2010.03.001]
Abstract
The emergence of next-generation sequencing platforms led to a resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly.
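To make the de Bruijn side of that contrast concrete, here is a toy construction in Python (not any of the reviewed assemblers): each k-mer contributes a directed edge between its (k-1)-mer prefix and suffix, and unambiguous paths can be read off as naive contigs.

```python
from collections import defaultdict

def build_debruijn(reads, k):
    """Map each (k-1)-mer node to the list of its successor nodes (edges = k-mers)."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Follow the unique outgoing edge while one exists (a naive 'contig')."""
    path, node, seen = [start], start, {start}
    while len(set(graph[node])) == 1:
        node = graph[node][0]
        if node in seen:              # stop rather than loop around a cycle
            break
        seen.add(node)
        path.append(node)
    # Merge the overlapping (k-1)-mers back into a sequence.
    return path[0] + "".join(n[-1] for n in path[1:])

reads = ["ACGTTG", "GTTGCA"]
g = build_debruijn(reads, k=4)
print(walk_unambiguous(g, "ACG"))   # -> 'ACGTTGCA', the two reads merged across their overlap
```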
Affiliation(s)
- Jason R Miller
- J. Craig Venter Institute, Rockville, MD 20850-3343, USA.
59
Abstract
Metagenomics is a discipline that enables the genomic study of uncultured microorganisms. Faster, cheaper sequencing technologies and the ability to sequence uncultured microbes sampled directly from their habitats are expanding and transforming our view of the microbial world. Distilling meaningful information from the millions of new genomic sequences presents a serious challenge to bioinformaticians. In cultured microbes, the genomic data come from a single clone, making sequence assembly and annotation tractable. In metagenomics, the data come from heterogeneous microbial communities, sometimes containing more than 10,000 species, with the sequence data being noisy and partial. From sampling, to assembly, to gene calling and function prediction, bioinformatics faces new demands in interpreting voluminous, noisy, and often partial sequence data. Although metagenomics is a relative newcomer to science, the past few years have seen an explosion in computational methods applied to metagenomic-based research. It is therefore not within the scope of this article to provide an exhaustive review. Rather, we provide here a concise yet comprehensive introduction to the current computational requirements presented by metagenomics, and review the recent progress made. We also note whether there is software that implements any of the methods presented here, and briefly review its utility. Nevertheless, it would be useful if readers of this article would avail themselves of the comment section provided by this journal, and relate their own experiences. Finally, the last section of this article provides a few representative studies illustrating different facets of recent scientific discoveries made using metagenomics.
Affiliation(s)
- John C. Wooley
- Community Cyberinfrastructure for Marine Microbial Ecology Research and Analysis, California Institute for Telecommunications and Information Technology, University of California San Diego, La Jolla, California, United States of America
- Adam Godzik
- Community Cyberinfrastructure for Marine Microbial Ecology Research and Analysis, California Institute for Telecommunications and Information Technology, University of California San Diego, La Jolla, California, United States of America
- Program in Bioinformatics and Systems Biology, Burnham Institute for Medical Research, La Jolla, California, United States of America
- Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
- Department of Computer Science and Software Engineering, Miami University, Oxford, Ohio, United States of America
60
Schwartz DC, Waterman MS. New generations: sequencing machines and their computational challenges. J Comput Sci Technol 2010; 25:3-9. [PMID: 22121326] [PMCID: PMC3222932] [DOI: 10.1007/s11390-010-9300-x]
Abstract
New generation sequencing systems are changing how molecular biology is practiced. The widely promoted $1000 genome will be a reality, with attendant changes for healthcare, including personalized medicine. More broadly, the genomes of many new organisms, with large samplings from populations, will be commonplace. What is less appreciated are the explosive demands on computation, both for CPU cycles and storage, as well as the need for new computational methods. In this article we survey some of these developments and demands.
Affiliation(s)
- David C. Schwartz
- Laboratory for Molecular and Computational Genomics, Department of Chemistry and Laboratory of Genetics, University of Wisconsin-Madison, WI 53706 USA
- Michael S. Waterman
- Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089-2910, USA and Tsinghua University, Beijing, China
61
Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler. Lecture Notes in Computer Science 2010. [DOI: 10.1007/978-3-642-12683-3_28]
62
Abstract
Whole genome shotgun assembly is the process of taking many short sequenced segments (reads) and reconstructing the genome from which they originated. We demonstrate how the technique of bidirected network flow can be used to explicitly model the double-stranded nature of DNA for genome assembly. By combining an algorithm for the Chinese Postman Problem on bidirected graphs with the construction of a bidirected de Bruijn graph, we are able to find the shortest double-stranded DNA sequence that contains a given set of k-long DNA molecules. This is the first exact polynomial time algorithm for the assembly of a double-stranded genome. Furthermore, we propose a maximum likelihood framework for assembling the genome that is the most likely source of the reads, in lieu of the standard maximum parsimony approach (which finds the shortest genome subject to some constraints). In this setting, we give a bidirected network flow-based algorithm that, by taking advantage of high coverage, accurately estimates the copy counts of repeats in a genome. Our second algorithm combines these predicted copy counts with matepair data in order to assemble the reads into contigs. We run our algorithms on simulated read data from Escherichia coli and predict copy counts with extremely high accuracy, while assembling long contigs.
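The double-stranded issue that the authors model explicitly with bidirected edges is often handled in simpler assemblers by collapsing each k-mer with its reverse complement into one canonical representative. The sketch below shows only that canonicalization step, as an illustration of the underlying problem rather than of the bidirected-flow algorithm of the paper.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """A k-mer and its reverse complement name the same double-stranded molecule;
    use the lexicographically smaller spelling as the canonical representative."""
    return min(kmer, revcomp(kmer))

def canonical_kmer_counts(reads, k):
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            key = canonical(read[i:i + k])
            counts[key] = counts.get(key, 0) + 1
    return counts

# Reads drawn from opposite strands of the same locus collapse onto the same k-mers.
forward = "ACCGTAA"
reverse = revcomp(forward)            # 'TTACGGT'
print(canonical_kmer_counts([forward, reverse], k=5))   # every count is 2
```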
Affiliation(s)
- Paul Medvedev
- Department of Computer Science, University of Toronto, Toronto, Canada
63
Ye Y, Tang H. An ORFome assembly approach to metagenomics sequences analysis. J Bioinform Comput Biol 2009; 7:455-71. [PMID: 19507285] [DOI: 10.1142/s0219720009004151]
Abstract
Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. The current analyses of metagenomics data largely rely on the computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads can be searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The whole computational framework consists of three steps. Each read from a metagenomics project is first annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e. the ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied the MetaORFA approach to several metagenomics datasets with low-coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, the ORFome assembly significantly increases the sensitivity of homology searching, and may potentially improve the diversity analysis of metagenomic data. This improvement is especially useful for metagenomic projects when genome assembly does not work because of the low sequence coverage.
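The first step of the pipeline, calling putative ORFs on individual reads, can be sketched in a few lines. The following naive six-frame scan for start-to-stop open reading frames is for illustration only; real gene callers for metagenomic reads use trained models rather than this simple rule.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def orfs_in_frame(seq, frame):
    """Yield (start, end) of ATG...stop ORFs in one reading frame of seq."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            yield start, i + 3
            start = None

def six_frame_orfs(read, min_len=30):
    """Collect ORFs from both strands and all three frames of a read."""
    found = []
    for strand, seq in (("+", read), ("-", read.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame):
                if end - start >= min_len:
                    found.append((strand, frame, seq[start:end]))
    return found

read = "CCATGGCTGCTGCTGCTGCTGCTGCTGCTTAACC"
for strand, frame, orf in six_frame_orfs(read):
    print(strand, frame, orf)   # one 30 nt ORF on the forward strand, frame 2
```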
Affiliation(s)
- Yuzhen Ye
- School of Informatics, Indiana University, Bloomington, IN 47408, USA.
64
Aubin-Horth N, Renn SCP. Genomic reaction norms: using integrative biology to understand molecular mechanisms of phenotypic plasticity. Mol Ecol 2009; 18:3763-80. [PMID: 19732339] [DOI: 10.1111/j.1365-294x.2009.04313.x]
Abstract
Phenotypic plasticity is the development of different phenotypes from a single genotype, depending on the environment. Such plasticity is a pervasive feature of life, is observed for various traits and is often argued to be the result of natural selection. A thorough study of phenotypic plasticity should thus include an ecological and an evolutionary perspective. Recent advances in large-scale gene expression technology make it possible to also study plasticity from a molecular perspective, and the addition of these data will help answer long-standing questions about this widespread phenomenon. In this review, we present examples of integrative studies that illustrate the molecular and cellular mechanisms underlying plastic traits, and show how new techniques will grow in importance in the study of these plastic molecular processes. These techniques include: (i) heterologous hybridization to DNA microarrays; (ii) next generation sequencing technologies applied to transcriptomics; (iii) techniques for studying the function of noncoding small RNAs; and (iv) proteomic tools. We also present recent studies on genetic model systems that uncover how environmental cues triggering different plastic responses are sensed and integrated by the organism. Finally, we describe recent work on changes in gene expression in response to an environmental cue that persist after the cue is removed. Such long-term responses are made possible by epigenetic molecular mechanisms, including DNA methylation. The results of these current studies help us outline future avenues for the study of plasticity.
Affiliation(s)
- Nadia Aubin-Horth
- Département de Sciences biologiques, Université de Montréal, Québec, Canada.
65
Studholme DJ, Ibanez SG, MacLean D, Dangl JL, Chang JH, Rathjen JP. A draft genome sequence and functional screen reveals the repertoire of type III secreted proteins of Pseudomonas syringae pathovar tabaci 11528. BMC Genomics 2009; 10:395. [PMID: 19703286] [PMCID: PMC2745422] [DOI: 10.1186/1471-2164-10-395]
Abstract
Background Pseudomonas syringae is a widespread bacterial pathogen that causes disease on a broad range of economically important plant species. Pathogenicity of P. syringae strains is dependent on the type III secretion system, which secretes a suite of up to about thirty virulence 'effector' proteins into the host cytoplasm, where they subvert the eukaryotic cell physiology and disrupt host defences. P. syringae pathovar tabaci naturally causes disease on wild tobacco, the model member of the Solanaceae (a family that includes many crop species), as well as on soybean. Results We used the 'next-generation' Illumina sequencing platform and the Velvet short-read assembly program to generate a 145X deep 6,077,921 nucleotide draft genome sequence for P. syringae pathovar tabaci strain 11528. From our draft assembly, we predicted 5,300 potential genes encoding proteins at least 100 amino acids long, of which 303 (5.72%) had no significant sequence similarity to those encoded by the three previously fully sequenced P. syringae genomes. Of the core set of Hrp Outer Proteins that are conserved in three previously fully sequenced P. syringae strains, most were also conserved in strain 11528, including AvrE1, HopAH2, HopAJ2, HopAK1, HopAN1, HopI, HopJ1, HopX1, HrpK1 and HrpW1. However, the hrpZ1 gene is partially deleted and hopAF1 is completely absent in 11528. The draft genome of strain 11528 also encodes close homologues of HopO1, HopT1, HopAH1, HopR1, HopV1, HopAG1, HopAS1, HopAE1, HopAR1, HopF1, and HopW1 and a degenerate HopM1'. Using a functional screen, we confirmed that hopO1, hopT1, hopAH1, hopM1', hopAE1, hopAR1, and hopAI1' are part of the virulence-associated HrpL regulon, though the hopAI1' and hopM1' sequences were degenerate with premature stop codons. We also discovered two additional HrpL-regulated effector candidates and an HrpL-regulated distant homologue of avrPto1. Conclusion The draft genome sequence facilitates the continued development of P. syringae pathovar tabaci on wild tobacco as an attractive model system for studying bacterial disease on plants. The catalogue of effectors sheds further light on the evolution of pathogenicity and host-specificity, as well as providing a set of molecular tools for the study of plant defence mechanisms. We also discovered several large genomic regions in Pta 11528 that do not share detectable nucleotide sequence similarity with previously sequenced Pseudomonas genomes. These regions may include horizontally acquired islands that possibly contribute to pathogenicity or epiphytic fitness of Pta 11528.
66
Schröder J, Schröder H, Puglisi SJ, Sinha R, Schmidt B. SHREC: a short-read error correction method. Bioinformatics 2009; 25:2157-63. [PMID: 19542152] [DOI: 10.1093/bioinformatics/btp379]
Abstract
MOTIVATION Second-generation sequencing technologies produce a massive amount of short reads in a single experiment. However, sequencing errors can cause major problems when using this approach for de novo sequencing applications. Moreover, existing error correction methods have been designed and optimized for shotgun sequencing. Therefore, there is an urgent need for the design of fast and accurate computational methods and tools for error correction of large amounts of short read data. RESULTS We present SHREC, a new algorithm for correcting errors in short-read data that uses a generalized suffix trie on the read data as the underlying data structure. Our results show that the method can identify erroneous reads with sensitivity and specificity of over 99% and 96%, respectively, for simulated data with error rates of up to 3%, as well as for real data. Furthermore, it achieves an error correction accuracy of over 80% for simulated data and over 88% for real data. These results are clearly superior to previously published approaches. SHREC is available as an efficient open-source Java implementation that allows processing of 10 million short reads on a standard workstation.
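A toy version of the suffix-trie idea can be written compactly: insert every read suffix into a trie with per-node occurrence counts, then flag branches whose count is a small fraction of that of their siblings. The sketch below ignores SHREC's statistical thresholds, its handling of reverse complements, and its memory-conscious Java implementation, and the ratio cutoff is an invented parameter.

```python
def build_suffix_trie(reads, max_depth=12):
    """Trie over all read suffixes (truncated at max_depth), with node counts."""
    root = {"count": 0, "children": {}}
    for read in reads:
        for start in range(len(read)):
            node = root
            for ch in read[start:start + max_depth]:
                node = node["children"].setdefault(ch, {"count": 0, "children": {}})
                node["count"] += 1
    return root

def weak_branches(node, prefix="", ratio=0.1, results=None):
    """Collect prefixes whose count is tiny relative to their sibling branches."""
    if results is None:
        results = []
    children = node["children"]
    if children:
        top = max(child["count"] for child in children.values())
        for ch, child in children.items():
            if child["count"] <= ratio * top:
                results.append(prefix + ch)          # likely sequencing error
            else:
                weak_branches(child, prefix + ch, ratio, results)
    return results

reads = ["GATTACAGATTACA"] * 20 + ["GATTACAGATCACA"]   # one read with one error
print(weak_branches(build_suffix_trie(reads)))
# every flagged branch stems from the single error in the last read
```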
Affiliation(s)
- Jan Schröder
- Institut für Informatik, Christian-Albrecht-Universität Kiel, Herman-Rodewald-Strasse 3, 24118 Kiel, Germany.
67
Assefa S, Keane TM, Otto TD, Newbold C, Berriman M. ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics 2009; 25:1968-9. [PMID: 19497936] [PMCID: PMC2712343] [DOI: 10.1093/bioinformatics/btp347]
Abstract
Summary: Due to the availability of new sequencing technologies, we are now increasingly interested in sequencing closely related strains of existing finished genomes. Recently, a number of de novo and mapping-based assemblers have been developed to produce high-quality draft genomes from new sequencing technology reads. New tools are necessary to take contigs from a draft assembly through to a fully contiguated genome sequence. ABACAS is intended as a tool to rapidly contiguate (align, order, orientate), visualize and design primers to close gaps on shotgun-assembled contigs based on a reference sequence. The input to ABACAS is a set of contigs, which are aligned to the reference genome, ordered and orientated, and visualized in the ACT comparative browser; optimal primer sequences are automatically generated. Availability and Implementation: ABACAS is implemented in Perl and is freely available for download from http://abacas.sourceforge.net Contact: sa4@sanger.ac.uk
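The core ordering-and-orienting step can be illustrated in a few lines of Python. The alignment hits below are fabricated tuples rather than the nucmer/MUMmer output ABACAS actually parses: contigs are sorted by reference coordinate, reverse-complemented when they matched the minus strand, and joined into a pseudomolecule with runs of N marking the gaps left for primer-based closure.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]

def pseudomolecule(contigs, hits, gap=100):
    """Order and orient contigs along a reference.

    contigs: dict of name -> sequence
    hits:    list of (contig_name, ref_start, strand) best placements
    Returns the contigs concatenated in reference order, gaps padded with N.
    """
    ordered = []
    for name, ref_start, strand in sorted(hits, key=lambda h: h[1]):
        seq = contigs[name]
        ordered.append(seq if strand == "+" else revcomp(seq))
    return ("N" * gap).join(ordered)

contigs = {"c1": "ACGTACGTAA", "c2": "TTGGCCAATT", "c3": "GGGTTTAAAC"}
hits = [("c2", 5000, "-"), ("c1", 120, "+"), ("c3", 9800, "+")]
print(pseudomolecule(contigs, hits, gap=10))   # c1, then revcomp(c2), then c3
```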
Affiliation(s)
- Samuel Assefa
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, UK.
68
Abstract
Research into genome assembly algorithms has experienced a resurgence due to new challenges created by the development of next generation sequencing technologies. Several genome assemblers have been published in recent years specifically targeted at the new sequence data; however, the ever-changing technological landscape leads to the need for continued research. In addition, the low cost of next generation sequencing data has led to an increased use of sequencing in new settings. For example, the new field of metagenomics relies on large-scale sequencing of entire microbial communities instead of isolate genomes, leading to new computational challenges. In this article, we outline the major algorithmic approaches for genome assembly and describe recent developments in this domain.
Affiliation(s)
- Mihai Pop
- Department of Computer Science and the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park, MD 20742, USA.
69
Su Y, Lin L, Tian G, Chen C, Liu T, Xu X, Qi X, Zhang X, Yang H. Preparing a re-sequencing DNA library of 2 cancer candidate genes using the ligation-by-amplification protocol by two PCR reactions. Sci China C Life Sci 2009; 52:483-91. [PMID: 19471873] [DOI: 10.1007/s11427-009-0066-8]
Abstract
To meet the needs of large-scale genomic/genetic studies, the next-generation massively parallelized sequencing technologies provide high-throughput, low-cost and less labor-intensive sequencing services, with subsequent bioinformatic software and laboratory methods developed to expand their applications in various types of research. PCR-based genomic/genetic studies, which have significant usage in association studies like cancer research, have not benefited much from these next-generation sequencing technologies, because the shotgun re-sequencing strategy used by sequencing machines such as the Illumina/Solexa Genome Analyzer may not be applicable to direct re-sequencing of short target regions like those in PCR-based genomic/genetic studies. Although several methods have been proposed to solve this problem, including microarray-based genomic selection and selector-based technologies, they require advanced equipment and procedures, which limits their application in many laboratories. By contrast, we overcame such potential drawbacks by utilizing a ligation-by-amplification (LBA) protocol, a method using a pair of Universal Adapters to randomly ligate target regions in a two-step PCR procedure, whose Long LBA products are easily fragmented and sequenced on a next-generation sequencing machine. In this proof-of-concept study, we chose the consensus coding sequences of two human cancer genes, BRCA1 and BRCA2, as target regions and specifically designed LBA primer pairs to amplify and randomly ligate them. 70 target sequences were successfully amplified and ligated into Long LBA products, which were then fragmented to construct DNA libraries for sequencing on both a conventional Sanger sequencer, the ABI 3730xl DNA Analyzer, and the next-generation 'sequencing by synthesis' Illumina/Solexa Genome Analyzer. Bioinformatic analysis demonstrated the utility and efficiency (including the coverage and depth of each target sequence and the effectiveness of SNP detection) of using the LBA protocol to facilitate PCR-based re-sequencing and genetic-variant-detection studies on a next-generation sequencing machine, raising the prospect of various PCR-based genomic/genetic studies using this strategy.
Affiliation(s)
- Yeyang Su
- Graduate School of Chinese Academy of Sciences, Beijing, 100049, China
70
Blazewicz J, Bryja M, Figlerowicz M, Gawron P, Kasprzak M, Kirton E, Platt D, Przybytek J, Swiercz A, Szajkowski L. Whole genome assembly from 454 sequencing output via modified DNA graph concept. Comput Biol Chem 2009; 33:224-30. [PMID: 19477687] [DOI: 10.1016/j.compbiolchem.2009.04.005]
Abstract
Recently, 454 Life Sciences Corporation proposed a new biochemical approach to DNA sequencing (the 454 sequencing). It is based on the pyrosequencing protocol. The 454 sequencing aims to give reliable output at a low cost and in a short time. The produced sequences are shorter than reads produced by classical methods. Our paper proposes a new DNA assembly algorithm which deals well with such data and outperforms other assembly algorithms used in practice. The constructed SR-ASM algorithm is a heuristic method based on a graph model, the graph being a modified DNA graph proposed for the DNA sequencing-by-hybridization procedure. Other new features of the assembly algorithm include temporary compression of input sequences and a new, fast multiple alignment heuristic that takes advantage of the way the output data from 454 sequencing are presented and coded. The usefulness of the algorithm has been proved in tests on raw data generated during sequencing of the whole 1.84 Mbp genome of the bacterium Prochlorococcus marinus and also on a part of chromosome 15 of Homo sapiens. The source code of SR-ASM can be downloaded from http://bio.cs.put.poznan.pl/ in the section 'Current research' --> 'DNA Assembly'. Among publicly available assemblers, our algorithm generated the best results, especially in the number of contigs produced and in the lengths of the contigs with high similarity to the genome sequence.
Affiliation(s)
- Jacek Blazewicz
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
71
Forster RE, Chiesl TN, Fredlake CP, White CV, Barron AE. Hydrophobically modified polyacrylamide block copolymers for fast, high-resolution DNA sequencing in microfluidic chips. Electrophoresis 2009; 29:4669-76. [PMID: 19053064] [DOI: 10.1002/elps.200800353]
Abstract
By using a microfluidic electrophoresis platform to perform DNA sequencing, genomic information can be obtained more quickly and affordably than with the currently employed capillary array electrophoresis instruments. Previous research in our group has shown that physically cross-linked, hydrophobically modified polyacrylamide matrices separate dsDNA more effectively than linear polyacrylamide (LPA) solutions. Expanding upon this work, we have synthesized a series of LPA-co-dihexylacrylamide block copolymers specifically designed to electrophoretically sequence ssDNA quickly and efficiently on a microfluidic device. By incorporating very small amounts of N,N-dihexylacrylamide, a hydrophobic monomer, these copolymer solutions achieved up to approximately 10% increases in average DNA sequencing read length over LPA homopolymer solutions of matched molar mass. Additionally, the inclusion of the small amount of hydrophobe does not significantly increase the polymer solution viscosities, relative to LPA solutions, so that channel loading times between the copolymers and the homopolymers are similar. The resulting polymer solutions are capable of providing enhanced sequencing separations in a short period of time without compromising the ability to rapidly load and unload the matrix from a microfluidic device.
Affiliation(s)
- Ryan E Forster
- Department of Materials Science and Engineering, Northwestern University, Evanston, IL, USA
72
Abstract
Background New short-read sequencing technologies produce enormous volumes of 25–30 base paired-end reads. The resulting reads have vastly different characteristics from those produced by Sanger sequencing, and require different approaches from those of the previous generation of sequence assemblers. In this paper, we present a short-read de novo assembler particularly targeted at the new ABI SOLiD sequencing technology. Results This paper presents what we believe to be the first de novo sequence assembly results on real data from the emerging SOLiD platform, introduced by Applied Biosystems. Our assembler SHORTY augments short paired reads using a trivially small number (5 – 10) of seeds of length 300 – 500 bp. These seeds enable us to produce significant assemblies using short-read coverage of no more than 100×, which can be obtained in a single run of these high-capacity sequencers. SHORTY exploits two ideas which we believe to be of interest to the short-read assembly community: (1) using single seed reads to crystallize assemblies, and (2) estimating intercontig distances accurately from multiple spanning paired-end reads. Conclusion We demonstrate effective assemblies (N50 contig sizes ~40 kb) of three different bacterial species using simulated SOLiD data. Sequencing artifacts limit our performance on real data; however, our results on this data are substantially better than those achieved by competing assemblers.
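The second idea, estimating intercontig distances from multiple spanning paired-end reads, reduces to simple arithmetic once the read placements are known. The sketch below uses hypothetical inputs and a simplifying same-strand assumption; SHORTY itself works from SOLiD read placements and a known insert-size distribution.

```python
from statistics import mean, stdev

def estimate_gap(pairs, len_a, insert_size):
    """Estimate the gap between contig A and contig B from spanning mate pairs.

    pairs: list of (start_in_a, end_in_b) for mate pairs whose first read maps
           near the end of contig A and whose second read maps near the start
           of contig B, both taken on the forward strand (a simplification).
    insert_size: library insert size, from the first base of read 1 to the
           last base of read 2.
    """
    gaps = [insert_size - (len_a - start_a) - end_b
            for start_a, end_b in pairs]
    return mean(gaps), (stdev(gaps) if len(gaps) > 1 else 0.0)

# Three spanning pairs from a 3000 bp insert library; contig A is 10 kb long.
pairs = [(9100, 95), (9050, 160), (9180, 40)]
gap, spread = estimate_gap(pairs, len_a=10_000, insert_size=3000)
print(round(gap), round(spread))   # roughly a 2 kb gap, with its spread
```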
73
Chin FYL, Leung HCM, Li WL, Yiu SM. Finding optimal threshold for correction error reads in DNA assembling. BMC Bioinformatics 2009; 10 Suppl 1:S15. [PMID: 19208114] [PMCID: PMC2648749] [DOI: 10.1186/1471-2105-10-s1-s15]
Abstract
BACKGROUND DNA assembling is the problem of determining the nucleotide sequence of a genome from its substrings, called reads. In the experiments, there may be errors in the reads, which affect the performance of DNA assembly algorithms. Existing algorithms, e.g. ECINDEL and SRCorr, correct the error reads by considering the number of times each length-k substring of the reads appears in the input. They treat those length-k substrings that appear at least M times as correct substrings and correct the error reads based on these substrings. However, since the threshold M is chosen without any solid theoretical analysis, these algorithms cannot guarantee their performance on error correction. RESULTS In this paper, we propose a method to calculate the probabilities of false positive and false negative when determining whether a length-k substring is correct using threshold M. Based on these probabilities, we find the optimal threshold M that minimizes the total error (false positives plus false negatives). Experimental results on both real and simulated data showed that our calculation is correct and that we can reduce the total number of error substrings by 77.6% and 65.1% when compared to ECINDEL and SRCorr, respectively. CONCLUSION We introduced a method to calculate the probabilities of false positives and false negatives for length-k substrings using different thresholds. Based on this calculation, we found the optimal threshold that minimizes the total error of false positives plus false negatives.
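The threshold calculation can be mimicked with a toy two-component model: counts of correct k-mers concentrate near the k-mer coverage, counts of erroneous k-mers concentrate near one, and the optimal M minimizes the expected false positives plus false negatives. The Poisson mixture and all parameter values below are invented for illustration and are not the exact model of the paper.

```python
from math import exp, factorial

def poisson_pmf(lam, x):
    return exp(-lam) * lam ** x / factorial(x)

def total_error(m, cov_true, cov_err, frac_err):
    """Expected misclassification rate when k-mers with count >= m are trusted."""
    # False negative: a correct k-mer whose count falls below m.
    fn = (1 - frac_err) * sum(poisson_pmf(cov_true, x) for x in range(m))
    # False positive: an erroneous k-mer whose count reaches m or more.
    fp = frac_err * (1 - sum(poisson_pmf(cov_err, x) for x in range(m)))
    return fn + fp

def optimal_threshold(cov_true=30.0, cov_err=1.2, frac_err=0.3, max_m=60):
    return min(range(1, max_m + 1),
               key=lambda m: total_error(m, cov_true, cov_err, frac_err))

m = optimal_threshold()
print(m, total_error(m, 30.0, 1.2, 0.3))   # the M balancing the two error types
```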
Affiliation(s)
- Francis Y L Chin
- Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong, PR China.
74
75
Chaisson MJ, Brinza D, Pevzner PA. De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Res 2008; 19:336-46. [PMID: 19056694] [DOI: 10.1101/gr.079053.108]
Abstract
Increasing read length is currently viewed as the crucial condition for fragment assembly with next-generation sequencing technologies. However, introducing mate-paired reads (separated by a gap of length GapLength) opens a possibility to transform short mate-pairs into long mate-reads of length approximately GapLength, and thus raises the question of whether the read length (as opposed to GapLength) even matters. We describe a new tool, EULER-USR, for assembling mate-paired short reads and use it to analyze the question of whether the read length matters. We further complement the ongoing experimental efforts to maximize read length by a new computational approach for increasing the effective read length. While the common practice is to trim the error-prone tails of the reads, we present an approach that substitutes trimming with error correction using repeat graphs. An important and counterintuitive implication of this result is that one may extend sequencing reactions that degrade with length "past their prime" to where the error rate grows above what is normally acceptable for fragment assembly.
Affiliation(s)
- Mark J Chaisson
- Bioinformatics Program, University of California San Diego, La Jolla, California 92093, USA.
76
Hert DG, Fredlake CP, Barron AE. Advantages and limitations of next-generation sequencing technologies: A comparison of electrophoresis and non-electrophoresis methods. Electrophoresis 2008; 29:4618-26. [DOI: 10.1002/elps.200800456]
77
Abstract
This article is concerned with statistical modeling of shotgun resequencing data and the use of such data for population genetic inference. We model data produced by sequencing-by-synthesis technologies such as the Solexa, 454, and polymerase colony (polony) systems, whose use is becoming increasingly widespread. We show how such data can be used to estimate evolutionary parameters (mutation and recombination rates), despite the fact that the data do not necessarily provide complete or aligned sequence information. We also present two refinements of our methods: one that is more robust to sequencing errors and another that can be used when no reference genome is available.
78
Sorber K, Chiu C, Webster D, Dimon M, Ruby JG, Hekele A, DeRisi JL. The long march: a sample preparation technique that enhances contig length and coverage by high-throughput short-read sequencing. PLoS One 2008; 3:e3495. [PMID: 18941527] [PMCID: PMC2566813] [DOI: 10.1371/journal.pone.0003495]
Abstract
High-throughput short-read technologies have revolutionized DNA sequencing by drastically reducing the cost per base of sequencing information. Despite producing gigabases of sequence per run, these technologies still present obstacles in resequencing and de novo assembly applications due to biased or insufficient target sequence coverage. We present here a simple sample preparation method termed the “long march” that increases both contig lengths and target sequence coverage using high-throughput short-read technologies. By incorporating a Type IIS restriction enzyme recognition motif into the sequencing primer adapter, successive rounds of restriction enzyme cleavage and adapter ligation produce a set of nested sub-libraries from the initial amplicon library. Sequence reads from these sub-libraries are offset from each other with enough overlap to aid assembly and contig extension. We demonstrate the utility of the long march in resequencing of the Plasmodium falciparum transcriptome, where the number of genomic bases covered was increased by 39%, as well as in metagenomic analysis of a serum sample from a patient with hepatitis B virus (HBV)-related acute liver failure, where the number of HBV bases covered was increased by 42%. We also offer a theoretical optimization of the long march for de novo sequence assembly.
Affiliation(s)
- Katherine Sorber
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Charles Chiu
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Division of Infectious Diseases, Department of Medicine, University of California San Francisco, San Francisco, California, United States of America
- Dale Webster
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Biological and Medical Informatics Program, University of California San Francisco, San Francisco, California, United States of America
- Michelle Dimon
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Biological and Medical Informatics Program, University of California San Francisco, San Francisco, California, United States of America
- J. Graham Ruby
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Armin Hekele
- Department of Microbiology and Immunology, University of California San Francisco, San Francisco, California, United States of America
- Joseph L. DeRisi
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Howard Hughes Medical Institute, University of California San Francisco, San Francisco, California, United States of America
79
Rapidly developing functional genomics in ecological model systems via 454 transcriptome sequencing. Genetica 2008; 138:433-51. [PMID: 18931921] [DOI: 10.1007/s10709-008-9326-y]
Abstract
Next generation sequencing technology affords new opportunities in ecological genetics. This paper addresses how an ecological genetics research program focused on a phenotype of interest can quickly move from no genetic resources to having various functional genomic tools. 454 sequencing and its error rates are discussed, followed by a review of de novo transcriptome assemblies focused on the first successful de novo assembly which happens to be in an ecological model system (the Glanville fritillary butterfly). The potential future developments in 454 sequencing are also covered. Particular attention is paid to the difficulties ecological geneticists are likely to encounter through reviewing relevant studies in both model and non-model systems. Various post-sequencing issues and applications of 454 generated data are presented (e.g. database management, microarray construction, molecular marker and candidate gene development). How to use species with genomic resources to inform study of those without is also discussed. In closing, some of the drawbacks of 454 sequencing are presented along with future prospects of this technology.
80
Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res 2008; 18:1814-28. [PMID: 18849524] [DOI: 10.1101/gr.076554.108]
Abstract
Pairwise whole-genome alignment involves the creation of a homology map, capable of performing a near complete transformation of one genome into another. For multiple genomes this problem is generalized to finding a set of consistent homology maps for converting each genome in the set of aligned genomes into any of the others. The problem can be divided into two principal stages. First, the partitioning of the input genomes into a set of colinear segments, a process which essentially deals with the complex processes of rearrangement. Second, the generation of a base pair level alignment map for each colinear segment. We have developed a new genome-wide segmentation program, Enredo, which produces colinear segments from extant genomes handling rearrangements, including duplications. We have then applied the new alignment program Pecan, which makes the consistency alignment methodology practical at a large scale, to create a new set of genome-wide mammalian alignments. We test both Enredo and Pecan using novel and existing assessment analyses that incorporate both real biological data and simulations, and show that both independently and in combination they outperform existing programs. Alignments from our pipeline are publicly available within the Ensembl genome browser.
Affiliation(s)
- Benedict Paten
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA.
81
Hajirasouliha I, Hormozdiari F, Sahinalp SC, Birol I. Optimal pooling for genome re-sequencing with ultra-high-throughput short-read technologies. Bioinformatics 2008; 24:i32-40. [PMID: 18586730] [PMCID: PMC2718651] [DOI: 10.1093/bioinformatics/btn173]
Abstract
New generation sequencing technologies offer unique opportunities and challenges for re-sequencing studies. In this article, we focus on re-sequencing experiments using the Solexa technology, based on bacterial artificial chromosome (BAC) clones, and address an experimental design problem. In these specific experiments, approximate coordinates of the BACs on a reference genome are known, and fine-scale differences between the BAC sequences and the reference are of interest. The high-throughput characteristics of the sequencing technology make it possible to multiplex BAC sequencing experiments by pooling BACs for a cost-effective operation. However, the way BACs are pooled in such re-sequencing experiments has an effect on the downstream analysis of the generated data, mostly due to subsequences common to multiple BACs. The experimental design strategy we develop in this article offers combinatorial solutions based on approximation algorithms for the well-known max n-cut problem and the related max n-section problem on hypergraphs. Our algorithms, when applied to a number of sample cases, give more than a 2-fold performance improvement over random partitioning. Contact: cenk@cs.sfu.ca
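The underlying objective is easy to state in code: place BACs into pools so that subsequences shared by several BACs end up in different pools. Below is a greedy stand-in for the max n-cut/max n-section approximation developed in the paper, with toy data and none of the authors' guarantees: each BAC simply joins the pool with which it currently shares the fewest k-mers.

```python
def kmer_set(seq, k=12):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_pooling(bacs, n_pools, k=12):
    """Assign each BAC to the pool with which it shares the fewest k-mers."""
    pools = {p: [] for p in range(n_pools)}
    pool_kmers = {p: set() for p in range(n_pools)}
    # Place longer BACs first so the most constrained choices are made early.
    for name, seq in sorted(bacs.items(), key=lambda kv: -len(kv[1])):
        kmers = kmer_set(seq, k)
        best = min(range(n_pools),
                   key=lambda p: (len(kmers & pool_kmers[p]), len(pools[p])))
        pools[best].append(name)
        pool_kmers[best] |= kmers
    return pools

bacs = {
    "bac1": "ACGTACGTGGCCTTAAGGCCAACCGGTT",
    "bac2": "ACGTACGTGGCCTTAAGGCCAACCGGAA",   # nearly identical to bac1
    "bac3": "TTTTCCCCGGGGAAAATTTTCCCCGGGG",
}
print(greedy_pooling(bacs, n_pools=2))   # the two overlapping BACs land in different pools
```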
Affiliation(s)
- Iman Hajirasouliha
- Lab for Computational Biology, Simon Fraser University, Burnaby, BC, Canada
82
Simple tools for assembling and searching high-density picolitre pyrophosphate sequence data. Source Code Biol Med 2008; 3:5. [PMID: 18423012] [PMCID: PMC2374781] [DOI: 10.1186/1751-0473-3-5]
Abstract
Background The advent of pyrophosphate sequencing makes large volumes of sequencing data available at a lower cost than previously possible. However, the short read lengths are difficult to assemble and the large dataset is difficult to handle. During the sequencing of a virus from the tsetse fly, Glossina pallidipes, we found the need for tools to quickly search a set of reads for near-exact text matches. Methods A set of tools is provided to search a large data set of pyrophosphate sequence reads under a "live" CD version of Linux on a standard PC that can be used by anyone without prior knowledge of Linux and without having to install a Linux setup on the computer. The tools permit short lengths of de novo assembly, checking of existing assembled sequences, selection and display of reads from the data set, and gathering counts of sequences in the reads. Results Demonstrations are given of the use of the tools to help with checking an assembly against the fragment data set; investigating homopolymer lengths, repeat regions and polymorphisms; and resolving inserted bases caused by incomplete chain extension. Conclusion The additional information contained in a pyrophosphate sequencing data set beyond a basic assembly is difficult to access due to a lack of tools. The set of simple tools presented here would allow anyone with basic computer skills and a standard PC to access this information.
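The kind of near-exact read search described here is simple to express in Python, if far slower than compiled tools; the sketch below is a hypothetical stand-in for the authors' command-line utilities and reports every read containing a query with at most one mismatch.

```python
def matches_with_mismatches(read, query, max_mismatches=1):
    """True if query occurs in read with at most max_mismatches substitutions."""
    q = len(query)
    for i in range(len(read) - q + 1):
        mismatches = sum(a != b for a, b in zip(read[i:i + q], query))
        if mismatches <= max_mismatches:
            return True
    return False

reads = ["TTGACCATGGAC", "TTGACCTTGGAC", "GGGGGGGGGGGG"]
query = "ACCATGG"
print([r for r in reads if matches_with_mismatches(r, query)])
# the first read matches exactly, the second with one mismatch
```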
83
Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008; 18:821-9. [PMID: 18349386] [DOI: 10.1101/gr.074492.107]
Abstract
We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-end information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
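Because contig N50 is the headline statistic quoted here and in several other entries of this list, a worked definition may help: N50 is the length L such that contigs of length at least L together cover at least half of the total assembly. A minimal calculation:

```python
def n50(contig_lengths):
    """Smallest length L such that contigs of length >= L cover half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy assembly totalling 100 kb; the 30 kb and 25 kb contigs reach the 50 kb mark.
print(n50([30_000, 25_000, 20_000, 15_000, 10_000]))   # -> 25000
```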
Affiliation(s)
- Daniel R Zerbino
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
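As a concrete illustration of the de Bruijn representation described above: nodes are (k-1)-mers, each read contributes edges between consecutive (k-1)-mers, and contigs can be read off along unbranched paths. The sketch below, on hypothetical error-free reads, shows only this core idea; Velvet's actual error correction, graph simplification, and paired-end resolution are not reproduced here.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Build the de Bruijn graph: (k-1)-mer -> set of successor (k-1)-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def unbranched_contigs(graph):
    """Walk from every branching or source node along unambiguous edges
    (one way in, one way out) and emit the spelled-out contigs."""
    indeg = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1
    contigs = []
    for start in list(graph):
        if indeg[start] == 1 and len(graph[start]) == 1:
            continue                      # interior node of some unbranched path
        for nxt in graph[start]:
            contig, node, seen = start, nxt, set()
            while node not in seen:       # guard against cycles
                seen.add(node)
                contig += node[-1]
                if indeg[node] == 1 and len(graph.get(node, ())) == 1:
                    node = next(iter(graph[node]))
                else:
                    break
            contigs.append(contig)
    return contigs

if __name__ == "__main__":
    reads = ["ACGTACG", "GTACGTT", "CGTTAGC"]      # hypothetical error-free reads
    print(unbranched_contigs(de_bruijn_edges(reads, k=4)))
```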
84
Pop M, Salzberg SL. Bioinformatics challenges of new sequencing technology. Trends Genet 2008; 24:142-9. [DOI: 10.1016/j.tig.2007.12.006] [Citation(s) in RCA: 234] [Impact Index Per Article: 14.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2007] [Revised: 12/18/2007] [Accepted: 12/19/2007] [Indexed: 12/24/2022]
85
Fredlake CP, Hert DG, Kan CW, Chiesl TN, Root BE, Forster RE, Barron AE. Ultrafast DNA sequencing on a microchip by a hybrid separation mechanism that gives 600 bases in 6.5 minutes. Proc Natl Acad Sci U S A 2008; 105:476-81. [PMID: 18184818 PMCID: PMC2206561 DOI: 10.1073/pnas.0705093105] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2007] [Indexed: 01/17/2023] Open
Abstract
To realize the immense potential of large-scale genomic sequencing after the completion of the second human genome (Venter's), the costs for the complete sequencing of additional genomes must be dramatically reduced. Among the technologies being developed to reduce sequencing costs, microchip electrophoresis is the only new technology ready to produce the long reads most suitable for the de novo sequencing and assembly of large and complex genomes. Compared with the current paradigm of capillary electrophoresis, microchip systems promise to reduce sequencing costs dramatically by increasing throughput, reducing reagent consumption, and integrating the many steps of the sequencing pipeline onto a single platform. Although capillary-based systems require approximately 70 min to deliver approximately 650 bases of contiguous sequence, we report sequencing of up to 600 bases in just 6.5 min by microchip electrophoresis with a unique polymer matrix/adsorbed polymer wall coating combination. This represents a two-thirds reduction in sequencing time over any previously published chip sequencing result, with comparable read length and sequence quality. We hypothesize that these ultrafast long reads on chips can be achieved because the combined polymer system engenders a recently discovered "hybrid" mechanism of DNA electromigration, in which DNA molecules alternate rapidly between reptating through the intact polymer network and disrupting network entanglements to drag polymers through the solution, similar to the dsDNA dynamics we observe in single-molecule DNA imaging studies. Most importantly, these results reveal the surprisingly powerful ability of microchip electrophoresis to provide ultrafast Sanger sequencing, which will translate to increased system throughput and reduced costs.
Affiliation(s)
- Brian E. Root
- Materials Science and Engineering, Northwestern University, Evanston, IL 60208
- Ryan E. Forster
- Materials Science and Engineering, Northwestern University, Evanston, IL 60208
86
Abstract
In the last year, high-throughput sequencing technologies have progressed from proof-of-concept to production quality. While these methods produce high-quality reads, they have yet to produce reads comparable in length to Sanger-based sequencing. Current fragment assembly algorithms have been implemented and optimized for mate-paired Sanger-based reads and thus do not perform well on the short reads produced by these new technologies. We present a new Eulerian assembler that generates nearly optimal short read assemblies of bacterial genomes, and describe an approach to assembling reads under the popular hybrid protocol, in which short reads and long Sanger-based reads are combined.
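The "Eulerian" framing above means that, once reads are encoded as edges of a de Bruijn-style graph, assembly reduces to finding a path that uses every edge exactly once. The fragment below is only a toy Hierholzer's-algorithm walk over a small, hypothetical edge list; the paper's assembler handles errors, repeats, and hybrid read mixtures, none of which is modeled here.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm: visit every directed edge exactly once,
    assuming an Eulerian path exists in the given multigraph."""
    graph = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
    start = edges[0][0]
    for node in list(graph):              # prefer the node with a surplus out-edge
        if out_deg[node] - in_deg[node] == 1:
            start = node
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

if __name__ == "__main__":
    # Hypothetical de Bruijn edges (prefix, suffix of each 3-mer) spelling ACGTACGA.
    edges = [("AC", "CG"), ("CG", "GT"), ("GT", "TA"), ("TA", "AC"),
             ("AC", "CG"), ("CG", "GA")]
    nodes = eulerian_path(edges)
    print(nodes[0] + "".join(n[-1] for n in nodes[1:]))    # -> ACGTACGA
```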
87
Sundquist A, Ronaghi M, Tang H, Pevzner P, Batzoglou S. Whole-genome sequencing and assembly with high-throughput, short-read technologies. PLoS One 2007; 2:e484. [PMID: 17534434 PMCID: PMC1871613 DOI: 10.1371/journal.pone.0000484] [Citation(s) in RCA: 94] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2007] [Accepted: 05/05/2007] [Indexed: 11/18/2022] Open
Abstract
While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes are feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology. [See the sketch after this entry.]
Affiliation(s)
- Andreas Sundquist
- Department of Computer Science, Stanford University, Stanford, California, United States of America.
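The decisive step in the protocol described above is ordering clones before local assembly. A very rough way to picture this is to chain clones greedily by how much sequence content neighbours share. The sketch below uses hypothetical clone names and k-mer sets and a simple nearest-neighbour chain; it stands in for, but does not implement, the clone-ordering method of the paper.

```python
def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def greedy_clone_order(clone_kmers):
    """Chain clones so that adjacent clones in the ordering share many k-mers."""
    names = sorted(clone_kmers)
    order, remaining = [names[0]], set(names[1:])
    while remaining:
        last = order[-1]
        nxt = max(remaining,
                  key=lambda c: jaccard(clone_kmers[last], clone_kmers[c]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

if __name__ == "__main__":
    clones = {                                 # hypothetical clone -> k-mer content
        "cloneA": {"AAAA", "AAAC", "AACG"},
        "cloneB": {"AACG", "ACGT", "CGTT"},    # overlaps cloneA
        "cloneC": {"CGTT", "GTTG", "TTGG"},    # overlaps cloneB
        "cloneD": {"GGGG", "GGGT"},            # unrelated clone
    }
    print(greedy_clone_order(clones))   # -> ['cloneA', 'cloneB', 'cloneC', 'cloneD']
```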
88
Abstract
Summary: Novel DNA sequencing technologies with the potential for up to three orders of magnitude more sequence throughput than conventional Sanger sequencing are emerging. The instrument now available from Solexa Ltd produces millions of short DNA sequences of 25 nt each. Due to ubiquitous repeats in large genomes and the inability of short sequences to uniquely and unambiguously characterize them, the short read length limits applicability for de novo sequencing. However, given the sequencing depth and the throughput of this instrument, stringent assembly of highly identical sequences can be achieved. We describe SSAKE, a tool for aggressively assembling millions of short nucleotide sequences by progressively searching through a prefix tree for the longest possible overlap between any two sequences. SSAKE is designed to help leverage the information from short sequence reads by stringently assembling them into contiguous sequences that can be used to characterize novel sequencing targets. Availability: Contact: rwarren@bcgsc.ca [See the sketch after this entry.]
Affiliation(s)
- René L Warren
- British Columbia Cancer Agency, Genome Sciences Centre, 675 West 10th Avenue, Vancouver, BC V5Z 1L3, Canada.
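The overlap-extension idea in the SSAKE abstract can be boiled down to: grow a contig by repeatedly finding an unused read whose prefix matches the longest possible suffix of the contig. The snippet below is a bare-bones, hypothetical-data illustration of that greedy step only; SSAKE's prefix tree, reverse-complement handling, and coverage-aware tie-breaking are not reproduced.

```python
def greedy_extend(seed, reads, min_overlap=4):
    """Extend `seed` rightwards by repeatedly appending the unused read whose
    prefix has the longest exact match with the current contig's suffix."""
    contig, unused = seed, set(reads)
    unused.discard(seed)
    while True:
        best_read, best_olap = None, 0
        for read in unused:
            limit = min(len(contig), len(read) - 1)
            for olap in range(limit, min_overlap - 1, -1):
                if contig.endswith(read[:olap]):
                    if olap > best_olap:
                        best_read, best_olap = read, olap
                    break
        if best_read is None:
            return contig
        contig += best_read[best_olap:]
        unused.discard(best_read)

if __name__ == "__main__":
    reads = ["ACGTACGT", "TACGTTAG", "GTTAGCCA"]   # hypothetical short reads
    print(greedy_extend("ACGTACGT", reads))        # -> ACGTACGTTAGCCA
```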
89
Fredlake CP, Hert DG, Mardis ER, Barron AE. What is the future of electrophoresis in large-scale genomic sequencing? Electrophoresis 2006; 27:3689-702. [PMID: 17031784 DOI: 10.1002/elps.200600408] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
Although a finished human genome reference sequence is now available, the ability to sequence large, complex genomes remains critically important for researchers in the biological sciences, and in particular, continued human genomic sequence determination will ultimately help to realize the promise of medical care tailored to an individual's unique genetic identity. Many new technologies are being developed to decrease the costs and to dramatically increase the data acquisition rate of such sequencing projects. These new sequencing approaches include Sanger reaction-based technologies that have electrophoresis as the final separation step as well as those that use completely novel, nonelectrophoretic methods to generate sequence data. In this review, we discuss the various advances in sequencing technologies and evaluate the current limitations of novel methods that currently preclude their complete acceptance in large-scale sequencing projects. Our primary goal is to analyze and predict the continuing role of electrophoresis in large-scale DNA sequencing, both in the near and longer term.
Affiliation(s)
- Christopher P Fredlake
- Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, USA
90
Abstract
The classical theory of shotgun DNA sequencing accounts for neither the placement dependencies that are a fundamental consequence of the forward-reverse sequencing strategy, nor the edge effect that arises for small to moderate-sized genomic targets. These phenomena are relevant to a number of sequencing scenarios, including large-insert BAC and fosmid clones, filtered genomic libraries, and macro-nuclear chromosomes. Here, we report a model that considers these two effects and provides both the expected value of coverage and its variance. Comparison to methyl-filtered maize data shows significant improvement over classical theory. The model is used to analyze coverage performance over a range of small to moderately sized genomic targets. We find that the read-pairing effect and the edge effect interact in a non-trivial fashion. Shorter reads give superior coverage per unit sequence depth relative to longer ones. In principle, end-sequences can be optimized with respect to template insert length; however, optimal performance is unlikely to be realized in most cases because of inherent size variation in any set of targets. Conversely, single-stranded reads exhibit roughly the same coverage attributes as optimized end-reads. Although linking information is lost, single-stranded data should not pose a significant assembly liability if the target represents predominantly low-copy sequence. We also find that random sequencing should be halted at substantially lower redundancies than those now associated with larger projects. Given the enormous amount of data generated per cycle on pyrosequencing instruments, this observation suggests devising schemes to split each run cycle between two or more projects. This would prevent over-sequencing and would further leverage the pyrosequencing method. [See the sketch after this entry.]
Affiliation(s)
- Michael C Wendl
- Genome Sequencing Center, Washington University, St. Louis, Missouri 63108, USA.
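For orientation, the classical expectation that this model refines is easy to state: with N reads of length L on a target of size G, the per-base depth is c = NL/G and the expected covered fraction is roughly 1 - e^(-c). The snippet below evaluates only that classical, edge-effect-free, unpaired baseline on made-up numbers; it does not implement the paired-read or edge-effect corrections introduced in the paper.

```python
import math

def expected_coverage_fraction(n_reads, read_len, target_len):
    """Classical (Lander-Waterman style) expectation of the fraction of target
    bases covered at least once, ignoring edge effects and read pairing."""
    depth = n_reads * read_len / target_len          # c = N * L / G
    return 1.0 - math.exp(-depth)

if __name__ == "__main__":
    # Hypothetical BAC-sized target of 150 kb sequenced with 700 bp reads.
    for n_reads in (500, 1000, 2000):
        depth = n_reads * 700 / 150_000
        frac = expected_coverage_fraction(n_reads, 700, 150_000)
        print(f"{n_reads:>5} reads -> {depth:4.1f}x depth, "
              f"expected covered fraction {frac:.4f}")
```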
91
Wendl MC. Occupancy modeling of coverage distribution for whole genome shotgun DNA sequencing. Bull Math Biol 2006; 68:179-96. [PMID: 16794926 DOI: 10.1007/s11538-005-9021-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2004] [Accepted: 03/15/2005] [Indexed: 10/24/2022]
Abstract
Expected-value models have long provided a rudimentary theoretical foundation for random DNA sequencing. Here, we are interested in improving characterization of genome coverage in terms of its underlying probability distributions. We find that the mathematical notion of occupancy serves as a good model for evolution of the coverage distribution function and reveals new insights related to sequence redundancy. Established concepts, such as "full shotgun depth," have been assumed invariant, but actually depend on project size and decrease over time. For most microbial projects, the full shotgun milestone should be revised downward by about 30%. Accordingly, many already-completed genomes appear to have been over-sequenced. Results also suggest that read lengths for emerging high-throughput sequencing methods must be increased substantially before they can be considered as possible successors to the standard Sanger method. In particular, gains in throughput and sequence depth cannot be made to compensate for diminished read length. Limits are well approximated by a simple logarithmic equation, which should be useful in estimating maximum coverage-based redundancy for future projects.
Affiliation(s)
- Michael C Wendl
- Genome Sequencing Center, Washington University, 4444 Forest Park Boulevard, Campus Box 8501, St. Louis, MO 63108, USA.
92
Abstract
Demand for DNA sequence information has never been greater, yet current Sanger technology is too costly, time consuming, and labor intensive to meet this ongoing demand. Applications span numerous research interests, including sequence variation studies, comparative genomics and evolution, forensics, and diagnostic and applied therapeutics. Several emerging technologies show promise of delivering next-generation solutions for fast and affordable genome sequencing. In this review article, the DNA polymerase-dependent strategies of Sanger sequencing, single nucleotide addition, and cyclic reversible termination are discussed to highlight recent advances and potential challenges these technologies face in their development for ultrafast DNA sequencing.
Affiliation(s)
- Michael L Metzker
- Human Genome Sequencing Center and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA.
93
Whiteford N, Haslam N, Weber G, Prügel-Bennett A, Essex JW, Roach PL, Bradley M, Neylon C. An analysis of the feasibility of short read sequencing. Nucleic Acids Res 2005; 33:e171. [PMID: 16275781 PMCID: PMC1278949 DOI: 10.1093/nar/gni170] [Citation(s) in RCA: 93] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Several methods for ultra-high-throughput DNA sequencing are currently under investigation. Many of these methods yield very short blocks of sequence information (reads). Here we report an analysis showing the level of genome sequencing possible as a function of read length. It is shown that re-sequencing and de novo sequencing of the majority of a bacterial genome are possible with read lengths of 20–30 nt, and that reads of 50 nt can provide reconstructed contigs (contiguous fragments of sequence data) of 1000 nt and greater that cover 80% of human chromosome 1. [See the sketch after this entry.]
Affiliation(s)
- Adam Prügel-Bennett
- School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
- Mark Bradley
- School of Chemistry, University of Edinburgh, Edinburgh EH9 3JJ, UK
- Cameron Neylon
- To whom correspondence should be addressed. Tel: +44 23 8059 4164; Fax: +44 23 8059 6805;
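One simple way to see why reconstruction quality depends so strongly on read length, in the spirit of the analysis above, is to ask what fraction of start positions in a sequence begin a k-mer that occurs exactly once: only those positions can be placed unambiguously by a k-base read. The snippet below computes this for a small hypothetical sequence; it is not the authors' reconstruction framework.

```python
from collections import Counter

def unique_kmer_fraction(seq, k):
    """Fraction of k-length windows of `seq` whose k-mer occurs exactly once."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    n_windows = len(seq) - k + 1
    unique = sum(1 for i in range(n_windows) if counts[seq[i:i + k]] == 1)
    return unique / n_windows

if __name__ == "__main__":
    # Hypothetical repetitive toy "genome": short windows are ambiguous,
    # longer windows span the repeats and become unique.
    genome = "ACGTACGTAC" * 5 + "GGATCCGGATTT" + "ACGTACGTAC" * 5
    for k in (4, 8, 16, 25, 50):
        print(f"read length {k:>2}: {unique_kmer_fraction(genome, k):.2f} "
              "of positions uniquely identifiable")
```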
94
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2005. [PMCID: PMC2448604 DOI: 10.1002/cfg.419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open