101
|
Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011; 21:2224-41. [PMID: 21926179 DOI: 10.1101/gr.126599.111] [Citation(s) in RCA: 318] [Impact Index Per Article: 24.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) It is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/.
Collapse
Affiliation(s)
- Dent Earl
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
102
|
Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011. [PMID: 21926179 DOI: 10.1101/gr.126599] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]
Abstract
Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) It is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/.
Collapse
Affiliation(s)
- Dent Earl
- Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
103
|
Magoč T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 2011; 27:2957-63. [PMID: 21903629 DOI: 10.1093/bioinformatics/btr507] [Citation(s) in RCA: 8562] [Impact Index Per Article: 658.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open
Abstract
MOTIVATION Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome. RESULTS We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds. AVAILABILITY AND IMPLEMENTATION The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash. CONTACT t.magoc@gmail.com.
Collapse
Affiliation(s)
- Tanja Magoč
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
| | | |
Collapse
|
104
|
Chapman JA, Ho I, Sunkara S, Luo S, Schroth GP, Rokhsar DS. Meraculous: de novo genome assembly with short paired-end reads. PLoS One 2011; 6:e23501. [PMID: 21876754 PMCID: PMC3158087 DOI: 10.1371/journal.pone.0023501] [Citation(s) in RCA: 125] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2010] [Accepted: 07/19/2011] [Indexed: 11/18/2022] Open
Abstract
We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
Collapse
Affiliation(s)
- Jarrod A Chapman
- U.S. Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America.
| | | | | | | | | | | |
Collapse
|
105
|
Metagenomic insights into the evolution, function, and complexity of the planktonic microbial community of Lake Lanier, a temperate freshwater ecosystem. Appl Environ Microbiol 2011; 77:6000-11. [PMID: 21764968 DOI: 10.1128/aem.00107-11] [Citation(s) in RCA: 125] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
Lake Lanier is an important freshwater lake for the southeast United States, as it represents the main source of drinking water for the Atlanta metropolitan area and is popular for recreational activities. Temperate freshwater lakes such as Lake Lanier are underrepresented among the growing number of environmental metagenomic data sets, and little is known about how functional gene content in freshwater communities relates to that of other ecosystems. To better characterize the gene content and variability of this freshwater planktonic microbial community, we sequenced several samples obtained around a strong summer storm event and during the fall water mixing using a random whole-genome shotgun (WGS) approach. Comparative metagenomics revealed that the gene content was relatively stable over time and more related to that of another freshwater lake and the surface ocean than to soil. However, the phylogenetic diversity of Lake Lanier communities was distinct from that of soil and marine communities. We identified several important genomic adaptations that account for these findings, such as the use of potassium (as opposed to sodium) osmoregulators by freshwater organisms and differences in the community average genome size. We show that the lake community is predominantly composed of sequence-discrete populations and describe a simple method to assess community complexity based on population richness and evenness and to determine the sequencing effort required to cover diversity in a sample. This study provides the first comprehensive analysis of the genetic diversity and metabolic potential of a temperate planktonic freshwater community and advances approaches for comparative metagenomics.
Collapse
|
106
|
Pignatelli M, Moya A. Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. PLoS One 2011; 6:e19984. [PMID: 21625384 PMCID: PMC3100316 DOI: 10.1371/journal.pone.0019984] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2010] [Accepted: 04/22/2011] [Indexed: 02/07/2023] Open
Abstract
A frequent step in metagenomic data analysis comprises the assembly of the sequenced reads. Many assembly tools have been published in the last years targeting data coming from next-generation sequencing (NGS) technologies but these assemblers have not been designed for or tested in multi-genome scenarios that characterize metagenomic studies. Here we provide a critical assessment of current de novo short reads assembly tools in multi-genome scenarios using complex simulated metagenomic data. With this approach we tested the fidelity of different assemblers in metagenomic studies demonstrating that even under the simplest compositions the number of chimeric contigs involving different species is noticeable. We further showed that the assembly process reduces the accuracy of the functional classification of the metagenomic data and that these errors can be overcome raising the coverage of the studied metagenome. The results presented here highlight the particular difficulties that de novo genome assemblers face in multi-genome scenarios demonstrating that these difficulties, that often compromise the functional classification of the analyzed data, can be overcome with a high sequencing effort.
Collapse
Affiliation(s)
- Miguel Pignatelli
- Unitat Mixta d'Investigació en Genòmica i Salut, Centre Superior d'Investigació en Salut Pública/UVEG-Institut Cavanilles, Valencia, Spain.
| | | |
Collapse
|
107
|
Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 2011; 12:R18. [PMID: 21338519 PMCID: PMC3188800 DOI: 10.1186/gb-2011-12-2-r18] [Citation(s) in RCA: 757] [Impact Index Per Article: 58.2] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2010] [Revised: 12/23/2010] [Accepted: 02/21/2011] [Indexed: 01/18/2023] Open
Abstract
Despite the ever-increasing output of Illumina sequencing data, loci with extreme base compositions are often under-represented or absent. To evaluate sources of base-composition bias, we traced genomic sequences ranging from 6% to 90% GC through the process by quantitative PCR. We identified PCR during library preparation as a principal source of bias and optimized the conditions. Our improved protocol significantly reduces amplification bias and minimizes the previously severe effects of PCR instrument and temperature ramp rate.
Collapse
Affiliation(s)
- Daniel Aird
- Genome Sequencing and Analysis Program, Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA
| | | | | | | | | | | | | | | | | |
Collapse
|
108
|
Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 2011. [PMID: 21338519 DOI: 10.1186/1465-6906-12-s1-i18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/11/2023] Open
Abstract
Despite the ever-increasing output of Illumina sequencing data, loci with extreme base compositions are often under-represented or absent. To evaluate sources of base-composition bias, we traced genomic sequences ranging from 6% to 90% GC through the process by quantitative PCR. We identified PCR during library preparation as a principal source of bias and optimized the conditions. Our improved protocol significantly reduces amplification bias and minimizes the previously severe effects of PCR instrument and temperature ramp rate.
Collapse
Affiliation(s)
- Daniel Aird
- Genome Sequencing and Analysis Program, Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA
| | | | | | | | | | | | | | | | | |
Collapse
|
109
|
High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A 2010; 108:1513-8. [PMID: 21187386 DOI: 10.1073/pnas.1017351108] [Citation(s) in RCA: 1141] [Impact Index Per Article: 81.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open
Abstract
Massively parallel DNA sequencing technologies are revolutionizing genomics by making it possible to generate billions of relatively short (~100-base) sequence reads at very low cost. Whereas such data can be readily used for a wide range of biomedical applications, it has proven difficult to use them to generate high-quality de novo genome assemblies of large, repeat-rich vertebrate genomes. To date, the genome assemblies generated from such data have fallen far short of those obtained with the older (but much more expensive) capillary-based sequencing approach. Here, we report the development of an algorithm for genome assembly, ALLPATHS-LG, and its application to massively parallel DNA sequence data from the human and mouse genomes, generated on the Illumina platform. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome. In particular, the base accuracy is high (≥99.95%) and the scaffold sizes (N50 size = 11.5 Mb for human and 7.2 Mb for mouse) approach those obtained with capillary-based sequencing. The combination of improved sequencing technology and improved computational methods should now make it possible to increase dramatically the de novo sequencing of large genomes. The ALLPATHS-LG program is available at http://www.broadinstitute.org/science/programs/genome-biology/crd.
Collapse
|
110
|
Ariyaratne PN, Sung WK. PE-Assembler: de novo assembler using short paired-end reads. ACTA ACUST UNITED AC 2010; 27:167-74. [PMID: 21149345 DOI: 10.1093/bioinformatics/btq626] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
Abstract
MOTIVATION Many de novo genome assemblers have been proposed recently. The basis for most existing methods relies on the de bruijn graph: a complex graph structure that attempts to encompass the entire genome. Such graphs can be prohibitively large, may fail to capture subtle information and is difficult to be parallelized. RESULT We present a method that eschews the traditional graph-based approach in favor of a simple 3' extension approach that has potential to be massively parallelized. Our results show that it is able to obtain assemblies that are more contiguous, complete and less error prone compared with existing methods. AVAILABILITY The software package can be found at http://www.comp.nus.edu.sg/~bioinfo/peasm/. Alternatively it is available from authors upon request.
Collapse
|
111
|
Crawford JE, Guelbeogo WM, Sanou A, Traoré A, Vernick KD, Sagnon N, Lazzaro BP. De novo transcriptome sequencing in Anopheles funestus using Illumina RNA-seq technology. PLoS One 2010; 5:e14202. [PMID: 21151993 PMCID: PMC2996306 DOI: 10.1371/journal.pone.0014202] [Citation(s) in RCA: 116] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2010] [Accepted: 11/10/2010] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Anopheles funestus is one of the primary vectors of human malaria, which causes a million deaths each year in sub-Saharan Africa. Few scientific resources are available to facilitate studies of this mosquito species and relatively little is known about its basic biology and evolution, making development and implementation of novel disease control efforts more difficult. The An. funestus genome has not been sequenced, so in order to facilitate genome-scale experimental biology, we have sequenced the adult female transcriptome of An. funestus from a newly founded colony in Burkina Faso, West Africa, using the Illumina GAIIx next generation sequencing platform. METHODOLOGY/PRINCIPAL FINDINGS We assembled short Illumina reads de novo using a novel approach involving iterative de novo assemblies and "target-based" contig clustering. We then selected a conservative set of 15,527 contigs through comparisons to four Dipteran transcriptomes as well as multiple functional and conserved protein domain databases. Comparison to the Anopheles gambiae immune system identified 339 contigs as putative immune genes, thus identifying a large portion of the immune system that can form the basis for subsequent studies of this important malaria vector. We identified 5,434 1:1 orthologues between An. funestus and An. gambiae and found that among these 1:1 orthologues, the protein sequence of those with putative immune function were significantly more diverged than the transcriptome as a whole. Short read alignments to the contig set revealed almost 367,000 genetic polymorphisms segregating in the An. funestus colony and demonstrated the utility of the assembled transcriptome for use in RNA-seq based measurements of gene expression. CONCLUSIONS/SIGNIFICANCE We developed a pipeline that makes de novo transcriptome sequencing possible in virtually any organism at a very reasonable cost ($6,300 in sequencing costs in our case). We anticipate that our approach could be used to develop genomic resources in a diversity of systems for which full genome sequence is currently unavailable. Our An. funestus contig set and analytical results provide a valuable resource for future studies in this non-model, but epidemiologically critical, vector insect.
Collapse
Affiliation(s)
- Jacob E Crawford
- Department of Entomology, Cornell University, Ithaca, New York, United States of America.
| | | | | | | | | | | | | |
Collapse
|
112
|
Nijkamp J, Winterbach W, van den Broek M, Daran JM, Reinders M, de Ridder D. Integrating genome assemblies with MAIA. Bioinformatics 2010; 26:i433-9. [PMID: 20823304 PMCID: PMC2935414 DOI: 10.1093/bioinformatics/btq366] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION De novo assembly of a eukaryotic genome with next-generation sequencing data is still a challenging task. Over the past few years several assemblers have been developed, often suitable for one specific type of sequencing data. The number of known genomes is expanding rapidly, therefore it becomes possible to use multiple reference genomes for assembly projects. We introduce an assembly integrator that makes use of all available data, i.e. multiple de novo assemblies and mappings against multiple related genomes, by optimizing a weighted combination of criteria. RESULTS The developed algorithm was applied on the de novo sequencing of the Saccharomyces cerevisiae CEN.PK 113-7D strain. Using Solexa and 454 read data, two de novo and three comparative assemblies were constructed and subsequently integrated, yielding 29 contigs, covering more than 12 Mbp; a drastic improvement compared with the single assemblies. AVAILABILITY MAIA is available as a Matlab package and can be downloaded from http://bioinformatics.tudelft.nl.
Collapse
Affiliation(s)
- Jurgen Nijkamp
- Department of Mediamatics, Delft University of Technology, Delft, The Netherlands.
| | | | | | | | | | | |
Collapse
|
113
|
Croucher NJ, Thomson NR. Studying bacterial transcriptomes using RNA-seq. Curr Opin Microbiol 2010; 13:619-24. [PMID: 20888288 PMCID: PMC3025319 DOI: 10.1016/j.mib.2010.09.009] [Citation(s) in RCA: 184] [Impact Index Per Article: 13.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2010] [Revised: 09/09/2010] [Accepted: 09/09/2010] [Indexed: 01/27/2023]
Abstract
Genome-wide studies of bacterial gene expression are shifting from microarray technology to second generation sequencing platforms. RNA-seq has a number of advantages over hybridization-based techniques, such as annotation-independent detection of transcription, improved sensitivity and increased dynamic range. Early studies have uncovered a wealth of novel coding sequences and non-coding RNA, and are revealing a transcriptional landscape that increasingly mirrors that of eukaryotes. Already basic RNA-seq protocols have been improved and adapted to looking at particular aspects of RNA biology, often with an emphasis on non-coding RNAs, and further refinements to current techniques will improve our understanding of gene expression, and genome content, in the future.
Collapse
Affiliation(s)
- Nicholas J Croucher
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, Cambridgeshire, CB10 1SA, UK
| | | |
Collapse
|
114
|
Abstract
Recent studies in human genomes have demonstrated the use of de novo assemblies to identify genetic variations that are difficult for mapping-based approaches. Construction of multiple human genome assemblies is enabled by massively parallel sequencing, but a conventional bioinformatics solution is costly and slow, creating bottle-necks in the process. This review describes two public short-read de novo assembly applications that can handle human genomes, ABySS and SOAPdenovo. It also discusses the technical aspects and future challenges of human genome de novo assembly by short reads.
Collapse
|
115
|
Paszkiewicz K, Studholme DJ. De novo assembly of short sequence reads. Brief Bioinform 2010; 11:457-72. [DOI: 10.1093/bib/bbq020] [Citation(s) in RCA: 134] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
116
|
Surget-Groba Y, Montoya-Burgos JI. Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res 2010; 20:1432-40. [PMID: 20693479 DOI: 10.1101/gr.103846.109] [Citation(s) in RCA: 260] [Impact Index Per Article: 18.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving transcriptome de novo assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal to assemble transcriptomes where the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method relies on the use of a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method that uses mapping against the closest available reference proteome for scaffolding contigs that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. Using the Multiple-k and STM methods, the assembly increases in contiguity and in gene identification, showing that our methods clearly improve quality and can be widely used. The new methods were used to assemble successfully the transcripts of the core set of genes regulating tooth development in vertebrates, while classic de novo assembly failed.
Collapse
Affiliation(s)
- Yann Surget-Groba
- Department of Zoology and Animal Biology, University of Geneva, 1211 Geneva 4, Switzerland
| | | |
Collapse
|
117
|
Read length and repeat resolution: exploring prokaryote genomes using next-generation sequencing technologies. PLoS One 2010; 5:e11518. [PMID: 20634954 PMCID: PMC2902515 DOI: 10.1371/journal.pone.0011518] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2010] [Accepted: 05/31/2010] [Indexed: 11/19/2022] Open
Abstract
Background There are a growing number of next-generation sequencing technologies. At present, the most cost-effective options also produce the shortest reads. However, even for prokaryotes, there is uncertainty concerning the utility of these technologies for the de novo assembly of complete genomes. This reflects an expectation that short reads will be unable to resolve small, but presumably abundant, repeats. Methodology/Principal Findings Using a simple model of repeat assembly, we develop and test a technique that, for any read length, can estimate the occurrence of unresolvable repeats in a genome, and thus predict the number of gaps that would need to be closed to produce a complete sequence. We apply this technique to 818 prokaryote genome sequences. This provides a quantitative assessment of the relative performance of various lengths. Notably, unpaired reads of only 150nt can reconstruct approximately 50% of the analysed genomes with fewer than 96 repeat-induced gaps. Nonetheless, there is considerable variation amongst prokaryotes. Some genomes can be assembled to near contiguity using very short reads while others require much longer reads. Conclusions Given the diversity of prokaryote genomes, a sequencing strategy should be tailored to the organism under study. Our results will provide researchers with a practical resource to guide the selection of the appropriate read length.
Collapse
|
118
|
Kislyuk AO, Katz LS, Agrawal S, Hagen MS, Conley AB, Jayaraman P, Nelakuditi V, Humphrey JC, Sammons SA, Govil D, Mair RD, Tatti KM, Tondella ML, Harcourt BH, Mayer LW, Jordan IK. A computational genomics pipeline for prokaryotic sequencing projects. ACTA ACUST UNITED AC 2010; 26:1819-26. [PMID: 20519285 PMCID: PMC2905547 DOI: 10.1093/bioinformatics/btq284] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
MOTIVATION New sequencing technologies have accelerated research on prokaryotic genomes and have made genome sequencing operations outside major genome sequencing centers routine. However, no off-the-shelf solution exists for the combined assembly, gene prediction, genome annotation and data presentation necessary to interpret sequencing data. The resulting requirement to invest significant resources into custom informatics support for genome sequencing projects remains a major impediment to the accessibility of high-throughput sequence data. RESULTS We present a self-contained, automated high-throughput open source genome sequencing and computational genomics pipeline suitable for prokaryotic sequencing projects. The pipeline has been used at the Georgia Institute of Technology and the Centers for Disease Control and Prevention for the analysis of Neisseria meningitidis and Bordetella bronchiseptica genomes. The pipeline is capable of enhanced or manually assisted reference-based assembly using multiple assemblers and modes; gene predictor combining; and functional annotation of genes and gene products. Because every component of the pipeline is executed on a local machine with no need to access resources over the Internet, the pipeline is suitable for projects of a sensitive nature. Annotation of virulence-related features makes the pipeline particularly useful for projects working with pathogenic prokaryotes. AVAILABILITY AND IMPLEMENTATION The pipeline is licensed under the open-source GNU General Public License and available at the Georgia Tech Neisseria Base (http://nbase.biology.gatech.edu/). The pipeline is implemented with a combination of Perl, Bourne Shell and MySQL and is compatible with Linux and other Unix systems.
Collapse
Affiliation(s)
- Andrey O Kislyuk
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
119
|
Schatz MC, Delcher AL, Salzberg SL. Assembly of large genomes using second-generation sequencing. Genome Res 2010; 20:1165-73. [PMID: 20508146 DOI: 10.1101/gr.101360.109] [Citation(s) in RCA: 280] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
Second-generation sequencing technology can now be used to sequence an entire human genome in a matter of days and at low cost. Sequence read lengths, initially very short, have rapidly increased since the technology first appeared, and we now are seeing a growing number of efforts to sequence large genomes de novo from these short reads. In this Perspective, we describe the issues associated with short-read assembly, the different types of data produced by second-gen sequencers, and the latest assembly algorithms designed for these data. We also review the genomes that have been assembled recently from short reads and make recommendations for sequencing strategies that will yield a high-quality assembly.
Collapse
Affiliation(s)
- Michael C Schatz
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland 20742, USA
| | | | | |
Collapse
|
120
|
van Berkum NL, Lieberman-Aiden E, Williams L, Imakaev M, Gnirke A, Mirny LA, Dekker J, Lander ES. Hi-C: a method to study the three-dimensional architecture of genomes. J Vis Exp 2010:1869. [PMID: 20461051 PMCID: PMC3149993 DOI: 10.3791/1869] [Citation(s) in RCA: 220] [Impact Index Per Article: 15.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023] Open
Abstract
The three-dimensional folding of chromosomes compartmentalizes the genome and and can bring distant functional elements, such as promoters and enhancers, into close spatial proximity 2-6. Deciphering the relationship between chromosome organization and genome activity will aid in understanding genomic processes, like transcription and replication. However, little is known about how chromosomes fold. Microscopy is unable to distinguish large numbers of loci simultaneously or at high resolution. To date, the detection of chromosomal interactions using chromosome conformation capture (3C) and its subsequent adaptations required the choice of a set of target loci, making genome-wide studies impossible 7-10. We developed Hi-C, an extension of 3C that is capable of identifying long range interactions in an unbiased, genome-wide fashion. In Hi-C, cells are fixed with formaldehyde, causing interacting loci to be bound to one another by means of covalent DNA-protein cross-links. When the DNA is subsequently fragmented with a restriction enzyme, these loci remain linked. A biotinylated residue is incorporated as the 5' overhangs are filled in. Next, blunt-end ligation is performed under dilute conditions that favor ligation events between cross-linked DNA fragments. This results in a genome-wide library of ligation products, corresponding to pairs of fragments that were originally in close proximity to each other in the nucleus. Each ligation product is marked with biotin at the site of the junction. The library is sheared, and the junctions are pulled-down with streptavidin beads. The purified junctions can subsequently be analyzed using a high-throughput sequencer, resulting in a catalog of interacting fragments. Direct analysis of the resulting contact matrix reveals numerous features of genomic organization, such as the presence of chromosome territories and the preferential association of small gene-rich chromosomes. Correlation analysis can be applied to the contact matrix, demonstrating that the human genome is segregated into two compartments: a less densely packed compartment containing open, accessible, and active chromatin and a more dense compartment containing closed, inaccessible, and inactive chromatin regions. Finally, ensemble analysis of the contact matrix, coupled with theoretical derivations and computational simulations, revealed that at the megabase scale Hi-C reveals features consistent with a fractal globule conformation.
Collapse
Affiliation(s)
- Nynke L van Berkum
- Program in Gene Function and Expression, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School
| | | | | | | | | | | | | | | |
Collapse
|
121
|
Tsai IJ, Otto TD, Berriman M. Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol 2010; 11:R41. [PMID: 20388197 PMCID: PMC2884544 DOI: 10.1186/gb-2010-11-4-r41] [Citation(s) in RCA: 228] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2010] [Revised: 03/10/2010] [Accepted: 04/13/2010] [Indexed: 12/17/2022] Open
Abstract
IMAGE generates local assemblies, closing gaps in genomes assembled from paired-end next generation sequencing data, often without the need for new data Advances in sequencing technology allow genomes to be sequenced at vastly decreased costs. However, the assembled data frequently are highly fragmented with many gaps. We present a practical approach that uses Illumina sequences to improve draft genome assemblies by aligning sequences against contig ends and performing local assemblies to produce gap-spanning contigs. The continuity of a draft genome can thus be substantially improved, often without the need to generate new data.
Collapse
Affiliation(s)
- Isheng J Tsai
- Parasite Genomics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
| | | | | |
Collapse
|
122
|
Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics 2010; 95:315-27. [PMID: 20211242 DOI: 10.1016/j.ygeno.2010.03.001] [Citation(s) in RCA: 632] [Impact Index Per Article: 45.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2009] [Revised: 02/26/2010] [Accepted: 03/02/2010] [Indexed: 01/08/2023]
Abstract
The emergence of next-generation sequencing platforms led to resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly.
Collapse
Affiliation(s)
- Jason R Miller
- J. Craig Venter Institute, Rockville, MD 20850-3343, USA.
| | | | | |
Collapse
|
123
|
Abstract
As our ability to generate sequencing data continues to increase, data analysis is replacing data generation as the rate-limiting step in genomics studies. Here we provide a guide to genomic data visualization tools that facilitate analysis tasks by enabling researchers to explore, interpret and manipulate their data, and in some cases perform on-the-fly computations. We will discuss graphical methods designed for the analysis of de novo sequencing assemblies and read alignments, genome browsing, and comparative genomics, highlighting the strengths and limitations of these approaches and the challenges ahead.
Collapse
|