7001
|
Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler. Lecture Notes in Computer Science 2010. [DOI: 10.1007/978-3-642-12683-3_28] [Citation(s) in RCA: 159] [Impact Index Per Article: 11.4] [Indexed: 12/04/2022]
|
7002
|
Abstract
Several sequencing technologies have been introduced in recent years that dramatically outperform the traditional Sanger technology in terms of throughput and cost. The data generated by these technologies are characterized by generally shorter read lengths (as low as 35 bp) and different error characteristics than Sanger data. Existing software tools for assembly and analysis of sequencing data are, therefore, ill-suited to handle the new types of data generated. This paper surveys the recent software packages aimed specifically at analyzing new generation sequencing data.
Affiliation(s)
- Niranjan Nagarajan
- Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies and Department of Computer Science, University of Maryland, College Park, MD, USA
|
7003
|
Abstract
Whole genome sequencing provides the most comprehensive collection of an individual's genetic variation. With the falling costs of sequencing technology, we envision a paradigm shift from microarray-based genotyping studies to whole genome sequencing. We review methodologies for whole genome sequencing. There are two approaches for assembling short shotgun sequence reads into longer contiguous genomic sequences. In the de novo assembly approach, sequence reads are compared to each other, and then overlapped to build longer contiguous sequences. The reference-based assembly approach involves mapping each read to a reference genome sequence. We discuss methods for identifying genetic variation (single nucleotide polymorphisms, small indels, and copy number variants) and building haplotypes from genome assemblies, and discuss potential pitfalls. We expect methodologies to evolve rapidly as sequencing technologies improve and more human genomes are sequenced.
Affiliation(s)
- Pauline C Ng
- The J. Craig Venter Institute, Rockville, MD, USA
|
7004
|
Toward next-generation sequencing of mitochondrial genomes – Focus on parasitic worms of animals and biotechnological implications. Biotechnol Adv 2010; 28:151-9. [DOI: 10.1016/j.biotechadv.2009.11.002] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Received: 08/11/2009] [Revised: 10/28/2009] [Accepted: 11/04/2009] [Indexed: 11/21/2022]
|
7005
|
Milos PM. Emergence of single-molecule sequencing and potential for molecular diagnostic applications. Expert Rev Mol Diagn 2009; 9:659-66. [PMID: 19817551 DOI: 10.1586/erm.09.50] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Indexed: 12/13/2022]
Abstract
The effective demonstration of single-molecule sequencing at scale over the last several years offers the exciting opportunity for a new era in the field of molecular diagnostics. As we aim to personalize and deliver cost-effective healthcare, we must consider the need to fully integrate genomics into decision-making. We must be able to accurately and cost effectively obtain a complete genome sequence for disease diagnosis, interrogate a molecular signature from blood for therapeutic monitoring, obtain a tumor mutation profile for optimizing therapeutic choice - each molecular diagnostic measurement utilized to better inform patient care. Would a physician or molecular pathology laboratory want to utilize a PCR process in which millions of DNA copies of a patient's nucleic acid are created when an alternative approach allowing direct measurement of the nucleic acids is possible? I would suggest not! In this article we will focus on the emergence of single-molecule sequencing, the single-molecule sequencing methodologies in the marketplace or under development today, as well as the importance of these methods for molecular characterization and diagnosis of disease with the ultimate application for molecular diagnostics.
Affiliation(s)
- Patrice M Milos
- Helicos BioSciences, 1 Kendall Square, Building 700, Cambridge, MA 02139, USA.
|
7006
|
Zerbino DR, McEwen GK, Margulies EH, Birney E. Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 2009; 4:e8407. [PMID: 20027311 PMCID: PMC2793427 DOI: 10.1371/journal.pone.0008407] [Citation(s) in RCA: 150] [Impact Index Per Article: 10.0] [Received: 09/04/2009] [Accepted: 10/21/2009] [Indexed: 11/22/2022]
Abstract
Background: Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies. Principal Findings: We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly. Conclusions: These algorithms extend the utility of short read only assemblies into large complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler.
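The abstract above reports assembly quality as a weighted median scaffold length (N50). As a rough illustration of that statistic, and not the Velvet implementation, the sketch below uses the common definition: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. The function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the first length at which the running total reaches half of the sum.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


# Hypothetical scaffold lengths in kbp (not taken from the paper).
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)
```

Because the statistic is weighted by length, a few long scaffolds dominate it; the many short scaffolds at the tail of an assembly barely move the value.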
Affiliation(s)
- Daniel R Zerbino
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
|
7007
|
Abstract
Pseudomonas aeruginosa PAO1 is the most commonly used strain for research on this ubiquitous and metabolically versatile opportunistic pathogen. Strain PAO1, a derivative of the original Australian PAO isolate, has been distributed worldwide to laboratories and strain collections. Over decades discordant phenotypes of PAO1 sublines have emerged. Taking the existing PAO1-UW genome sequence (named after the University of Washington, which led the sequencing project) as a blueprint, the genome sequences of reference strains MPAO1 and PAO1-DSM (stored at the German Collection for Microorganisms and Cell Cultures [DSMZ]) were resolved by physical mapping and deep short read sequencing-by-synthesis. MPAO1 has been the source of near-saturation libraries of transposon insertion mutants, and PAO1-DSM is identical in its SpeI-DpnI restriction map with the original isolate. The major genomic differences of MPAO1 and PAO1-DSM in comparison to PAO1-UW are the lack of a large inversion, a duplication of a mobile 12-kb prophage region carrying a distinct integrase and protein phosphatases or kinases, deletions of 3 to 1,006 bp in size, and at least 39 single-nucleotide substitutions, 17 of which affect protein sequences. The PAO1 sublines differed in their ability to cope with nutrient limitation and their virulence in an acute murine airway infection model. Subline PAO1-DSM outnumbered the two other sublines in late stationary growth phase. In conclusion, P. aeruginosa PAO1 shows an ongoing microevolution of genotype and phenotype that jeopardizes the reproducibility of research. High-throughput genome resequencing will resolve more cases and could become a proper quality control for strain collections.
|
7008
|
Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 2009; 20:265-72. [PMID: 20019144 DOI: 10.1101/gr.097261.109] [Citation(s) in RCA: 2099] [Impact Index Per Article: 139.9] [Indexed: 11/25/2022]
Abstract
Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the data are very short read length sequences, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
Affiliation(s)
- Ruiqiang Li
- Beijing Genomics Institute at Shenzhen, Shenzhen 518083, China
|
7009
|
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2009; 38:1767-71. [PMID: 20015970 PMCID: PMC2847217 DOI: 10.1093/nar/gkp1137] [Citation(s) in RCA: 968] [Impact Index Per Article: 64.5] [Indexed: 11/20/2022]
Abstract
FASTQ has emerged as a common file format for sharing sequencing read data combining both the sequence and an associated per base quality score, despite lacking any formal definition to date, and existing in at least three incompatible variants. This article defines the FASTQ format, covering the original Sanger standard, the Solexa/Illumina variants and conversion between them, based on publicly available information such as the MAQ documentation and conventions recently agreed by the Open Bioinformatics Foundation projects Biopython, BioPerl, BioRuby, BioJava and EMBOSS. Being an open access publication, it is hoped that this description, with the example files provided as Supplementary Data, will serve in future as a reference for this important file format.
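The abstract mentions incompatible FASTQ variants and conversion between them without giving details. The sketch below fills in those details from general knowledge of the format rather than from the abstract: the Sanger variant stores PHRED scores at ASCII offset 33, the original Solexa variant stores Solexa-scaled scores at ASCII offset 64, and the two scales are related by Q_PHRED = 10 log10(10^(Q_Solexa/10) + 1). All function names are hypothetical, and the later Illumina variant (PHRED scores at offset 64) is not handled.

```python
import math

SANGER_OFFSET = 33  # fastq-sanger: PHRED scores, ASCII offset 33
SOLEXA_OFFSET = 64  # fastq-solexa: Solexa scores, ASCII offset 64


def decode_sanger(quality_string):
    """Decode a Sanger FASTQ quality string into integer PHRED scores."""
    return [ord(ch) - SANGER_OFFSET for ch in quality_string]


def solexa_to_phred(q_solexa):
    """Map a Solexa-scaled score onto the PHRED scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def solexa_string_to_sanger(quality_string):
    """Re-encode a Solexa quality string as a Sanger quality string."""
    scores = [ord(ch) - SOLEXA_OFFSET for ch in quality_string]
    return "".join(
        chr(round(solexa_to_phred(q)) + SANGER_OFFSET) for q in scores
    )


def phred_to_error_probability(q_phred):
    """Convert a PHRED score to the estimated base-call error probability."""
    return 10 ** (-q_phred / 10)
```

The two scales agree closely at high quality and diverge at low quality, which is why a conversion step is needed instead of simply shifting the ASCII offset.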
|
7010
|
|
7011
|
Parks M, Cronn R, Liston A. Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC Biol 2009; 7:84. [PMID: 19954512 PMCID: PMC2793254 DOI: 10.1186/1741-7007-7-84] [Citation(s) in RCA: 358] [Impact Index Per Article: 23.9] [Received: 11/12/2009] [Accepted: 12/02/2009] [Indexed: 11/12/2022]
Abstract
BACKGROUND: Molecular evolutionary studies share the common goal of elucidating historical relationships, and the common challenge of adequately sampling taxa and characters. Particularly at low taxonomic levels, recent divergence, rapid radiations, and conservative genome evolution yield limited sequence variation, and dense taxon sampling is often desirable. Recent advances in massively parallel sequencing make it possible to rapidly obtain large amounts of sequence data, and multiplexing makes extensive sampling of megabase sequences feasible. Is it possible to efficiently apply massively parallel sequencing to increase phylogenetic resolution at low taxonomic levels? RESULTS: We reconstruct the infrageneric phylogeny of Pinus from 37 nearly-complete chloroplast genomes (average 109 kilobases each of an approximately 120 kilobase genome) generated using multiplexed massively parallel sequencing. 30/33 ingroup nodes resolved with ≥95% bootstrap support; this is a substantial improvement relative to prior studies, and shows massively parallel sequencing-based strategies can produce sufficient high quality sequence to reach support levels originally proposed for the phylogenetic bootstrap. Resampling simulations show that at least the entire plastome is necessary to fully resolve Pinus, particularly in rapidly radiating clades. Meta-analysis of 99 published infrageneric phylogenies shows that whole plastome analysis should provide similar gains across a range of plant genera. A disproportionate amount of phylogenetic information resides in two loci (ycf1, ycf2), highlighting their unusual evolutionary properties. CONCLUSION: Plastome sequencing is now an efficient option for increasing phylogenetic resolution at lower taxonomic levels in plant phylogenetic and population genetic analyses. With continuing improvements in sequencing capacity, the strategies herein should revolutionize efforts requiring dense taxon and character sampling, such as phylogeographic analyses and species-level DNA barcoding.
Affiliation(s)
- Matthew Parks
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, 97331, USA
- Richard Cronn
- Pacific Northwest Research Station, USDA Forest Service, Corvallis, OR, 97331, USA
- Aaron Liston
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, 97331, USA
|
7012
|
Sánchez CC, Smith TPL, Wiedmann RT, Vallejo RL, Salem M, Yao J, Rexroad CE. Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library. BMC Genomics 2009; 10:559. [PMID: 19939274 PMCID: PMC2790473 DOI: 10.1186/1471-2164-10-559] [Citation(s) in RCA: 94] [Impact Index Per Article: 6.3] [Received: 07/28/2009] [Accepted: 11/25/2009] [Indexed: 11/21/2022]
Abstract
Background: To enhance capabilities for genomic analyses in rainbow trout, such as genomic selection, a large suite of polymorphic markers that are amenable to high-throughput genotyping protocols must be identified. Expressed Sequence Tags (ESTs) have been used for single nucleotide polymorphism (SNP) discovery in salmonids. In those strategies, the salmonid semi-tetraploid genomes often led to assemblies of paralogous sequences and therefore resulted in a high rate of false positive SNP identification. Sequencing genomic DNA using primers identified from ESTs proved to be an effective but time-consuming methodology of SNP identification in rainbow trout, and is therefore not suitable for high-throughput SNP discovery. In this study, we employed a high-throughput strategy that used pyrosequencing technology to generate data from a reduced representation library constructed with genomic DNA pooled from 96 unrelated rainbow trout that represent the National Center for Cool and Cold Water Aquaculture (NCCCWA) broodstock population. Results: The reduced representation library consisted of 440 bp fragments resulting from complete digestion with the restriction enzyme HaeIII; sequencing produced 2,000,000 reads providing an average 6-fold coverage of the estimated 150,000 unique genomic restriction fragments (300,000 fragment ends). Three independent data analyses identified 22,022 to 47,128 putative SNPs on 13,140 to 24,627 independent contigs. A set of 384 putative SNPs, randomly selected from the sets produced by the three analyses, were genotyped on individual fish to determine the validation rate of putative SNPs among analyses, distinguish apparent SNPs that actually represent paralogous loci in the tetraploid genome, examine Mendelian segregation, and place the validated SNPs on the rainbow trout linkage map. Approximately 48% (183) of the putative SNPs were validated; 167 markers were successfully incorporated into the rainbow trout linkage map. In addition, 2% of the sequences from the validated markers were associated with rainbow trout transcripts. Conclusion: The use of reduced representation libraries and pyrosequencing technology proved to be an effective strategy for the discovery of a high number of putative SNPs in rainbow trout; however, modifications to the technique to decrease the false discovery rate resulting from the evolutionarily recent genome duplication would be desirable.
|
7013
|
Imelfort M, Edwards D. De novo sequencing of plant genomes using second-generation technologies. Brief Bioinform 2009; 10:609-18. [DOI: 10.1093/bib/bbp039] [Citation(s) in RCA: 84] [Impact Index Per Article: 5.6] [Indexed: 12/16/2022]
|
7014
|
Bickel PJ, Brown JB, Huang H, Li Q. An overview of recent developments in genomics and associated statistical methods. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2009; 367:4313-37. [PMID: 19805447 DOI: 10.1098/rsta.2009.0164] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 05/21/2023]
Abstract
The landscape of genomics has changed drastically in the last two decades. Increasingly inexpensive sequencing has shifted the primary focus from the acquisition of biological sequences to the study of biological function. Assays have been developed to study many intricacies of biological systems, and publicly available databases have given rise to integrative analyses that combine information from many sources to draw complex conclusions. Such research was the focus of the recent workshop at the Isaac Newton Institute, 'High dimensional statistics in biology'. Many computational methods from modern genomics and related disciplines were presented and discussed. Using, as much as possible, the material from these talks, we give an overview of modern genomics: from the essential assays that make data-generation possible, to the statistical methods that yield meaningful inference. We point to current analytical challenges, where novel methods, or novel applications of extant methods, are presently needed.
Affiliation(s)
- Peter J Bickel
- Department of Statistics, University of California, Berkeley, CA, USA
|
7015
|
Abstract
Genome-wide measurements of protein-DNA interactions and transcriptomes are increasingly done by deep DNA sequencing methods (ChIP-seq and RNA-seq). The power and richness of these counting-based measurements comes at the cost of routinely handling tens to hundreds of millions of reads. Whereas early adopters necessarily developed their own custom computer code to analyze the first ChIP-seq and RNA-seq datasets, a new generation of more sophisticated algorithms and software tools are emerging to assist in the analysis phase of these projects. Here we describe the multilayered analyses of ChIP-seq and RNA-seq datasets, discuss the software packages currently available to perform tasks at each layer and describe some upcoming challenges and features for future analysis tools. We also discuss how software choices and uses are affected by specific aspects of the underlying biology and data structure, including genome size, positional clustering of transcription factor binding sites, transcript discovery and expression quantification.
Affiliation(s)
- Shirley Pepke
- Center for Advanced Computing Research, California Institute of Technology, Pasadena, California, USA
|
7016
|
Leinonen R, Akhtar R, Birney E, Bonfield J, Bower L, Corbett M, Cheng Y, Demiralp F, Faruque N, Goodgame N, Gibson R, Hoad G, Hunter C, Jang M, Leonard S, Lin Q, Lopez R, Maguire M, McWilliam H, Plaister S, Radhakrishnan R, Sobhany S, Slater G, Ten Hoopen P, Valentin F, Vaughan R, Zalunin V, Zerbino D, Cochrane G. Improvements to services at the European Nucleotide Archive. Nucleic Acids Res 2009; 38:D39-45. [PMID: 19906712 PMCID: PMC2808951 DOI: 10.1093/nar/gkp998] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Indexed: 12/29/2022]
Abstract
The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is Europe’s primary nucleotide sequence archival resource, safeguarding open nucleotide data access, engaging in worldwide collaborative data exchange and integrating with the scientific publication process. ENA has made significant contributions to the collaborative nucleotide archival arena as an active proponent of extending the traditional collaboration to cover capillary and next-generation sequencing information. We have continued to co-develop data and metadata representation formats with our collaborators for both data exchange and public data dissemination. In addition to the DDBJ/EMBL/GenBank feature table format, we share metadata formats for capillary and next-generation sequencing traces and are using and contributing to the NCBI SRA Toolkit for the long-term storage of the next-generation sequence traces. During the course of 2009, ENA has significantly improved sequence submission, search and access functionalities provided at EMBL–EBI. In this article, we briefly describe the content and scope of our archive and introduce major improvements to our services.
Affiliation(s)
- Rasko Leinonen
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
|
7017
|
Abstract
Whole genome shotgun assembly is the process of taking many short sequenced segments (reads) and reconstructing the genome from which they originated. We demonstrate how the technique of bidirected network flow can be used to explicitly model the double-stranded nature of DNA for genome assembly. By combining an algorithm for the Chinese Postman Problem on bidirected graphs with the construction of a bidirected de Bruijn graph, we are able to find the shortest double-stranded DNA sequence that contains a given set of k-long DNA molecules. This is the first exact polynomial time algorithm for the assembly of a double-stranded genome. Furthermore, we propose a maximum likelihood framework for assembling the genome that is the most likely source of the reads, in lieu of the standard maximum parsimony approach (which finds the shortest genome subject to some constraints). In this setting, we give a bidirected network flow-based algorithm that, by taking advantage of high coverage, accurately estimates the copy counts of repeats in a genome. Our second algorithm combines these predicted copy counts with matepair data in order to assemble the reads into contigs. We run our algorithms on simulated read data from Escherichia coli and predict copy counts with extremely high accuracy, while assembling long contigs.
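The abstract relies on modelling both DNA strands explicitly. The sketch below does not implement the bidirected network flow or Chinese Postman algorithms described there; it only illustrates, under simplifying assumptions, the underlying idea that a k-mer and its reverse complement represent the same double-stranded molecule. It counts k-mers under a canonical (strand-independent) form and builds an ordinary de Bruijn graph over both strands. All names are hypothetical, and reads are assumed to contain only A, C, G and T.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one strand-independent representative of a k-mer."""
    return min(kmer, reverse_complement(kmer))


def kmers(read, k):
    """List every k-long substring of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def double_stranded_kmer_counts(reads, k):
    """Count k-mers so that both strands contribute to the same entry."""
    counts = defaultdict(int)
    for read in reads:
        for kmer in kmers(read, k):
            counts[canonical(kmer)] += 1
    return dict(counts)


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph over the k-mers of both strands.

    Each k-mer is an edge from its (k-1)-long prefix to its (k-1)-long
    suffix; the reverse complement of every k-mer is added as well.
    """
    both_strands = set()
    for read in reads:
        for kmer in kmers(read, k):
            both_strands.add(kmer)
            both_strands.add(reverse_complement(kmer))
    graph = defaultdict(list)
    for kmer in sorted(both_strands):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

With reads `"AACG"` and `"CGTT"`, which are reverse complements of each other, the canonical counts treat both reads as evidence for the same two 3-mers. A bidirected graph, as used in the abstract, merges the two strands into single nodes rather than storing them twice as this sketch does.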
Affiliation(s)
- Paul Medvedev
- Department of Computer Science, University of Toronto, Toronto, Canada
|
7018
|
Nielsen CB, Jackman SD, Birol I, Jones SJM. ABySS-Explorer: visualizing genome sequence assemblies. IEEE Transactions on Visualization and Computer Graphics 2009; 15:881-8. [PMID: 19834150 DOI: 10.1109/tvcg.2009.116] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 05/07/2023]
Abstract
One bottleneck in large-scale genome sequencing projects is reconstructing the full genome sequence from the short subsequences produced by current technologies. The final stages of the genome assembly process inevitably require manual inspection of data inconsistencies and could be greatly aided by visualization. This paper presents our design decisions in translating key data features identified through discussions with analysts into a concise visual encoding. Current visualization tools in this domain focus on local sequence errors making high-level inspection of the assembly difficult if not impossible. We present a novel interactive graph display, ABySS-Explorer, that emphasizes the global assembly structure while also integrating salient data features such as sequence length. Our tool replaces manual and in some cases pen-and-paper based analysis tasks, and we discuss how user feedback was incorporated into iterative design refinements. Finally, we touch on applications of this representation not initially considered in our design phase, suggesting the generality of this encoding for DNA sequence data.
|
7019
|
Marguerat S, Bähler J. RNA-seq: from technology to biology. Cellular and Molecular Life Sciences: CMLS 2009. [PMID: 19859660 DOI: 10.1007/s00018-009-0180-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/22/2023]
Abstract
Next-generation sequencing technologies are now being exploited not only to analyse static genomes, but also dynamic transcriptomes in an approach termed RNA-seq. Although these powerful and rapidly evolving technologies have only been available for a couple of years, they are already making substantial contributions to our understanding of genome expression and regulation. Here, we briefly describe technical issues accompanying RNA-seq data generation and analysis, highlighting differences to array-based approaches. We then review recent biological insight gained from applying RNA-seq and related approaches to deeply sample transcriptomes in different cell types or physiological conditions. These approaches are providing fascinating information about transcriptional and post-transcriptional gene regulation, and they are also giving unique insight into the richness of transcript structures and processing on a global scale and at unprecedented resolution.
Affiliation(s)
- Samuel Marguerat
- Department of Genetics, Evolution and Environment, UCL Cancer Institute, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK
|
7020
|
Horner DS, Pavesi G, Castrignano T, De Meo PD, Liuni S, Sammeth M, Picardi E, Pesole G. Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinform 2009; 11:181-97. [DOI: 10.1093/bib/bbp046] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.4] [Indexed: 12/13/2022]
|
7021
|
Arner E, Hayashizaki Y, Daub CO. NGSView: an extensible open source editor for next-generation sequencing data. Bioinformatics 2009; 26:125-6. [PMID: 19855106 PMCID: PMC2796816 DOI: 10.1093/bioinformatics/btp611] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/15/2022]
Abstract
Summary: High-throughput sequencing technologies introduce novel demands on tools available for data analysis. We have developed NGSView (Next Generation Sequence View), a generally applicable, flexible and extensible next-generation sequence alignment editor. The software allows for visualization and manipulation of millions of sequences simultaneously on a desktop computer, through a graphical interface. NGSView is available under an open source license and can be extended through a well documented API. Availability: http://ngsview.sourceforge.net Contact: arner@gsc.riken.jp
Affiliation(s)
- Erik Arner
- RIKEN Omics Science Center, RIKEN Yokohama Institute 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan.
|
7022
|
Kerstens HHD, Crooijmans RPMA, Veenendaal A, Dibbits BW, Chin-A-Woeng TFC, den Dunnen JT, Groenen MAM. Large scale single nucleotide polymorphism discovery in unsequenced genomes using second generation high throughput sequencing technology: applied to turkey. BMC Genomics 2009; 10:479. [PMID: 19835600 PMCID: PMC2772860 DOI: 10.1186/1471-2164-10-479] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.3] [Received: 05/15/2009] [Accepted: 10/16/2009] [Indexed: 01/18/2023]
Abstract
BACKGROUND: The development of second generation sequencing methods has enabled large scale DNA variation studies at moderate cost. For the high throughput discovery of single nucleotide polymorphisms (SNPs) in species lacking a sequenced reference genome, we set up an analysis pipeline based on a short read de novo sequence assembler and a program designed to identify variation within short reads. To illustrate the potential of this technique, we present the results obtained with a randomly sheared, enzymatically generated, 2-3 kbp genome fraction of six pooled Meleagris gallopavo (turkey) individuals. RESULTS: A total of 100 million 36 bp reads were generated, representing approximately 5-6% (approximately 62 Mbp) of the turkey genome, with an estimated sequence depth of 58. Reads consisting of bases called with less than 1% error probability were selected and assembled into contigs. Subsequently, high throughput discovery of nucleotide variation was performed using sequences with more than 90% reliability by using the assembled contigs that were 50 bp or longer as the reference sequence. We identified more than 7,500 SNPs with a high probability of representing true nucleotide variation in turkeys. Increasing the reference genome by adding publicly available turkey BAC-end sequences increased the number of SNPs to over 11,000. A comparison with the sequenced chicken genome indicated that the assembled turkey contigs were distributed uniformly across the turkey genome. Genotyping of a representative sample of 340 SNPs resulted in a SNP conversion rate of 95%. The correlation of the minor allele count (MAC) and observed minor allele frequency (MAF) for the validated SNPs was 0.69. CONCLUSION: We provide an efficient and cost-effective approach for the identification of thousands of high quality SNPs in species currently lacking a sequenced genome and applied this to turkey. The methodology addresses a random fraction of the genome, resulting in an even distribution of SNPs across the targeted genome.
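The depth of 58 quoted in the abstract is consistent with the usual back-of-the-envelope estimate, number of reads times read length divided by target size: 100 million reads of 36 bp over about 62 Mbp. The sketch below reproduces that arithmetic and shows one way a "less than 1% error probability" filter could be expressed on the PHRED scale. The abstract does not say how either step was actually computed, so the formulas, function names and example scores are assumptions.

```python
import math


def sequencing_depth(n_reads, read_length_bp, target_size_bp):
    """Estimate average depth as total sequenced bases over target size."""
    return n_reads * read_length_bp / target_size_bp


def min_phred_for_error(max_error_probability):
    """PHRED score at which the error probability equals the given limit."""
    return -10 * math.log10(max_error_probability)


def read_passes(phred_scores, max_error_probability=0.01):
    """Keep a read only if every base is below the error limit.

    Assumes the scores are already decoded PHRED values; an error
    probability below 1% corresponds to a score above 20.
    """
    threshold = min_phred_for_error(max_error_probability)
    return all(q > threshold for q in phred_scores)


# Figures from the abstract: 100 million 36 bp reads, about 62 Mbp target.
estimated_depth = sequencing_depth(100_000_000, 36, 62_000_000)
```

Requiring every base in a read to pass is a strict filter; a pipeline could instead trim low-quality ends, which is a different design choice that this sketch does not cover.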
Affiliation(s)
- Hindrik H D Kerstens
- Animal Breeding and Genomics Center, Wageningen University, Marijkeweg 40, Wageningen, 6709 PG, the Netherlands.
|
7023
|
|
7024
|
Sudbery I, Stalker J, Simpson JT, Keane T, Rust AG, Hurles ME, Walter K, Lynch D, Teboul L, Brown SD, Li H, Ning Z, Nadeau JH, Croniger CM, Durbin R, Adams DJ. Deep short-read sequencing of chromosome 17 from the mouse strains A/J and CAST/Ei identifies significant germline variation and candidate genes that regulate liver triglyceride levels. Genome Biol 2009; 10:R112. [PMID: 19825173 PMCID: PMC2784327 DOI: 10.1186/gb-2009-10-10-r112] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Received: 07/05/2009] [Revised: 08/26/2009] [Accepted: 10/13/2009] [Indexed: 11/10/2022]
Abstract
Genome sequences are essential tools for comparative and mutational analyses. Here we present the short read sequence of mouse chromosome 17 from the Mus musculus domesticus derived strain A/J, and the Mus musculus castaneus derived strain CAST/Ei. We describe approaches for the accurate identification of nucleotide and structural variation in the genomes of vertebrate experimental organisms, and show how these techniques can be applied to help prioritize candidate genes within quantitative trait loci.
Affiliation(s)
- Ian Sudbery
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
- Jim Stalker
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
- Jared T Simpson
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
- Thomas Keane
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
- Alistair G Rust
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
- Matthew E Hurles
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
- Klaudia Walter
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
- Dee Lynch
- Mammalian Genetics Unit, MRC-Harwell, Harwell Science and Innovation Campus, Oxfordshire, OX11 0RD, UK
- Lydia Teboul
- Mammalian Genetics Unit, MRC-Harwell, Harwell Science and Innovation Campus, Oxfordshire, OX11 0RD, UK
- Steve D Brown
- Mammalian Genetics Unit, MRC-Harwell, Harwell Science and Innovation Campus, Oxfordshire, OX11 0RD, UK
- Heng Li
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
- Zemin Ning
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
- Joseph H Nadeau
- Department of Genetics, Case Western Reserve University, Adelbert Rd, Cleveland, OH 44106-4955, USA
- Colleen M Croniger
- Department of Genetics, Case Western Reserve University, Adelbert Rd, Cleveland, OH 44106-4955, USA
- Richard Durbin
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
- David J Adams
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, UK
|
7025
|
Zhou X, Su Z, Sammons RD, Peng Y, Tranel PJ, Stewart CN, Yuan JS. Novel software package for cross-platform transcriptome analysis (CPTRA). BMC Bioinformatics 2009; 10 Suppl 11:S16. [PMID: 19811681 PMCID: PMC3226187 DOI: 10.1186/1471-2105-10-s11-s16] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open
Abstract
Background Next-generation sequencing techniques enable several novel transcriptome profiling approaches. Recent studies indicated that digital gene expression profiling based on short sequence tags has superior performance as compared to other transcriptome analysis platforms including microarrays. However, the transcriptomic analysis with tag-based methods often depends on available genome sequence. The use of tag-based methods in species without genome sequence should be complemented by other methods such as cDNA library sequencing. The combination of different next generation sequencing techniques like 454 pyrosequencing and Illumina Genome Analyzer (Solexa) will enable high-throughput and accurate global gene expression profiling in species with limited genome information. The combination of transcriptome data acquisition methods requires cross-platform transcriptome data analysis platforms, including a new software package for data processing. Results Here we presented a software package, CPTRA: Cross-Platform TRanscriptome Analysis, to analyze transcriptome profiling data from separate methods. The software package is available at http://people.tamu.edu/~syuan/cptra/cptra.html. It was applied to the case study of non-target site glyphosate resistance in horseweed; and the data was mined to discover resistance target gene(s). For the software, the input data included a long-read sequence dataset with proper annotation, and a short-read sequence tag dataset for the quantification of transcripts. By combining the two datasets, the software carries out the unique sequence tag identification, tag counting for transcript quantification, and cross-platform sequence matching functions, whereby the short sequence tags can be annotated with a function, level of expression, and Gene Ontology (GO) classification. Multiple sequence search algorithms were implemented and compared. 
The analysis highlighted the importance of transport genes in glyphosate resistance and identified several candidate genes for down-stream analysis. Conclusion CPTRA is a powerful software package for next generation sequencing-based transcriptome profiling in species with limited genome information. According to our case study, the strategy can greatly broaden the application of the next generation sequencing for transcriptome analysis in species without reference genome sequence.
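The tag-processing steps named in this abstract (unique tag identification, tag counting, and matching short tags against annotated long reads) can be illustrated with a small Python sketch. This is not CPTRA's code: the function name, the toy sequences, and the use of exact substring matching in place of the sequence search algorithms the authors compared are all assumptions made for illustration.

```python
from collections import Counter


def quantify_tags(tags, annotated_reads):
    """Count each unique tag and list the annotated long reads that contain it.

    tags: iterable of short sequence tags (one entry per sequenced tag).
    annotated_reads: dict mapping an annotation label to a long-read sequence.
    Matching is exact substring search, a stand-in for a real aligner.
    """
    counts = Counter(tags)  # unique tag identification + tag counting
    table = {}
    for tag, n in counts.items():
        hits = [name for name, seq in annotated_reads.items() if tag in seq]
        table[tag] = (n, hits)  # (expression proxy, matched annotations)
    return table


# Hypothetical toy data.
tags = ["ACGT", "ACGT", "GGCC", "TTAA"]
reads = {"transporter_1": "TTACGTGG", "kinase_2": "AAGGCCTT"}
table = quantify_tags(tags, reads)
print(table["ACGT"])  # (2, ['transporter_1'])
print(table["TTAA"])  # (1, [])
```

A tag that matches no long read, such as `TTAA` here, keeps its count but stays unannotated, which is the situation the combined short-read and long-read design is meant to reduce.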
Affiliation(s)
- Xin Zhou
- Institute of Plant Genomics and Biotechnology, Texas A&M University, College Station, TX, USA
|
7026
|
Maccallum I, Przybylski D, Gnerre S, Burton J, Shlyakhter I, Gnirke A, Malek J, McKernan K, Ranade S, Shea TP, Williams L, Young S, Nusbaum C, Jaffe DB. ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads. Genome Biol 2009; 10:R103. [PMID: 19796385 PMCID: PMC2784318 DOI: 10.1186/gb-2009-10-10-r103] [Citation(s) in RCA: 141] [Impact Index Per Article: 9.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2009] [Revised: 08/20/2009] [Accepted: 10/01/2009] [Indexed: 11/10/2022] Open
Abstract
ALLPATHS2 is a method for accurately assembling small genomes with high continuity using short paired reads. We demonstrate that genome sequences approaching finished quality can be generated from short paired reads. Using 36 base (fragment) and 26 base (jumping) reads from five microbial genomes of varied GC composition and sizes up to 40 Mb, ALLPATHS2 generated assemblies with long, accurate contigs and scaffolds. Velvet and EULER-SR were less accurate. For example, for Escherichia coli, the fraction of 10-kb stretches that were perfect was 99.8% (ALLPATHS2), 68.7% (Velvet), and 42.1% (EULER-SR).
Affiliation(s)
- Iain Maccallum
- Broad Institute of MIT and Harvard, Charles Street, Cambridge, MA 02141, USA.
|
7027
|
Nagarajan N, Pop M. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J Comput Biol 2009; 16:897-908. [PMID: 19580519 DOI: 10.1089/cmb.2009.0005] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
In recent years, a flurry of new DNA sequencing technologies have altered the landscape of genomics, providing a vast amount of sequence information at a fraction of the costs that were previously feasible. The task of assembling these sequences into a genome has, however, still remained an algorithmic challenge that is in practice answered by heuristic solutions. In order to design better assembly algorithms and exploit the characteristics of sequence data from new technologies, we need an improved understanding of the parametric complexity of the assembly problem. In this article, we provide a first theoretical study in this direction, exploring the connections between repeat complexity, read lengths, overlap lengths and coverage in determining the "hard" instances of the assembly problem. Our work suggests at least two ways in which existing assemblers can be extended in a rigorous fashion, in addition to delineating directions for future theoretical investigations.
Affiliation(s)
- Niranjan Nagarajan
- Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland 20742, USA
|
7028
|
Stabler RA, He M, Dawson L, Martin M, Valiente E, Corton C, Lawley TD, Sebaihia M, Quail MA, Rose G, Gerding DN, Gibert M, Popoff MR, Parkhill J, Dougan G, Wren BW. Comparative genome and phenotypic analysis of Clostridium difficile 027 strains provides insight into the evolution of a hypervirulent bacterium. Genome Biol 2009; 10:R102. [PMID: 19781061 PMCID: PMC2768977 DOI: 10.1186/gb-2009-10-9-r102] [Citation(s) in RCA: 364] [Impact Index Per Article: 24.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2009] [Revised: 06/29/2009] [Accepted: 09/25/2009] [Indexed: 11/10/2022] Open
Abstract
A genome comparison of non-epidemic and epidemic strains of Clostridium difficile reveals gene gains that could explain how a hypervirulent strain has emerged. Background The continued rise of Clostridium difficile infections worldwide has been accompanied by the rapid emergence of a highly virulent clone designated PCR-ribotype 027. To understand more about the evolution of this virulent clone, we made a three-way genomic and phenotypic comparison of an 'historic' non-epidemic 027 C. difficile (CD196), a recent epidemic and hypervirulent 027 (R20291) and a previously sequenced PCR-ribotype 012 strain (630). Results Although the genomes are highly conserved, the 027 genomes have 234 additional genes compared to 630, which may contribute to the distinct phenotypic differences we observe between these strains relating to motility, antibiotic resistance and toxicity. The epidemic 027 strain has five unique genetic regions, absent from both the non-epidemic 027 and strain 630, which include a novel phage island, a two component regulatory system and transcriptional regulators. Conclusions A comparison of a series of 027 isolates showed that some of these genes appeared to have been gained by 027 strains over the past two decades. This study provides genetic markers for the identification of 027 strains and offers a unique opportunity to explain the recent emergence of a hypervirulent bacterium.
Affiliation(s)
- Richard A Stabler
- London School of Hygiene and Tropical Medicine, Keppel Street, London, WC1E 7HT, UK.
|
7029
|
Diguistini S, Liao NY, Platt D, Robertson G, Seidel M, Chan SK, Docking TR, Birol I, Holt RA, Hirst M, Mardis E, Marra MA, Hamelin RC, Bohlmann J, Breuil C, Jones SJ. De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biol 2009; 10:R94. [PMID: 19747388 PMCID: PMC2768983 DOI: 10.1186/gb-2009-10-9-r94] [Citation(s) in RCA: 123] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2009] [Accepted: 09/11/2009] [Indexed: 12/16/2022] Open
Abstract
A method for de novo assembly of a eukaryotic genome using Illumina, 454 and Sanger generated sequence data. Sequencing-by-synthesis technologies can reduce the cost of generating de novo genome assemblies. We report a method for assembling draft genome sequences of eukaryotic organisms that integrates sequence information from different sources, and demonstrate its effectiveness by assembling an approximately 32.5 Mb draft genome sequence for the forest pathogen Grosmannia clavigera, an ascomycete fungus. We also developed a method for assessing draft assemblies using Illumina paired end read data and demonstrate how we are using it to guide future sequence finishing. Our results demonstrate that eukaryotic genome sequences can be accurately assembled by combining Illumina, 454 and Sanger sequence data.
Affiliation(s)
- Scott Diguistini
- Department of Wood Science, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada.
|
7030
|
Rodrigue S, Malmstrom RR, Berlin AM, Birren BW, Henn MR, Chisholm SW. Whole genome amplification and de novo assembly of single bacterial cells. PLoS One 2009; 4:e6864. [PMID: 19724646 PMCID: PMC2731171 DOI: 10.1371/journal.pone.0006864] [Citation(s) in RCA: 183] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2009] [Accepted: 07/27/2009] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Single-cell genome sequencing has the potential to allow the in-depth exploration of the vast genetic diversity found in uncultured microbes. We used the marine cyanobacterium Prochlorococcus as a model system for addressing important challenges facing high-throughput whole genome amplification (WGA) and complete genome sequencing of individual cells. METHODOLOGY/PRINCIPAL FINDINGS We describe a pipeline that enables single-cell WGA on hundreds of cells at a time while virtually eliminating non-target DNA from the reactions. We further developed a post-amplification normalization procedure that mitigates extreme variations in sequencing coverage associated with multiple displacement amplification (MDA), and demonstrated that the procedure increased sequencing efficiency and facilitated genome assembly. We report genome recovery as high as 99.6% with reference-guided assembly, and 95% with de novo assembly starting from a single cell. We also analyzed the impact of chimera formation during MDA on de novo assembly, and discuss strategies to minimize the presence of incorrectly joined regions in contigs. CONCLUSIONS/SIGNIFICANCE The methods described in this paper will be useful for sequencing genomes of individual cells from a variety of samples.
Affiliation(s)
- Sébastien Rodrigue
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Rex R. Malmstrom
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Aaron M. Berlin
- The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Bruce W. Birren
- The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Matthew R. Henn
- The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Sallie W. Chisholm
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
|
7031
|
Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 2009; 6:677-81. [PMID: 19668202 PMCID: PMC3661775 DOI: 10.1038/nmeth.1363] [Citation(s) in RCA: 1017] [Impact Index Per Article: 67.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2009] [Accepted: 07/13/2009] [Indexed: 11/09/2022]
Abstract
Detection and characterization of genomic structural variation are important for understanding the landscape of genetic variation in human populations and in complex diseases such as cancer. Recent studies demonstrate the feasibility of detecting structural variation using next-generation, short-insert, paired-end sequencing reads. However, the utility of these reads is not entirely clear, nor are the analysis methods with which accurate detection can be achieved. The algorithm BreakDancer predicts a wide variety of structural variants including insertion-deletions (indels), inversions and translocations. We examined BreakDancer's performance in simulation, in comparison with other methods and in analyses of a sample from an individual with acute myeloid leukemia and of samples from the 1,000 Genomes trio individuals. BreakDancer sensitively and accurately detected indels ranging from 10 base pairs to 1 megabase pair that are difficult to detect via a single conventional approach.
Affiliation(s)
- Ken Chen
- The Genome Center, Washington University School of Medicine, St. Louis, Missouri, USA.
|
7032
|
Giannakis M, Bäckhed HK, Chen SL, Faith JJ, Wu M, Guruge JL, Engstrand L, Gordon JI. Response of gastric epithelial progenitors to Helicobacter pylori Isolates obtained from Swedish patients with chronic atrophic gastritis. J Biol Chem 2009; 284:30383-94. [PMID: 19723631 PMCID: PMC2781593 DOI: 10.1074/jbc.m109.052738] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Abstract
Helicobacter pylori infection is associated with gastric adenocarcinoma in some humans, especially those that develop an antecedent condition, chronic atrophic gastritis (ChAG). Gastric epithelial progenitors (GEPs) in transgenic gnotobiotic mice with a ChAG-like phenotype harbor intracellular collections of H. pylori. To characterize H. pylori adaptations to ChAG, we sequenced the genomes of 24 isolates obtained from 6 individuals, each sampled over a 4-year interval, as they did or did not progress from normal gastric histology to ChAG and/or adenocarcinoma. H. pylori populations within study participants were largely clonal and remarkably stable regardless of disease state. GeneChip studies of the responses of a cultured mouse gastric stem cell-like line (mGEPs) to infection with sequenced strains yielded a 695-member dataset of transcripts that are (i) differentially expressed after infection with ChAG-associated isolates, but not with a “normal” or a heat-killed ChAG isolate, and (ii) enriched in genes and gene functions associated with tumorigenesis in general and gastric carcinogenesis in specific cases. Transcriptional profiling of a ChAG strain during mGEP infection disclosed a set of responses, including up-regulation of hopZ, an adhesin belonging to a family of outer membrane proteins. Expression profiles of wild-type and ΔhopZ strains revealed a number of pH-regulated genes modulated by HopZ, including hopP, which binds sialylated glycans produced by GEPs in vivo. Genetic inactivation of hopZ produced a fitness defect in the stomachs of gnotobiotic transgenic mice but not in wild-type littermates. This study illustrates an approach for identifying GEP responses specific to ChAG-associated H. pylori strains and bacterial genes important for survival in a model of the ChAG gastric ecosystem.
Affiliation(s)
- Marios Giannakis
- Center for Genome Sciences, Washington University, St Louis, Missouri 63108, USA
|
7033
|
Morozova O, Hirst M, Marra MA. Applications of new sequencing technologies for transcriptome analysis. Annu Rev Genomics Hum Genet 2009; 10:135-51. [PMID: 19715439 DOI: 10.1146/annurev-genom-082908-145957] [Citation(s) in RCA: 340] [Impact Index Per Article: 22.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Transcriptome analysis has been a key area of biological inquiry for decades. Over the years, research in the field has progressed from candidate gene-based detection of RNAs using Northern blotting to high-throughput expression profiling driven by the advent of microarrays. Next-generation sequencing technologies have revolutionized transcriptomics by providing opportunities for multidimensional examinations of cellular transcriptomes in which high-throughput expression data are obtained at a single-base resolution.
Affiliation(s)
- Olena Morozova
- BC Cancer Agency, Genome Sciences Center, Vancouver, BC V5Z 4S6, Canada.
|
7034
|
Soderlund C, Johnson E, Bomhoff M, Descour A. PAVE: program for assembling and viewing ESTs. BMC Genomics 2009; 10:400. [PMID: 19709403 PMCID: PMC2748094 DOI: 10.1186/1471-2164-10-400] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2009] [Accepted: 08/26/2009] [Indexed: 11/10/2022] Open
Abstract
Background New sequencing technologies are rapidly emerging. Many laboratories are simultaneously working with the traditional Sanger ESTs and experimenting with ESTs generated by the 454 Life Science sequencers. Though Sanger ESTs have been used to generate contigs for many years, no program takes full advantage of the 5' and 3' mate-pair information, hence, many tentative transcripts are assembled into two separate contigs. The new 454 technology has the benefit of high-throughput expression profiling, but introduces time and space problems for assembling large contigs. Results The PAVE (Program for Assembling and Viewing ESTs) assembler takes advantage of the 5' and 3' mate-pair information by requiring that the mate-pairs be assembled into the same contig and joined by n's if the two sub-contigs do not overlap. It handles the depth of 454 data sets by "burying" similar ESTs during assembly, which retains the expression level information while circumventing time and space problems. PAVE uses MegaBLAST for the clustering step and CAP3 for assembly, however it assembles incrementally to enforce the mate-pair constraint, bury ESTs, and reduce incorrect joins and splits. The PAVE data management system uses a MySQL database to store multiple libraries of ESTs along with their metadata; the management system allows multiple assemblies with variations on libraries and parameters. Analysis routines provide standard annotation for the contigs including a measure of differentially expressed genes across the libraries. A Java viewer program is provided for display and analysis of the results. Our results clearly show the benefit of using the PAVE assembler to explicitly use mate-pair information and bury ESTs for large contigs. Conclusion The PAVE assembler provides a software package for assembling Sanger and/or 454 ESTs. The assembly software, data management software, Java viewer and user's guide are freely available.
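The mate-pair rule described in this abstract, where 5' and 3' sub-contigs are kept in one contig and joined by n's when they do not overlap, can be sketched in a few lines of Python. This is an illustration of the idea only, not PAVE's implementation: the function name, the exact suffix/prefix overlap test, the minimum overlap, and the gap length are all assumptions.

```python
def join_mate_subcontigs(left: str, right: str,
                         min_overlap: int = 4, gap_len: int = 10) -> str:
    """Merge two sub-contigs on an exact suffix/prefix overlap, else pad with n's.

    left / right: sub-contigs built from the 5' and 3' mates, same orientation.
    min_overlap, gap_len: illustrative parameters (assumed, not from PAVE).
    """
    longest = min(len(left), len(right))
    # Try the longest possible overlap first, down to the minimum accepted.
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    # No overlap found: keep both halves in one contig, separated by n's.
    return left + "n" * gap_len + right


print(join_mate_subcontigs("ACGTACGGA", "CGGATTTC"))        # ACGTACGGATTTC
print(join_mate_subcontigs("ACGTAC", "TTTGGG", gap_len=5))  # ACGTACnnnnnTTTGGG
```

Real EST overlaps contain sequencing errors, so a production assembler would use an alignment-based overlap test rather than exact string matching.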
Affiliation(s)
- Carol Soderlund
- BIO5 Institute, University of Arizona, Tucson, AZ 85721, USA.
|
7035
|
Gibbons JG, Janson EM, Hittinger CT, Johnston M, Abbot P, Rokas A. Benchmarking next-generation transcriptome sequencing for functional and evolutionary genomics. Mol Biol Evol 2009; 26:2731-44. [PMID: 19706727 DOI: 10.1093/molbev/msp188] [Citation(s) in RCA: 129] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022] Open
Abstract
Next-generation sequencing has opened the door to genomic analysis of nonmodel organisms. Technologies generating long-sequence reads (200-400 bp) are increasingly used in evolutionary studies of nonmodel organisms, but the short-sequence reads (30-50 bp) that can be produced at lower cost are thought to be of limited utility for de novo sequencing applications. Here, we tested this assumption by short-read sequencing the transcriptomes of the tropical disease vectors Aedes aegypti and Anopheles gambiae, for which complete genome sequences are available. Comparison of our results to the reference genomes allowed us to accurately evaluate the quantity, quality, and functional and evolutionary information content of our "test" data. We produced more than 0.7 billion nucleotides of sequenced data per species that assembled into more than 21,000 test contigs larger than 100 bp per species and covered approximately 27% of the Aedes reference transcriptome. Remarkably, the substitution error rate in the test contigs was approximately 0.25% per site, with very few indels or assembly errors. Test contigs of both species were enriched for genes involved in energy production and protein synthesis and underrepresented in genes involved in transcription and differentiation. Ortholog prediction using the test contigs was accurate across hundreds of millions of years of evolution. Our results demonstrate the considerable utility of short-read transcriptome sequencing for genomic studies of nonmodel organisms and suggest an approach for assessing the information content of next-generation data for evolutionary studies.
Affiliation(s)
- John G Gibbons
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
|
7036
|
Studholme DJ, Ibanez SG, MacLean D, Dangl JL, Chang JH, Rathjen JP. A draft genome sequence and functional screen reveals the repertoire of type III secreted proteins of Pseudomonas syringae pathovar tabaci 11528. BMC Genomics 2009; 10:395. [PMID: 19703286 PMCID: PMC2745422 DOI: 10.1186/1471-2164-10-395] [Citation(s) in RCA: 74] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2009] [Accepted: 08/24/2009] [Indexed: 11/28/2022] Open
Abstract
Background Pseudomonas syringae is a widespread bacterial pathogen that causes disease on a broad range of economically important plant species. Pathogenicity of P. syringae strains is dependent on the type III secretion system, which secretes a suite of up to about thirty virulence 'effector' proteins into the host cytoplasm where they subvert the eukaryotic cell physiology and disrupt host defences. P. syringae pathovar tabaci naturally causes disease on wild tobacco, the model member of the Solanaceae, a family that includes many crop species, as well as on soybean. Results We used the 'next-generation' Illumina sequencing platform and the Velvet short-read assembly program to generate a 145X deep 6,077,921 nucleotide draft genome sequence for P. syringae pathovar tabaci strain 11528. From our draft assembly, we predicted 5,300 potential genes encoding proteins at least 100 amino acids long, of which 303 (5.72%) had no significant sequence similarity to those encoded by the three previously fully sequenced P. syringae genomes. Of the core set of Hrp Outer Proteins that are conserved in three previously fully sequenced P. syringae strains, most were also conserved in strain 11528, including AvrE1, HopAH2, HopAJ2, HopAK1, HopAN1, HopI, HopJ1, HopX1, HrpK1 and HrpW1. However, the hrpZ1 gene is partially deleted and hopAF1 is completely absent in 11528. The draft genome of strain 11528 also encodes close homologues of HopO1, HopT1, HopAH1, HopR1, HopV1, HopAG1, HopAS1, HopAE1, HopAR1, HopF1, and HopW1 and a degenerate HopM1'. Using a functional screen, we confirmed that hopO1, hopT1, hopAH1, hopM1', hopAE1, hopAR1, and hopAI1' are part of the virulence-associated HrpL regulon, though the hopAI1' and hopM1' sequences were degenerate with premature stop codons. We also discovered two additional HrpL-regulated effector candidates and an HrpL-regulated distant homologue of avrPto1. Conclusion The draft genome sequence facilitates the continued development of P. syringae pathovar tabaci on wild tobacco as an attractive model system for studying bacterial disease on plants. The catalogue of effectors sheds further light on the evolution of pathogenicity and host-specificity as well as providing a set of molecular tools for the study of plant defence mechanisms. We also discovered several large genomic regions in Pta 11528 that do not share detectable nucleotide sequence similarity with previously sequenced Pseudomonas genomes. These regions may include horizontally acquired islands that possibly contribute to pathogenicity or epiphytic fitness of Pta 11528.
|
7037
|
Parker HG, VonHoldt BM, Quignon P, Margulies EH, Shao S, Mosher DS, Spady TC, Elkahloun A, Cargill M, Jones PG, Maslen CL, Acland GM, Sutter NB, Kuroki K, Bustamante CD, Wayne RK, Ostrander EA. An expressed fgf4 retrogene is associated with breed-defining chondrodysplasia in domestic dogs. Science 2009; 325:995-8. [PMID: 19608863 PMCID: PMC2748762 DOI: 10.1126/science.1173275] [Citation(s) in RCA: 238] [Impact Index Per Article: 15.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
Retrotransposition of processed mRNAs is a common source of novel sequence acquired during the evolution of genomes. Although the vast majority of retroposed gene copies, or retrogenes, rapidly accumulate debilitating mutations that disrupt the reading frame, a small percentage become new genes that encode functional proteins. By using a multibreed association analysis in the domestic dog, we demonstrate that expression of a recently acquired retrogene encoding fibroblast growth factor 4 (fgf4) is strongly associated with chondrodysplasia, a short-legged phenotype that defines at least 19 dog breeds including dachshund, corgi, and basset hound. These results illustrate the important role of a single evolutionary event in constraining and directing phenotypic diversity in the domestic dog.
Affiliation(s)
- Heidi G. Parker
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Bridgett M. VonHoldt
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, 90095 USA
- Pascale Quignon
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Elliott H. Margulies
- Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Stephanie Shao
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Dana S. Mosher
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Tyrone C. Spady
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Abdel Elkahloun
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Michele Cargill
- Affymetrix Corporation, 3420 Central Expwy, Santa Clara, CA 95051 USA
- Paul G. Jones
- The WALTHAM® Centre for Pet Nutrition, Waltham on the Wolds, Leicestershire, UK, LE14 4RT
- Cheryl L. Maslen
- Division of Cardiovascular Medicine, Oregon Health & Science University, Portland, OR 97239 USA
- Gregory M. Acland
- Baker Institute for Animal Health, Cornell University, Ithaca, NY 14853, USA
- College of Veterinary Medicine, Cornell University, Ithaca, NY 14853, USA
- Nathan B. Sutter
- College of Veterinary Medicine, Cornell University, Ithaca, NY 14853, USA
- Keiichi Kuroki
- Comparative Orthopaedic Laboratory, University of Missouri, Columbia, MO 65211
- Carlos D. Bustamante
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, USA
- Robert K. Wayne
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, 90095 USA
- Elaine A. Ostrander
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892 USA
|
7038
|
Varshney RK, Nayak SN, May GD, Jackson SA. Next-generation sequencing technologies and their implications for crop genetics and breeding. Trends Biotechnol 2009; 27:522-30. [PMID: 19679362 DOI: 10.1016/j.tibtech.2009.05.006] [Citation(s) in RCA: 401] [Impact Index Per Article: 26.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2009] [Revised: 05/21/2009] [Accepted: 05/27/2009] [Indexed: 10/20/2022]
Abstract
Using next-generation sequencing technologies it is possible to resequence entire plant genomes or sample entire transcriptomes more efficiently and economically and in greater depth than ever before. Rather than sequencing individual genomes, we envision the sequencing of hundreds or even thousands of related genomes to sample genetic diversity within and between germplasm pools. Identification and tracking of genetic variation are now so efficient and precise that thousands of variants can be tracked within large populations. In this review, we outline some important areas such as the large-scale development of molecular markers for linkage mapping, association mapping, wide crosses and alien introgression, epigenetic modifications, transcript profiling, population genetics and de novo genome/organellar genome assembly for which these technologies are expected to advance crop genetics and breeding, leading to crop improvement.
Affiliation(s)
- Rajeev K Varshney
- Centre of Excellence in Genomics (CEG), International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru 502324, A.P., India.
|
7039
|
Bozdag S, Close TJ, Lonardi S. A compartmentalized approach to the assembly of physical maps. BMC Bioinformatics 2009; 10:217. [PMID: 19604400 PMCID: PMC2717093 DOI: 10.1186/1471-2105-10-217] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 12/30/2008] [Accepted: 07/15/2009] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Physical maps have been historically one of the cornerstones of genome sequencing and map-based cloning strategies. They also support marker assisted breeding and EST mapping. The problem of building a high quality physical map is computationally challenging due to unavoidable noise in the input fingerprint data. RESULTS We propose a novel compartmentalized method for the assembly of high quality physical maps from fingerprinted clones. The knowledge of genetic markers enables us to group clones into clusters so that clones in the same cluster are more likely to overlap. For each cluster of clones, a local physical map is first constructed using FingerPrinted Contigs (FPC). Then, all the individual maps are carefully merged into the final physical map. Experimental results on the genomes of rice and barley demonstrate that the compartmentalized assembly produces significantly more accurate maps, and that it can detect and isolate clones that would induce "chimeric" contigs if used in the final assembly. CONCLUSION The software is available for download at http://www.cs.ucr.edu/~sbozdag/assembler/
Affiliation(s)
- Serdar Bozdag
- National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
|
7040
|
Du J, Bjornson RD, Zhang ZD, Kong Y, Snyder M, Gerstein MB. Integrating sequencing technologies in personal genomics: optimal low cost reconstruction of structural variants. PLoS Comput Biol 2009; 5:e1000432. [PMID: 19593373 PMCID: PMC2700963 DOI: 10.1371/journal.pcbi.1000432] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Received: 12/19/2008] [Accepted: 06/04/2009] [Indexed: 12/02/2022] Open
Abstract
The goal of human genome re-sequencing is obtaining an accurate assembly of an individual's genome. Recently, there has been great excitement in the development of many technologies for this (e.g. medium and short read sequencing from companies such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbleGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider optimally integrating technologies. Here, we build a simulation toolbox that will help us optimally combine different technologies for genome re-sequencing, especially in reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing. (It is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome.) To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how we can combine different technologies to optimally solve the assembly at low cost. With mapability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using one length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding in arrays is more efficient than just sequencing for disentangling some complex SVs. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost.
In recent years, the development of high throughput sequencing and array technologies has enabled the accurate re-sequencing of individual genomes, especially in identifying and reconstructing the variants in an individual's genome compared to a “reference”. The costs and sensitivities of these technologies differ considerably from each other, and even more technologies are expected to appear in the near future. To both reduce the total cost of re-sequencing to an affordable point and be adaptive to these constantly evolving bio-technologies, we propose to build a computationally efficient simulation framework that can help us optimize the combination of different technologies to perform low cost comparative genome re-sequencing, especially in reconstructing large structural variants, which is considered in many respects the most challenging step in genome re-sequencing. Our simulation results quantitatively show how much improvement one can gain in reconstructing large structural variants by integrating different technologies in optimal ways. We envision that in the future, more experimental technologies will be incorporated into this simulation framework and its results can provide informative guidelines for the actual experimental design to achieve optimal genome re-sequencing output at low costs.
Affiliation(s)
- Jiang Du
- Department of Computer Science, Yale University, New Haven, Connecticut, United States of America
- Robert D. Bjornson
- Department of Computer Science, Yale University, New Haven, Connecticut, United States of America
- Keck Biotechnology Resource Laboratory, Yale University, New Haven, Connecticut, United States of America
- Zhengdong D. Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Yong Kong
- Keck Biotechnology Resource Laboratory, Yale University, New Haven, Connecticut, United States of America
- Michael Snyder
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut, United States of America
- Mark B. Gerstein
- Department of Computer Science, Yale University, New Haven, Connecticut, United States of America
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
|
7041
|
Chen XS, Collins LJ, Biggs PJ, Penny D. High throughput genome-wide survey of small RNAs from the parasitic protists Giardia intestinalis and Trichomonas vaginalis. Genome Biol Evol 2009; 1:165-75. [PMID: 20333187 PMCID: PMC2817412 DOI: 10.1093/gbe/evp017] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Accepted: 06/30/2009] [Indexed: 12/26/2022] Open
Abstract
RNA interference (RNAi) is a set of mechanisms which regulate gene expression in eukaryotes. Key elements of RNAi are small sense and antisense RNAs from 19 to 26 nt generated from double-stranded RNAs. MicroRNAs (miRNAs) are a major type of RNAi-associated small RNAs and are found in most eukaryotes studied to date. To investigate whether small RNAs associated with RNAi appear to be present in all eukaryotic lineages, and therefore present in the ancestral eukaryote, we studied two deep-branching protozoan parasites, Giardia intestinalis and Trichomonas vaginalis. Little is known about endogenous small RNAs involved in RNAi of these organisms. Using Illumina Solexa sequencing and genome-wide analysis of small RNAs from these distantly related deep-branching eukaryotes, we identified 10 strong miRNA candidates from Giardia and 11 from Trichomonas. We also found evidence of Giardia short-interfering RNAs potentially involved in the expression of variant-specific surface proteins. In addition, eight new small nucleolar RNAs from Trichomonas are identified. Our results indicate that miRNAs are likely to be general in ancestral eukaryotes and therefore are likely to be a universal feature of eukaryotes.
Affiliation(s)
- Xiaowei Sylvia Chen
- Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand.
|
7042
|
Abstract
MOTIVATION High throughput sequencing technologies generate large amounts of short reads. Mapping these to a reference sequence consumes large amounts of processing time and memory, and read mapping errors can lead to noisy or incorrect alignments. SNP-o-matic is a fast, memory-efficient and stringent read mapping tool offering a variety of analytical output functions, with an emphasis on genotyping. AVAILABILITY http://snpomatic.sourceforge.net.
|
7043
|
Zhao F, Hou H, Bao Q, Wu J. PGA4genomics for comparative genome assembly based on genetic algorithm optimization. Genomics 2009; 94:284-6. [PMID: 19573591 DOI: 10.1016/j.ygeno.2009.06.006] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 05/05/2009] [Accepted: 06/19/2009] [Indexed: 11/28/2022]
Abstract
New sequencing technologies greatly facilitate large-scale bacterial genome sequencing by reducing cost. However, a considerable bottleneck remains in the finishing phase, where dozens to hundreds of gaps need to be closed. In this study, we constructed a web server (PGA4genomics) to help users automate gap closing based on comparative genomic syntenies. Extensive evaluations showed that it significantly outperforms previous methods and can produce highly accurate layout results, especially when assembling genomes that are only moderately related. The availability of such a platform would greatly benefit the research community working on bacterial genomics. PGA4genomics can be accessed at two mirror sites: http://centre.bioinformatics.zj.cn:8080/pga or http://59.79.168.90:8080/pga.
Affiliation(s)
- Fangqing Zhao
- Institute of Biomedical Informatics/Zhejiang Provincial Key Laboratory of Medical Genetics, Wenzhou Medical College, Wenzhou 325035, China.
|
7044
|
Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009; 25:2865-71. [PMID: 19561018 PMCID: PMC2781750 DOI: 10.1093/bioinformatics/btp394] [Citation(s) in RCA: 1485] [Impact Index Per Article: 99.0] [Indexed: 11/18/2022] Open
Abstract
Motivation: There is a strong demand in the genomic community to develop effective algorithms to reliably identify genomic variants. Indel detection using next-gen data is difficult and identification of long structural variations is extremely challenging. Results: We present Pindel, a pattern growth approach, to detect breakpoints of large deletions and medium-sized insertions from paired-end short reads. We use both simulated reads and real data to demonstrate the efficiency of the computer program and accuracy of the results. Availability: The binary code and a short user manual can be freely downloaded from http://www.ebi.ac.uk/~kye/pindel/. Contact: k.ye@lumc.nl; zn1@sanger.ac.uk
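The idea of locating a deletion breakpoint from a read that spans it can be illustrated with a small sketch. This is a brute-force split-read search written for illustration only, not Pindel's pattern growth algorithm; the function name `find_deletion`, the `min_part` parameter, and the toy sequences are all assumptions introduced here.

```python
def find_deletion(ref, read, min_part=4):
    """Return (start, end) of a deleted reference interval spanned by `read`.

    Illustrative brute force: try each split of the read, longest left part
    first, and require the left part to occur exactly once in the reference
    and the right part to occur somewhere downstream of it.
    """
    for split in range(len(read) - min_part, min_part - 1, -1):
        left, right = read[:split], read[split:]
        left_pos = ref.find(left)
        if left_pos == -1:
            continue  # left part does not occur in the reference
        if ref.find(left, left_pos + 1) != -1:
            continue  # left part is not unique, so the anchor is ambiguous
        left_end = left_pos + split
        # The right part must start at least one base past the left part.
        right_pos = ref.find(right, left_end + 1)
        if right_pos != -1:
            return left_end, right_pos
    return None


# Toy example: the read joins ref[4:10] directly to ref[16:22].
ref = "TTGACCAGTACGGATCTAGCATTGCA"
read = ref[4:10] + ref[16:22]
print(find_deletion(ref, read))
```

In this toy case the reported interval is the half-open range of reference positions missing from the read. Real data would also need mismatch tolerance and a bounded search window around the mapped mate, which this sketch omits.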
Affiliation(s)
- Kai Ye
- EMBL Outstation European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
|
7045
|
Schröder J, Schröder H, Puglisi SJ, Sinha R, Schmidt B. SHREC: a short-read error correction method. Bioinformatics 2009; 25:2157-63. [PMID: 19542152 DOI: 10.1093/bioinformatics/btp379] [Citation(s) in RCA: 115] [Impact Index Per Article: 7.7] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION Second-generation sequencing technologies produce a massive amount of short reads in a single experiment. However, sequencing errors can cause major problems when using this approach for de novo sequencing applications. Moreover, existing error correction methods have been designed and optimized for shotgun sequencing. Therefore, there is an urgent need for the design of fast and accurate computational methods and tools for error correction of large amounts of short read data. RESULTS We present SHREC, a new algorithm for correcting errors in short-read data that uses a generalized suffix trie on the read data as the underlying data structure. Our results show that the method can identify erroneous reads with sensitivity and specificity of over 99% and 96% for simulated data with error rates of up to 3% as well as for real data. Furthermore, it achieves an error correction accuracy of over 80% for simulated data and over 88% for real data. These results are clearly superior to previously published approaches. SHREC is available as an efficient open-source Java implementation that allows processing of 10 million short reads on a standard workstation.
Affiliation(s)
- Jan Schröder
- Institut für Informatik, Christian-Albrecht-Universität Kiel, Herman-Rodewald-Strasse 3, 24118 Kiel, Germany.
|
7046
|
Dutilh BE, Huynen MA, Strous M. Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly. Bioinformatics 2009; 25:2878-81. [PMID: 19542148 PMCID: PMC2781756 DOI: 10.1093/bioinformatics/btp377] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Indexed: 11/27/2022]
Abstract
Motivation: Most microbial species cannot be cultured in the laboratory. Metagenomic sequencing may still yield a complete genome if the sequenced community is enriched and the sequencing coverage is high. However, the complexity in a natural population may cause the enrichment culture to contain multiple related strains. This diversity can confound existing strict assembly programs and lead to a fragmented assembly, which is unnecessary if we have a related reference genome available that can function as a scaffold. Results: Here, we map short metagenomic sequencing reads from a population of strains to a related reference genome, and compose a genome that captures the consensus of the population's sequences. We show that by iteration of the mapping and assembly procedure, the coverage increases while the similarity with the reference genome decreases. This indicates that the assembly becomes less dependent on the reference genome and approaches the consensus genome of the multi-strain population. Contact: dutilh@cmbi.ru.nl Supplementary Information: Supplementary data are available at Bioinformatics online.
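The iterate-map-and-rebuild loop described in this abstract can be sketched on toy data. The code below is a minimal illustration under stated assumptions, not the authors' pipeline: reads are placed by exhaustive Hamming-distance search, the consensus is a per-column majority vote, and all names (`best_position`, `consensus_round`, `iterate_consensus`, `max_mismatches`) are hypothetical.

```python
from collections import Counter


def best_position(read, ref):
    """Return (position, mismatches) of the best ungapped placement of read."""
    best_pos, best_mm = 0, len(read) + 1
    for pos in range(len(ref) - len(read) + 1):
        mm = sum(a != b for a, b in zip(read, ref[pos:pos + len(read)]))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def consensus_round(reads, ref, max_mismatches=1):
    """Map reads to ref and return (new consensus, number of reads mapped)."""
    columns = [Counter() for _ in ref]
    mapped = 0
    for read in reads:
        pos, mm = best_position(read, ref)
        if mm <= max_mismatches:
            mapped += 1
            for offset, base in enumerate(read):
                columns[pos + offset][base] += 1
    # Uncovered columns keep the base from the current reference.
    consensus = "".join(
        col.most_common(1)[0][0] if col else ref_base
        for col, ref_base in zip(columns, ref)
    )
    return consensus, mapped


def iterate_consensus(reads, ref, max_rounds=5, max_mismatches=1):
    """Repeat mapping and consensus building until the sequence stops changing."""
    mapped_per_round = []
    for _ in range(max_rounds):
        new_ref, mapped = consensus_round(reads, ref, max_mismatches)
        mapped_per_round.append(mapped)
        if new_ref == ref:
            break
        ref = new_ref
    return ref, mapped_per_round


# Toy data: the sampled population differs from the reference at two sites.
reference = "GATTACAGCT"
reads = ["GATCA", "CACGG", "CGGCT"]
final, mapped_history = iterate_consensus(reads, reference)
print(final, mapped_history)
```

In this toy run the middle read carries two differences from the starting reference, so it is rejected in the first round and only maps once the consensus has moved toward the population sequence, which mirrors the coverage increase the abstract reports.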
Affiliation(s)
- Bas E Dutilh
- Center for Molecular and Biomolecular Informatics, Nijmegen Center for Molecular Life Sciences, Radboud University Nijmegen Medical Center, Geert Grooteplein 28, 6525 GA, Nijmegen, The Netherlands.
|
7047
|
Hurd PJ, Nelson CJ. Advantages of next-generation sequencing versus the microarray in epigenetic research. BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS 2009; 8:174-83. [PMID: 19535508 DOI: 10.1093/bfgp/elp013] [Citation(s) in RCA: 157] [Impact Index Per Article: 10.5] [Indexed: 12/20/2022]
Abstract
Several recent studies from the field of epigenetics have combined chromatin-immunoprecipitation (ChIP) with next-generation high-throughput sequencing technologies to describe the locations of histone post-translational modifications (PTM) and DNA methylation genome-wide. While these reports begin to quench the chromatin biologist's thirst for visualizing where in the genome epigenetic marks are placed, they also illustrate several advantages of sequencing-based genomics compared to microarray analysis. Accordingly, next-generation sequencing (NGS) technologies are now challenging microarrays as the tool of choice for genome analysis. The increased affordability of comprehensive sequence-based genomic analysis will enable new questions to be addressed in many areas of biology. It is inevitable that massively parallel sequencing platforms will supersede the microarray for many applications; however, there are niches for microarrays to fill, and we may well witness a symbiotic relationship between microarrays and high-throughput sequencing in the future.
Affiliation(s)
- Paul J Hurd
- School of Biological and Chemical Sciences, Queen Mary University of London, London, E1 4NS, UK.
|
7048
|
Schmidt B, Sinha R, Beresford-Smith B, Puglisi SJ. A fast hybrid short read fragment assembly algorithm. Bioinformatics 2009; 25:2279-80. [PMID: 19535537 DOI: 10.1093/bioinformatics/btp374] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 11/13/2022] Open
Abstract
SUMMARY The shorter and vastly more numerous reads produced by second-generation sequencing technologies require new tools that can assemble massive numbers of reads in reasonable time. Existing short-read assembly tools can be classified into two categories: greedy extension-based and graph-based. While the graph-based approaches are generally superior in terms of assembly quality, the computer resources required for building and storing a huge graph are very high. In this article, we present Taipan, an assembly algorithm which can be viewed as a hybrid of these two approaches. Taipan uses greedy extensions for contig construction but at each step realizes enough of the corresponding read graph to make better decisions as to how assembly should continue. We show that this approach can achieve an assembly quality at least as good as the graph-based approaches used in the popular Edena and Velvet assembly tools using a moderate amount of computing resources.
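The greedy extension category mentioned in this summary can be sketched in a few lines. The code below is plain greedy extension only, for illustration; it does not build the partial read graph that the abstract says Taipan consults at each step, and the function name `greedy_extend` and the `min_overlap` parameter are assumptions introduced here.

```python
def greedy_extend(seed, reads, min_overlap=3):
    """Extend `seed` to the right by repeatedly appending the unused read
    whose prefix has the longest overlap with the current contig suffix."""
    contig = seed
    unused = [r for r in reads if r != seed]
    while True:
        best_len, best_read = 0, None
        for read in unused:
            # Overlap must leave at least one new base to append.
            max_len = min(len(contig), len(read) - 1)
            for k in range(max_len, min_overlap - 1, -1):
                if contig.endswith(read[:k]):
                    if k > best_len:
                        best_len, best_read = k, read
                    break  # longest overlap for this read found
        if best_read is None:
            return contig  # no read overlaps well enough; stop extending
        contig += best_read[best_len:]
        unused.remove(best_read)


# Toy reads that tile a short sequence with overlaps of four bases.
toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]
contig = greedy_extend("ACGTAC", toy_reads)
print(contig)
```

A purely greedy choice like this can be misled by repeats, since the locally best overlap is taken without looking at alternatives; that weakness is the motivation the summary gives for combining greedy extension with graph information.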
Affiliation(s)
- Bertil Schmidt
- School of Computer Engineering, Nanyang Technological University, Singapore.
|
7049
|
Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, Horsman DE, Connors JM, Gascoyne RD, Marra MA, Jones SJM. De novo transcriptome assembly with ABySS. Bioinformatics 2009; 25:2872-7. [PMID: 19528083 DOI: 10.1093/bioinformatics/btp367] [Citation(s) in RCA: 297] [Impact Index Per Article: 19.8] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION Whole transcriptome shotgun sequencing data from non-normalized samples offer unique opportunities to study the metabolic states of organisms. One can deduce gene expression levels using sequence coverage as a surrogate, identify coding changes or discover novel isoforms or transcripts. Especially for discovery of novel events, de novo assembly of transcriptomes is desirable. RESULTS Transcriptome from tumor tissue of a patient with follicular lymphoma was sequenced with 36 base pair (bp) single- and paired-end reads on the Illumina Genome Analyzer II platform. We assembled approximately 194 million reads using ABySS into 66 921 contigs 100 bp or longer, with a maximum contig length of 10 951 bp, representing over 30 million base pairs of unique transcriptome sequence, or roughly 1% of the genome. AVAILABILITY AND IMPLEMENTATION Source code and binaries of ABySS are freely available for download at http://www.bcgsc.ca/platform/bioinfo/software/abyss. Assembler tool is implemented in C++. The parallel version uses Open MPI. ABySS-Explorer tool is implemented in Java using the Java universal network/graph framework. CONTACT ibirol@bcgsc.ca.
Affiliation(s)
- Inanç Birol
- Genome Sciences Centre, Vancouver, BC V5Z 4S6, Canada.
|
7050
|
Abstract
We have generated extreme ionizing radiation resistance in a relatively sensitive bacterial species, Escherichia coli, by directed evolution. Four populations of Escherichia coli K-12 were derived independently from strain MG1655, with each specifically adapted to survive exposure to high doses of ionizing radiation. D(37) values for strains isolated from two of the populations approached that exhibited by Deinococcus radiodurans. Complete genomic sequencing was carried out on nine purified strains derived from these populations. Clear mutational patterns were observed that both pointed to key underlying mechanisms and guided further characterization of the strains. In these evolved populations, passive genomic protection is not in evidence. Instead, enhanced recombinational DNA repair makes a prominent but probably not exclusive contribution to genome reconstitution. Multiple genes, multiple alleles of some genes, multiple mechanisms, and multiple evolutionary pathways all play a role in the evolutionary acquisition of extreme radiation resistance. Several mutations in the recA gene and a deletion of the e14 prophage both demonstrably contribute to and partially explain the new phenotype. Mutations in additional components of the bacterial recombinational repair system and the replication restart primosome are also prominent, as are mutations in genes involved in cell division, protein turnover, and glutamate transport. At least some evolutionary pathways to extreme radiation resistance are constrained by the temporally ordered appearance of specific alleles.
|