1
Kerepesi C, Szalkai B, Grolmusz V. Visual analysis of the quantitative composition of metagenomic communities: the AmphoraVizu webserver. Microbial Ecology 2015; 69:695-697. [PMID: 25296554 DOI: 10.1007/s00248-014-0502-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 07/20/2014] [Accepted: 09/24/2014] [Indexed: 06/04/2023]
Abstract
Low-cost DNA sequencing methods have given rise to an enormous development of metagenomics in the past few years. One basic--and difficult--task is the phylogenetic annotation of the metagenomic samples studied. The difficulty comes from the fact that the typical environmental sample contains hundreds of unknown and still uncharacterized microorganisms. There are several possible methods to assign at least partial phylogenetic information to these uncharacterized data. Originally, the 16S ribosomal RNA was used as phylogenetic marker, then genome sequence alignments and similarity measures between the unknown genome and the reference genomes were applied (e.g., in the MEGAN software), and more recently, phylogeny-based methods applying suitable sets of marker genes were suggested (AMPHORA, AMPHORA2, and the webserver implementation AmphoraNet). Here, we present a visual analysis tool that is capable of demonstrating the quantitative relations gained from the output of the AMPHORA2 program or the easy-to-use AmphoraNet webserver. Our web-based tool, the AmphoraVizu webserver, makes the phylogenetic distribution of the metagenomic sample clearly visible by using the native output format of AMPHORA2 or AmphoraNet. The user may set the phylogenetic resolution (i.e., superkingdom, phylum, class, order, family, genus, and species) along with the chart type and will receive the distribution data detailed for all relevant marker genes in the sample. For publication quality results, the chart labels can be customized by the user. The visualization webserver is available at the address http://amphoravizu.pitgroup.org. The AmphoraNet webserver is available at http://amphoranet.pitgroup.org. The open-source version of the AmphoraVizu program is available for download at http://pitgroup.org/apps/amphoravizu/AmphoraVizu.pl.
Affiliation(s)
- Csaba Kerepesi
- PIT Bioinformatics Group, Eötvös University, Pázmány Péter stny 1/c, 1117, Budapest, Hungary.
2
Kerepesi C, Bánky D, Grolmusz V. AmphoraNet: the webserver implementation of the AMPHORA2 metagenomic workflow suite. Gene 2013; 533:538-40. [PMID: 24144838 DOI: 10.1016/j.gene.2013.10.015] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Received: 05/24/2013] [Revised: 10/07/2013] [Accepted: 10/08/2013] [Indexed: 02/07/2023]
Abstract
MOTIVATION Metagenomics went through an astonishing development in the past few years. Today not only gene sequencing experts, but numerous laboratories of other specializations need to analyze DNA sequences gained from clinical or environmental samples. Phylogenetic analysis of the metagenomic data presents significant challenges for the biologist and the bioinformatician. The program suite AMPHORA and its workflow version are examples of publicly available software that yields reliable phylogenetic results for metagenomic data. RESULTS Here we present AmphoraNet, an easy-to-use webserver that is capable of assigning a probability-weighted taxonomic group for each phylogenetic marker gene found in the input metagenomic sample; the webserver is based on the AMPHORA2 workflow. Since a large proportion of molecular biologists uses the BLAST program and its clones on public webservers instead of the locally installed versions, we believe that the occasional user may find it comfortable that, in this version, no time-consuming installation of every component of the AMPHORA2 suite or expertise in Linux environment is required. AVAILABILITY The webserver is freely available at http://amphoranet.pitgroup.org; no registration is required.
Affiliation(s)
- Csaba Kerepesi
- PIT Bioinformatics Group, Eötvös University, H-1117 Budapest, Hungary.
3
Kuroshu RM. Nonoverlapping clone pooling for high-throughput sequencing. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:1091-1097. [PMID: 24384700 DOI: 10.1109/tcbb.2013.83] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
Simultaneously sequencing multiple clones using second-generation sequencers can speed up many essential clone-based sequencing methods. However, in applications such as fosmid clone sequencing and full-length cDNA sequencing, it is important to create pools of clones that do not overlap on the genome for the identification of structural variations and alternatively spliced transcripts, respectively. We define the nonoverlapping clone pooling problem and provide practical solutions based on optimal graph coloring and bin-packing algorithms with constant absolute worst-case ratios, and further extend them to cope with repetitive mappings. Using theoretical analysis and experiments, we also show that the proposed methods are applicable.
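The nonoverlapping constraint described above can be illustrated with a small sketch. It assumes each clone has already been mapped to a single half-open interval `[start, end)` on the genome and greedily colors the resulting interval graph, so that clones sharing a pool never overlap. The function name `pool_clones` and the tuple layout are hypothetical, and the sketch ignores pool capacity (the bin-packing part) and repetitive mappings, both of which the paper addresses.

```python
import heapq

def pool_clones(clones):
    """Assign each clone (name, start, end) to a pool index so that no two
    clones in the same pool overlap. Intervals are half-open: [start, end)."""
    pools = {}     # clone name -> pool index
    active = []    # min-heap of (end, pool index) for pools currently occupied
    free = []      # min-heap of pool indices that can be reused
    n_pools = 0
    for name, start, end in sorted(clones, key=lambda c: c[1]):
        # Release every pool whose latest clone ends at or before this start.
        while active and active[0][0] <= start:
            _, idx = heapq.heappop(active)
            heapq.heappush(free, idx)
        if free:
            idx = heapq.heappop(free)
        else:
            idx = n_pools          # open a new pool
            n_pools += 1
        pools[name] = idx
        heapq.heappush(active, (end, idx))
    return pools

clones = [("a", 0, 10), ("b", 5, 15), ("c", 10, 20), ("d", 12, 18)]
assignment = pool_clones(clones)
```

For intervals on a line, processing clones in order of start position uses as few pools as the largest number of clones covering any single position, which is why the greedy pass suffices in this simplified setting.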
4
Lonardi S, Duma D, Alpert M, Cordero F, Beccuti M, Bhat PR, Wu Y, Ciardo G, Alsaihati B, Ma Y, Wanamaker S, Resnik J, Bozdag S, Luo MC, Close TJ. Combinatorial pooling enables selective sequencing of the barley gene space. PLoS Comput Biol 2013; 9:e1003010. [PMID: 23592960 PMCID: PMC3617026 DOI: 10.1371/journal.pcbi.1003010] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 10/29/2012] [Accepted: 02/05/2013] [Indexed: 11/23/2022]
Abstract
For the vast majority of species, including many economically or ecologically important organisms, progress in biological research is hampered by the lack of a reference genome sequence. Despite recent advances in sequencing technologies, several factors still limit the availability of such a critical resource. At the same time, many research groups and international consortia have already produced BAC libraries and physical maps and now are in a position to proceed with the development of whole-genome sequences organized around a physical map anchored to a genetic map. We propose a BAC-by-BAC sequencing protocol that combines combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when preparing sequencing libraries for hundreds or thousands of DNA samples, such as, in this case, gene-bearing minimum-tiling-path BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones (deconvolution) so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is very accurate, and the resulting BAC assemblies have high quality. Results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate and the BAC assemblies have good quality. While our method cannot provide the level of completeness that one would achieve with a comprehensive whole-genome sequencing project, we show that it is quite successful in reconstructing the gene sequences within BACs. In the case of plants such as barley, this level of sequence knowledge is sufficient to support critical end-point objectives such as map-based cloning and marker-assisted breeding.
The problem of obtaining the full genomic sequence of an organism has been solved either via a global brute-force approach (called whole-genome shotgun) or by a divide-and-conquer strategy (called clone-by-clone). Both approaches have advantages and disadvantages in terms of cost, manual labor, and the ability to deal with sequencing errors and highly repetitive regions of the genome. With the advent of second-generation sequencing instruments, the whole-genome shotgun approach has been the preferred choice. The clone-by-clone strategy is, however, still very relevant for large complex genomes. In fact, several research groups and international consortia have produced clone libraries and physical maps for many economically or ecologically important organisms and now are in a position to proceed with sequencing. In this manuscript, we demonstrate the feasibility of this approach on the gene-space of a large, very repetitive plant genome. The novelty of our approach is that, in order to take advantage of the throughput of the current generation of sequencing instruments, we pool hundreds of clones using a special type of “smart” pooling design that allows one to establish with high accuracy the source clone from the sequenced reads in a pool. Extensive simulations and experimental results support our claims.
Affiliation(s)
- Stefano Lonardi
- Department of Computer Science and Engineering, University of California, Riverside, California, USA.
5
Accurate Decoding of Pooled Sequenced Data Using Compressed Sensing. Lecture Notes in Computer Science 2013. [DOI: 10.1007/978-3-642-40453-5_7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/07/2023]
6
Feder AF, Petrov DA, Bergland AO. LDx: estimation of linkage disequilibrium from high-throughput pooled resequencing data. PLoS One 2012; 7:e48588. [PMID: 23152785 PMCID: PMC3494690 DOI: 10.1371/journal.pone.0048588] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Received: 07/06/2012] [Accepted: 10/03/2012] [Indexed: 12/14/2022]
Abstract
High-throughput pooled resequencing offers significant potential for whole genome population sequencing. However, its main drawback is the loss of haplotype information. In order to regain some of this information, we present LDx, a computational tool for estimating linkage disequilibrium (LD) from pooled resequencing data. LDx uses an approximate maximum likelihood approach to estimate LD (r²) between pairs of SNPs that can be observed within and among single reads. LDx also reports r² estimates derived solely from observed genotype counts. We demonstrate that the LDx estimates are highly correlated with r² estimated from individually resequenced strains. We discuss the performance of LDx using more stringent quality conditions and infer via simulation the degree to which performance can improve based on read depth. Finally we demonstrate two possible uses of LDx with real and simulated pooled resequencing data. First, we use LDx to infer genomewide patterns of decay of LD with physical distance in D. melanogaster population resequencing data. Second, we demonstrate that r² estimates from LDx are capable of distinguishing alternative demographic models representing plausible demographic histories of D. melanogaster.
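As a point of reference for the LD statistic named above, the following sketch computes the standard squared correlation between two biallelic SNPs directly from counts of reads that span both sites. This is only the textbook count-based definition; it is not LDx's approximate maximum likelihood estimator, and the function name and argument order are assumptions made for illustration.

```python
def r_squared(n_AB, n_Ab, n_aB, n_ab):
    """Squared correlation between two biallelic SNPs, computed from counts of
    reads carrying each of the four two-site allele combinations."""
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no reads span both sites")
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at the first site
    p_B = (n_AB + n_aB) / n          # frequency of allele B at the second site
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        return float("nan")          # one site is monomorphic in these reads
    D = n_AB / n - p_A * p_B         # coefficient of linkage disequilibrium
    return D * D / denom

complete_ld = r_squared(10, 0, 0, 10)   # alleles always co-occur
no_ld = r_squared(5, 5, 5, 5)           # alleles occur independently
```

Because only reads (or read pairs) covering both SNPs contribute counts, an estimate of this kind is limited to SNP pairs separated by less than the read or insert length.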
Affiliation(s)
- Alison F Feder
- Department of Biology, Stanford University, Stanford, California, United States of America.
7
Elsharawy A, Forster M, Schracke N, Keller A, Thomsen I, Petersen BS, Stade B, Stähler P, Schreiber S, Rosenstiel P, Franke A. Improving mapping and SNP-calling performance in multiplexed targeted next-generation sequencing. BMC Genomics 2012; 13:417. [PMID: 22913592 PMCID: PMC3563481 DOI: 10.1186/1471-2164-13-417] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 11/10/2011] [Accepted: 08/10/2012] [Indexed: 11/10/2022]
Abstract
Background Compared to classical genotyping, targeted next-generation sequencing (tNGS) can be custom-designed to interrogate entire genomic regions of interest, in order to detect novel as well as known variants. To bring down the per-sample cost, one approach is to pool barcoded NGS libraries before sample enrichment. Still, we lack a complete understanding of how this multiplexed tNGS approach and the varying performance of the ever-evolving analytical tools can affect the quality of variant discovery. Therefore, we evaluated the impact of different software tools and analytical approaches on the discovery of single nucleotide polymorphisms (SNPs) in multiplexed tNGS data. To generate our own test model, we combined a sequence capture method with NGS in three experimental stages of increasing complexity (E. coli genes, multiplexed E. coli, and multiplexed HapMap BRCA1/2 regions). Results We successfully enriched barcoded NGS libraries instead of genomic DNA, achieving reproducible coverage profiles (Pearson correlation coefficients of up to 0.99) across multiplexed samples, with <10% strand bias. However, the SNP calling quality was substantially affected by the choice of tools and mapping strategy. With the aim of reducing computational requirements, we compared conventional whole-genome mapping and SNP-calling with a new faster approach: target-region mapping with subsequent ‘read-backmapping’ to the whole genome to reduce the false detection rate. Consequently, we developed a combined mapping pipeline, which includes standard tools (BWA, SAMtools, etc.), and tested it on public HiSeq2000 exome data from the 1000 Genomes Project. Our pipeline saved 12 hours of run time per HiSeq2000 exome sample and detected ~5% more SNPs than the conventional whole genome approach. This suggests that more potential novel SNPs may be discovered using both approaches than with just the conventional approach.
Conclusions We recommend applying our general ‘two-step’ mapping approach for more efficient SNP discovery in tNGS. Our study has also shown the benefit of computing inter-sample SNP-concordances and inspecting read alignments in order to attain more confident results.
Affiliation(s)
- Abdou Elsharawy
- Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, Germany
8
Zhu Y, Bergland AO, González J, Petrov DA. Empirical validation of pooled whole genome population re-sequencing in Drosophila melanogaster. PLoS One 2012; 7:e41901. [PMID: 22848651 PMCID: PMC3406057 DOI: 10.1371/journal.pone.0041901] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.1] [Received: 04/18/2012] [Accepted: 06/28/2012] [Indexed: 11/26/2022]
Abstract
The sequencing of pooled non-barcoded individuals is an inexpensive and efficient means of assessing genome-wide population allele frequencies, yet its accuracy has not been thoroughly tested. We assessed the accuracy of this approach on whole, complex eukaryotic genomes by resequencing pools of largely isogenic, individually sequenced Drosophila melanogaster strains. We called SNPs in the pooled data and estimated false positive and false negative rates using the SNPs called in individual strains as a reference. We also estimated allele frequency of the SNPs using “pooled” data and compared them with “true” frequencies taken from the estimates in the individual strains. We demonstrate that pooled sequencing provides a faithful estimate of population allele frequency with the error well approximated by binomial sampling, and is a reliable means of novel SNP discovery with low false positive rates. However, a sufficient number of strains should be used in the pooling because variation in the amount of DNA derived from individual strains is a substantial source of noise when the number of pooled strains is low. Our results and analysis confirm that pooled sequencing is a very powerful and cost-effective technique for assessing patterns of sequence variation in populations on genome-wide scales, and is applicable to any dataset where sequencing individuals or individual cells is impossible, difficult, time consuming, or expensive.
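The statement that the error is well approximated by binomial sampling can be made concrete with a short sketch. It assumes that each read at a site independently samples an allele with probability equal to the true population frequency, so the standard error of the estimated frequency depends only on that frequency and the read depth. The function name is hypothetical, and the sketch deliberately ignores the extra noise from unequal DNA contributions that the abstract warns about for small pools.

```python
from math import sqrt

def binomial_se(true_freq, depth):
    """Standard error of an allele frequency estimated from `depth` reads,
    assuming each read is an independent draw with probability `true_freq`."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0.0 <= true_freq <= 1.0:
        raise ValueError("true_freq must lie in [0, 1]")
    return sqrt(true_freq * (1.0 - true_freq) / depth)

se_common = binomial_se(0.5, 100)   # intermediate-frequency allele at 100x
se_rare = binomial_se(0.1, 100)     # rarer allele at the same depth
```

Under this assumption, quadrupling the read depth halves the expected error.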
Affiliation(s)
- Yuan Zhu
- Department of Genetics, Stanford University, Stanford, California, United States of America.
9
Yu G. Gnom(Cmp): a quantitative approach for comparative analysis of closely related genomes of bacterial pathogens. Genome 2011; 54:402-18. [PMID: 21539441 DOI: 10.1139/g11-005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/22/2022]
Abstract
Comparative genome analysis is a powerful approach to understanding the biology of infectious bacterial pathogens. In this study, a quantitative approach, referred to as Gnom(Cmp), was developed to study the microevolution of bacterial pathogens. Although much more time-consuming than existing tools, this procedure provides a much higher resolution. Gnom(Cmp) accomplishes this by establishing genome-wide heterogeneity genotypes, which are then quantified and comparatively analyzed. The heterogeneity genotypes are defined as chromosomal base positions that have multiple variants within particular genomes, resulting from DNA duplications and subsequent mutations. To prove the concept, the procedure was applied to the genomes of 15 Staphylococcus aureus strains, focusing extensively on two pairs of hVISA/VISA strains. hVISA refers to heteroresistant vancomycin-intermediate S. aureus strains, and VISA refers to their vancomycin-intermediate mutants. hVISA/VISA displays some remarkable properties. hVISA is susceptible to vancomycin, but VISA mutants emerge soon after a short period of vancomycin therapy, making the pathogen a great model organism for fast-evolving bacterial pathogens. The analysis indicated that Gnom(Cmp) could reveal variants within the genomes, which can be analyzed within the global genome context. Gnom(Cmp) discovered evolutionary hotspots and their dynamics among many closely related, even isogenic genomes. The analysis thus allows the exploration of the molecular mechanisms behind hVISA/VISA evolution, providing working hypotheses for experimental testing and validation.
Affiliation(s)
- GongXin Yu
- Department of Biological Science, Department of Computer Science, Boise State University, Boise, ID 83725, USA.
10
Bansal V. A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics 2010; 26:i318-24. [PMID: 20529923 PMCID: PMC2881398 DOI: 10.1093/bioinformatics/btq214] [Citation(s) in RCA: 128] [Impact Index Per Article: 9.1] [Indexed: 01/01/2023]
Abstract
Motivation: Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing. Results: We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80–85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3–5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP. 
Availability: Implementation of this method is available at http://polymorphism.scripps.edu/~vbansal/software/CRISP/ Contact: vbansal@scripps.edu
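The second approach in the abstract, evaluating how likely multiple non-reference base calls are under sequencing error alone, can be sketched as a binomial tail probability. This is a simplified illustration and not the CRISP implementation: it assumes a single, uniform per-base error rate for all reads at a site, and the function name and parameters are hypothetical.

```python
from math import comb

def error_only_pvalue(n_reads, n_alt, error_rate):
    """Probability of seeing at least `n_alt` non-reference calls among
    `n_reads` reads if every non-reference call were a sequencing error,
    i.e. P(X >= n_alt) for X ~ Binomial(n_reads, error_rate)."""
    if not 0 <= n_alt <= n_reads:
        raise ValueError("n_alt must lie between 0 and n_reads")
    return sum(
        comb(n_reads, k) * error_rate**k * (1.0 - error_rate) ** (n_reads - k)
        for k in range(n_alt, n_reads + 1)
    )

# A small p-value suggests the alternate calls are unlikely to be errors alone.
p_few = error_only_pvalue(200, 2, 0.01)
p_many = error_only_pvalue(200, 12, 0.01)
```

A site would be flagged as a candidate variant when this probability falls below a chosen threshold; in a pooled setting that threshold has to account for the fact that a rare variant carried by one individual contributes only a small fraction of the reads.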
Affiliation(s)
- Vikas Bansal
- Scripps Genomic Medicine, Scripps Translational Science Institute, La Jolla, CA 92037, USA.
11
Knudsen B, Forsberg R, Miyamoto MM. A computer simulator for assessing different challenges and strategies of de novo sequence assembly. Genes (Basel) 2010; 1:263-82. [PMID: 24710045 PMCID: PMC3954094 DOI: 10.3390/genes1020263] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 06/01/2010] [Revised: 08/18/2010] [Accepted: 08/31/2010] [Indexed: 11/16/2022]
Abstract
This study presents a new computer program for assessing the effects of different factors and sequencing strategies on de novo sequence assembly. The program uses reads from actual sequencing studies or from simulations with a reference genome that may also be real or simulated. The simulated reads can be created with our read simulator. They can be of differing length and coverage, consist of paired reads with varying distance, and include sequencing errors such as color space miscalls to imitate SOLiD data. The simulated or real reads are mapped to their reference genome and our assembly simulator is then used to obtain optimal assemblies that are limited only by the distribution of repeats. By way of this mapping, the assembly simulator determines which contigs are theoretically possible, or conversely (and perhaps more importantly), which are not. We illustrate the application and utility of our new simulation tools with several experiments that test the effects of genome complexity (repeats), read length and coverage, word size in De Bruijn graph assembly, and alternative sequencing strategies (e.g., BAC pooling) on sequence assemblies. These experiments highlight just some of the uses of our simulators in the experimental design of sequencing projects and in the further development of assembly algorithms.
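To illustrate why word size and repeats interact in De Bruijn graph assembly, the sketch below builds a minimal graph from reads and lists the nodes with more than one successor. It is a toy illustration rather than the authors' simulator; the function names and the example sequence are made up for this purpose.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Nodes with more than one successor, where an assembler cannot extend a
    contig unambiguously."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

# "ATG" occurs twice in this toy sequence, followed by different bases.
sequence = "ATGCATGG"
branches_k3 = branching_nodes(de_bruijn([sequence], 3))
branches_k5 = branching_nodes(de_bruijn([sequence], 5))
```

In this toy case the ambiguity disappears once the word size is large enough that each node spans the whole repeat plus a distinguishing base, which is the kind of trade-off between word size and repeat structure that the experiments above examine.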
Affiliation(s)
- Michael M Miyamoto
- Department of Biology, Box 118525, University of Florida, Gainesville, Florida, 32611-8525, USA.
12
Abstract
Resequencing genomic DNA from pools of individuals is an effective strategy to detect new variants in targeted regions and compare them between cases and controls. There are numerous ways to assign individuals to the pools on which they are to be sequenced. The naïve, disjoint pooling scheme (many individuals to one pool) in predominant use today offers insight into allele frequencies, but does not offer the identity of an allele carrier. We present a framework for overlapping pool design, where each individual sample is resequenced in several pools (many individuals to many pools). Upon discovering a variant, the set of pools where this variant is observed reveals the identity of its carrier. We formalize the mathematical framework for such pool designs and list the requirements from such designs. We specifically address three practical concerns for pooled resequencing designs: (1) false-positives due to errors introduced during amplification and sequencing; (2) false-negatives due to undersampling particular alleles aggravated by nonuniform coverage; and consequently, (3) ambiguous identification of individual carriers in the presence of errors. We build on theory of error-correcting codes to design pools that overcome these pitfalls. We show that in practical parameters of resequencing studies, our designs guarantee high probability of unambiguous singleton carrier identification while maintaining the features of naïve pools in terms of sensitivity, specificity, and the ability to estimate allele frequencies. We demonstrate the ability of our designs in extracting rare variations using short read data from the 1000 Genomes Pilot 3 project.
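The idea that the set of pools in which a variant appears reveals its carrier can be sketched as follows. Each individual is given a distinct combination of pools as a signature, and a singleton variant is traced back by matching the observed pools against those signatures. This is a deliberately simplified illustration: the function names are hypothetical, signatures are plain fixed-size subsets rather than the error-correcting codes the paper builds on, and no tolerance for false positives or false negatives is included.

```python
from itertools import combinations

def design_pools(individuals, n_pools, weight):
    """Give every individual a distinct set of `weight` pools out of `n_pools`."""
    design = {}
    for person, signature in zip(individuals, combinations(range(n_pools), weight)):
        design[person] = frozenset(signature)
    if len(design) < len(individuals):
        raise ValueError("not enough distinct signatures for this many individuals")
    return design

def identify_carrier(design, observed_pools):
    """Return the single individual whose signature equals the pools in which
    the variant was observed, or None if there is no unique exact match."""
    observed = frozenset(observed_pools)
    matches = [person for person, signature in design.items() if signature == observed]
    return matches[0] if len(matches) == 1 else None

samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
design = design_pools(samples, n_pools=4, weight=2)
carrier = identify_carrier(design, {0, 3})
```

Here six individuals are resolved with four pools instead of six separate libraries. Because exact matching fails as soon as one pool is missed or falsely called, a practical design needs signatures that remain distinguishable under such errors, which is the role of the error-correcting codes mentioned above.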
Affiliation(s)
- Snehit Prabhu
- Department of Computer Science, Columbia University, New York, New York 10025, USA.