1
|
Curry KD, Yu FB, Vance SE, Segarra S, Bhaya D, Chikhi R, Rocha EPC, Treangen TJ. Reference-free structural variant detection in microbiomes via long-read co-assembly graphs. Bioinformatics 2024; 40:i58-i67. [PMID: 38940156 PMCID: PMC11211843 DOI: 10.1093/bioinformatics/btae224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing all metagenomic samples in a series (time or other metric) into a single co-assembly graph. The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining. RESULTS We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between successive time and temperature samples, suggesting host advantage. Our approach leverages previous work in assembly graph structural and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial gene flux. AVAILABILITY AND IMPLEMENTATION rhea is open source and available at: https://github.com/treangenlab/rhea.
Collapse
Affiliation(s)
- Kristen D Curry
- Department of Computer Science, Rice University, 6100 Main St., Houston, TX 77005, United States
- Department of Genomes and Genetics, Microbial Evolutionary Genomics, Institut Pasteur, Université Paris Cité, CNRS, UMR3525, Paris 75015, France
| | | | - Summer E Vance
- Department of Environmental Science, Policy, and Management, University of California, Berkeley, CA 94720, United States
| | - Santiago Segarra
- Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, United States
| | - Devaki Bhaya
- Carnegie Institution for Science, Department of Plant Biology, Stanford, CA 94305, United States
| | - Rayan Chikhi
- Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris 75015, France
| | - Eduardo P C Rocha
- Department of Genomes and Genetics, Microbial Evolutionary Genomics, Institut Pasteur, Université Paris Cité, CNRS, UMR3525, Paris 75015, France
| | - Todd J Treangen
- Department of Computer Science, Rice University, 6100 Main St., Houston, TX 77005, United States
| |
Collapse
|
2
|
Curry KD, Yu FB, Vance SE, Segarra S, Bhaya D, Chikhi R, Rocha EP, Treangen TJ. Reference-free Structural Variant Detection in Microbiomes via Long-read Coassembly Graphs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.25.577285. [PMID: 38352454 PMCID: PMC10862772 DOI: 10.1101/2024.01.25.577285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/20/2024]
Abstract
Bacterial genome dynamics are vital for understanding the mechanisms underlying microbial adaptation, growth, and their broader impact on host phenotype. Structural variants (SVs), genomic alterations of 10 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to absence of clear reference genomes and presence of mixed strains. In response, our proposed method rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing a single metagenome coassembly graph constructed from all samples in a series. The log fold change in graph coverage between subsequent samples is then calculated to call SVs that are thriving or declining throughout the series. We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, which is particularly noticeable as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between subsequent time and temperature samples, suggesting host advantage. Our innovative approach leverages raw read patterns rather than references or MAGs to include all sequencing reads in analysis, and thus provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial genome dynamics.
Collapse
Affiliation(s)
- Kristen D. Curry
- Rice University, Department of Computer Science, Houston, TX 77005, United States
- Institut Pasteur, Université Paris Cité, CNRS, UMR3525, Microbial Evolutionary Genomics, 75015 Paris, France
| | | | - Summer E. Vance
- University of California, Berkeley, Department of Environmental Science, Policy, and Management, Berkeley, CA 94720, United States
| | - Santiago Segarra
- Rice University, Department of Electrical and Computer Engineering, Houston, TX 77005, United States
| | - Devaki Bhaya
- Carnegie Institution for Science, Department of Plant Biology, Stanford, CA 94305, United States
| | - Rayan Chikhi
- Institut Pasteur, Université Paris Cité, Sequence Bioinformatics unit, 75015 Paris, France
| | - Eduardo P.C. Rocha
- Institut Pasteur, Université Paris Cité, CNRS, UMR3525, Microbial Evolutionary Genomics, 75015 Paris, France
| | - Todd J. Treangen
- Rice University, Department of Computer Science, Houston, TX 77005, United States
| |
Collapse
|
3
|
Balaji A, Sapoval N, Seto C, Leo Elworth R, Fu Y, Nute MG, Savidge T, Segarra S, Treangen TJ. KOMB: K-core based de novo characterization of copy number variation in microbiomes. Comput Struct Biotechnol J 2022; 20:3208-3222. [PMID: 35832621 PMCID: PMC9249589 DOI: 10.1016/j.csbj.2022.06.019] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2022] [Revised: 06/08/2022] [Accepted: 06/09/2022] [Indexed: 11/29/2022] Open
Abstract
Characterizing metagenomes via kmer-based, database-dependent taxonomic classification has yielded key insights into underlying microbiome dynamics. However, novel approaches are needed to track community dynamics and genomic flux within metagenomes, particularly in response to perturbations. We describe KOMB, a novel method for tracking genome level dynamics within microbiomes. KOMB utilizes K-core decomposition to identify Structural variations (SVs), specifically, population-level Copy Number Variation (CNV) within microbiomes. K-core decomposition partitions the graph into shells containing nodes of induced degree at least K, yielding reduced computational complexity compared to prior approaches. Through validation on a synthetic community, we show that KOMB recovers and profiles repetitive genomic regions in the sample. KOMB is shown to identify functionally-important regions in Human Microbiome Project datasets, and was used to analyze longitudinal data and identify keystone taxa in Fecal Microbiota Transplantation (FMT) samples. In summary, KOMB represents a novel graph-based, taxonomy-oblivious, and reference-free approach for tracking CNV within microbiomes. KOMB is open source and available for download at https://gitlab.com/treangenlab/komb.
Collapse
Affiliation(s)
- Advait Balaji
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Nicolae Sapoval
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Charlie Seto
- Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA
| | - R.A. Leo Elworth
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Yilei Fu
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Michael G. Nute
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Tor Savidge
- Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA
| | - Santiago Segarra
- Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
- Corresponding author.
| | - Todd J. Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
- Corresponding author.
| |
Collapse
|
4
|
Alipanahi B, Muggli MD, Jundi M, Noyes NR, Boucher C. Metagenome SNP calling via read-colored de Bruijn graphs. Bioinformatics 2021; 36:5275-5281. [PMID: 32049324 DOI: 10.1093/bioinformatics/btaa081] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2018] [Revised: 01/08/2020] [Accepted: 02/03/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Metagenomics refers to the study of complex samples containing of genetic contents of multiple individual organisms and, thus, has been used to elucidate the microbiome and resistome of a complex sample. The microbiome refers to all microbial organisms in a sample, and the resistome refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria. Single-nucleotide polymorphisms (SNPs) can be effectively used to 'fingerprint' specific organisms and genes within the microbiome and resistome and trace their movement across various samples. However, to effectively use these SNPs for this traceability, a scalable and accurate metagenomics SNP caller is needed. Moreover, such an SNP caller should not be reliant on reference genomes since 95% of microbial species is unculturable, making the determination of a reference genome extremely challenging. In this article, we address this need. RESULTS We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. LueVari is able to identify SNPs in both AMR genes and chromosomal DNA from shotgun metagenomics data with reliable sensitivity (between 91% and 99%) and precision (between 71% and 99%) as the performance of competing methods varies widely. Furthermore, we show that LueVari constructs sequences containing the variation, which span up to 97.8% of genes in datasets, which can be helpful in detecting distinct AMR genes in large metagenomic datasets. AVAILABILITY AND IMPLEMENTATION Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Bahar Alipanahi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Martin D Muggli
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Musa Jundi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Noelle R Noyes
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Christina Boucher
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| |
Collapse
|
5
|
Kolmogorov M, Bickhart DM, Behsaz B, Gurevich A, Rayko M, Shin SB, Kuhn K, Yuan J, Polevikov E, Smith TPL, Pevzner PA. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods 2020; 17:1103-1110. [PMID: 33020656 PMCID: PMC10699202 DOI: 10.1038/s41592-020-00971-x] [Citation(s) in RCA: 364] [Impact Index Per Article: 91.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2020] [Revised: 08/22/2020] [Accepted: 09/07/2020] [Indexed: 02/06/2023]
Abstract
Long-read sequencing technologies have substantially improved the assemblies of many isolate bacterial genomes as compared to fragmented short-read assemblies. However, assembling complex metagenomic datasets remains difficult even for state-of-the-art long-read assemblers. Here we present metaFlye, which addresses important long-read metagenomic assembly challenges, such as uneven bacterial composition and intra-species heterogeneity. First, we benchmarked metaFlye using simulated and mock bacterial communities and show that it consistently produces assemblies with better completeness and contiguity than state-of-the-art long-read assemblers. Second, we performed long-read sequencing of the sheep microbiome and applied metaFlye to reconstruct 63 complete or nearly complete bacterial genomes within single contigs. Finally, we show that long-read assembly of human microbiomes enables the discovery of full-length biosynthetic gene clusters that encode biomedically important natural products.
Collapse
Affiliation(s)
- Mikhail Kolmogorov
- Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Derek M Bickhart
- Cell Wall Biology and Utilization Laboratory, Dairy Forage Research Center, USDA, Madison, WI, USA
| | - Bahar Behsaz
- Graduate Program in Bioinformatics and System Biology, University of California, San Diego, CA, USA
| | - Alexey Gurevich
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
| | - Mikhail Rayko
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
| | - Sung Bong Shin
- USDA-ARS US Meat Animal Research Center, Clay Center, NE, USA
| | - Kristen Kuhn
- USDA-ARS US Meat Animal Research Center, Clay Center, NE, USA
| | - Jeffrey Yuan
- Graduate Program in Bioinformatics and System Biology, University of California, San Diego, CA, USA
| | - Evgeny Polevikov
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Bioinformatics Institute, St. Petersburg, Russia
| | | | - Pavel A Pevzner
- Department of Computer Science and Engineering, University of California, San Diego, CA, USA.
- Center for Microbiome Innovation, University of California, San Diego, CA, USA.
| |
Collapse
|
6
|
Olson ND, Treangen TJ, Hill CM, Cepeda-Espinoza V, Ghurye J, Koren S, Pop M. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Brief Bioinform 2020; 20:1140-1150. [PMID: 28968737 DOI: 10.1093/bib/bbx098] [Citation(s) in RCA: 80] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2017] [Revised: 07/13/2017] [Indexed: 01/09/2023] Open
Abstract
Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture making their DNA sequence our only clue into their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation.
Collapse
|
7
|
Ghurye J, Treangen T, Fedarko M, Hervey WJ, Pop M. MetaCarvel: linking assembly graph motifs to biological variants. Genome Biol 2019; 20:174. [PMID: 31451112 PMCID: PMC6710874 DOI: 10.1186/s13059-019-1791-3] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2019] [Accepted: 08/13/2019] [Indexed: 01/01/2023] Open
Abstract
Reconstructing genomic segments from metagenomics data is a highly complex task. In addition to general challenges, such as repeats and sequencing errors, metagenomic assembly needs to tolerate the uneven depth of coverage among organisms in a community and differences between nearly identical strains. Previous methods have addressed these issues by smoothing genomic variants. We present a variant-aware metagenomic scaffolder called MetaCarvel, which combines new strategies for repeat detection with graph analytics for the discovery of variants. We show that MetaCarvel can accurately reconstruct genomic segments from complex microbial mixtures and correctly identify and characterize several classes of common genomic variants.
Collapse
Affiliation(s)
- Jay Ghurye
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
| | - Todd Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Marcus Fedarko
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
| | - W Judson Hervey
- Center for Bio/Molecular Science & Engineering, United States Naval Research Laboratory, Washington, DC, USA
| | - Mihai Pop
- Department of Computer Science, University of Maryland, College Park, MD, USA.
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
| |
Collapse
|
8
|
Lee H, Kingsford C. Kourami: graph-guided assembly for novel human leukocyte antigen allele discovery. Genome Biol 2018; 19:16. [PMID: 29415772 PMCID: PMC5804087 DOI: 10.1186/s13059-018-1388-2] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2017] [Accepted: 01/08/2018] [Indexed: 01/07/2023] Open
Abstract
Accurate typing of human leukocyte antigen (HLA) is important because HLA genes play important roles in immune responses and disease genesis. Previously available computational methods are database-matching approaches and their outputs are inherently limited by the completeness of already known types, making them unsuitable for discovery of novel alleles. We have developed a graph-guided assembly technique for classical HLA genes, which can construct allele sequences given high-coverage whole-genome sequencing data. Our method delivers highly accurate HLA typing, comparable to the current state-of-the-art methods. Using various data, we also demonstrate that our method can type novel alleles.
Collapse
Affiliation(s)
- Heewook Lee
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
| |
Collapse
|
9
|
Peering into the Genetic Makeup of Natural Microbial Populations Using Metagenomics. POPULATION GENOMICS: MICROORGANISMS 2018. [DOI: 10.1007/13836_2018_14] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
|
10
|
Sheikhizadeh S, Schranz ME, Akdel M, de Ridder D, Smit S. PanTools: representation, storage and exploration of pan-genomic data. Bioinformatics 2017; 32:i487-i493. [PMID: 27587666 DOI: 10.1093/bioinformatics/btw455] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Next-generation sequencing technology is generating a wealth of highly similar genome sequences for many species, paving the way for a transition from single-genome to pan-genome analyses. Accordingly, genomics research is going to switch from reference-centric to pan-genomic approaches. We define the pan-genome as a comprehensive representation of multiple annotated genomes, facilitating analyses on the similarity and divergence of the constituent genomes at the nucleotide, gene and genome structure level. Current pan-genomic approaches do not thoroughly address scalability, functionality and usability. RESULTS We introduce a generalized De Bruijn graph as a pan-genome representation, as well as an online algorithm to construct it. This representation is stored in a Neo4j graph database, which makes our approach scalable to large eukaryotic genomes. Besides the construction algorithm, our software package, called PanTools, currently provides functionality for annotating pan-genomes, adding sequences, grouping genes, retrieving gene sequences or genomic regions, reconstructing genomes and comparing and querying pan-genomes. We demonstrate the performance of the tool using datasets of 62 E. coli genomes, 93 yeast genomes and 19 Arabidopsis thaliana genomes. AVAILABILITY AND IMPLEMENTATION The Java implementation of PanTools is publicly available at http://www.bif.wur.nl CONTACT sandra.smit@wur.nl.
Collapse
Affiliation(s)
- Siavash Sheikhizadeh
- Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
| | - M Eric Schranz
- Biosystematics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, The Netherlands
| | - Mehmet Akdel
- Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
| | - Dick de Ridder
- Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
| | - Sandra Smit
- Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
| |
Collapse
|
11
|
Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res 2017; 27:824-834. [PMID: 28298430 PMCID: PMC5411777 DOI: 10.1101/gr.213959.116] [Citation(s) in RCA: 2180] [Impact Index Per Article: 311.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2016] [Accepted: 03/13/2017] [Indexed: 01/25/2023]
Abstract
While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging, thus stifling biological discoveries. Moreover, recent studies revealed that complex bacterial populations may be composed from dozens of related strains, thus further amplifying the challenge of metagenomic assembly. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes. We benchmark metaSPAdes against other state-of-the-art metagenome assemblers and demonstrate that it results in high-quality assemblies across diverse data sets.
Collapse
Affiliation(s)
- Sergey Nurk
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia 199004
| | - Dmitry Meleshko
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia 199004
| | - Anton Korobeynikov
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia 199004.,Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia 198515
| | - Pavel A Pevzner
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia 199004.,Department of Computer Science and Engineering, University of California, San Diego, California 92093-0404, USA
| |
Collapse
|
12
|
Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 2017; 27:722-736. [PMID: 28298431 PMCID: PMC5411767 DOI: 10.1101/gr.215087.116] [Citation(s) in RCA: 4403] [Impact Index Per Article: 629.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2016] [Accepted: 03/03/2017] [Indexed: 12/11/2022]
Abstract
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
Collapse
Affiliation(s)
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | | | - Jason R Miller
- J. Craig Venter Institute, Rockville, Maryland 20850, USA
| | - Nicholas H Bergman
- National Biodefense Analysis and Countermeasures Center, Frederick, Maryland 21702, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| |
Collapse
|
13
|
Ji P, Zhang Y, Wang J, Zhao F. MetaSort untangles metagenome assembly by reducing microbial community complexity. Nat Commun 2017; 8:14306. [PMID: 28112173 PMCID: PMC5264255 DOI: 10.1038/ncomms14306] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2016] [Accepted: 12/14/2016] [Indexed: 12/31/2022] Open
Abstract
Most current approaches to analyse metagenomic data rely on reference genomes. Novel microbial communities extend far beyond the coverage of reference databases and de novo metagenome assembly from complex microbial communities remains a great challenge. Here we present a novel experimental and bioinformatic framework, metaSort, for effective construction of bacterial genomes from metagenomic samples. MetaSort provides a sorted mini-metagenome approach based on flow cytometry and single-cell sequencing methodologies, and employs new computational algorithms to efficiently recover high-quality genomes from the sorted mini-metagenome by the complementary of the original metagenome. Through extensive evaluations, we demonstrated that metaSort has an excellent and unbiased performance on genome recovery and assembly. Furthermore, we applied metaSort to an unexplored microflora colonized on the surface of marine kelp and successfully recovered 75 high-quality genomes at one time. This approach will greatly improve access to microbial genomes from complex or novel communities.
Collapse
Affiliation(s)
- Peifeng Ji
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
| | - Yanming Zhang
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
| | - Jinfeng Wang
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
| | - Fangqing Zhao
- Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
| |
Collapse
|
14
|
Ghurye JS, Cepeda-Espinoza V, Pop M. Metagenomic Assembly: Overview, Challenges and Applications. THE YALE JOURNAL OF BIOLOGY AND MEDICINE 2016; 89:353-362. [PMID: 27698619 PMCID: PMC5045144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
Advances in sequencing technologies have led to the increased use of high throughput sequencing in characterizing the microbial communities associated with our bodies and our environment. Critical to the analysis of the resulting data are sequence assembly algorithms able to reconstruct genes and organisms from complex mixtures. Metagenomic assembly involves new computational challenges due to the specific characteristics of the metagenomic data. In this survey, we focus on major algorithmic approaches for genome and metagenome assembly, and discuss the new challenges and opportunities afforded by this new field. We also review several applications of metagenome assembly in addressing interesting biological problems.
Collapse
Affiliation(s)
| | | | - Mihai Pop
- To whom all correspondence should be addressed: Mihai Pop, Department of Computer Science and Center of Bioinformatics and Computational Biology, University of Maryland, Center for Bioinformatics and Computational Biology, Biomolecular Sciences Building. Rm. 3120F, College Park, MD 20742, Phone Number: 301-405-7245,
| |
Collapse
|
15
|
Hiraoka S, Yang CC, Iwasaki W. Metagenomics and Bioinformatics in Microbial Ecology: Current Status and Beyond. Microbes Environ 2016; 31:204-12. [PMID: 27383682 PMCID: PMC5017796 DOI: 10.1264/jsme2.me16024] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
Metagenomic approaches are now commonly used in microbial ecology to study microbial communities in more detail, including many strains that cannot be cultivated in the laboratory. Bioinformatic analyses make it possible to mine huge metagenomic datasets and discover general patterns that govern microbial ecosystems. However, the findings of typical metagenomic and bioinformatic analyses still do not completely describe the ecology and evolution of microbes in their environments. Most analyses still depend on straightforward sequence similarity searches against reference databases. We herein review the current state of metagenomics and bioinformatics in microbial ecology and discuss future directions for the field. New techniques will allow us to go beyond routine analyses and broaden our knowledge of microbial ecosystems. We need to enrich reference databases, promote platforms that enable meta- or comprehensive analyses of diverse metagenomic datasets, devise methods that utilize long-read sequence information, and develop more powerful bioinformatic methods to analyze data from diverse perspectives.
Collapse
Affiliation(s)
- Satoshi Hiraoka
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, the University of Tokyo
| | | | | |
Collapse
|
16
|
Kuleshov V, Jiang C, Zhou W, Jahanbani F, Batzoglou S, Snyder M. Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nat Biotechnol 2015; 34:64-9. [PMID: 26655498 PMCID: PMC4884093 DOI: 10.1038/nbt.3416] [Citation(s) in RCA: 84] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2014] [Accepted: 10/23/2015] [Indexed: 01/30/2023]
Abstract
Identifying bacterial strains in metagenome and microbiome samples using computational analyses of short-read sequence remains a difficult problem. Here, we present an analysis of a human gut microbiome using on Tru-seq synthetic long reads combined with new computational tools for metagenomic long-read assembly, variant-calling and haplotyping (Nanoscope and Lens). Our analysis identifies 178 bacterial species of which 51 were not found using short sequence reads alone. We recover bacterial contigs that comprise multiple operons, including 22 contigs of >1Mbp. Extensive intraspecies variation among microbial strains in the form of haplotypes that span up to hundreds of Kbp can be observed using our approach. Our method incorporates synthetic long-read sequencing technology with standard shotgun approaches to move towards rapid, precise and comprehensive analyses of metagenome and microbiome samples.
Collapse
Affiliation(s)
- Volodymyr Kuleshov
- Department of Computer Science, Stanford University, Stanford, California, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
| | - Chao Jiang
- Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
| | - Wenyu Zhou
- Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
| | - Fereshteh Jahanbani
- Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
| | - Serafim Batzoglou
- Department of Computer Science, Stanford University, Stanford, California, USA
| | - Michael Snyder
- Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
| |
Collapse
|
17
|
ConStrains identifies microbial strains in metagenomic datasets. Nat Biotechnol 2015; 33:1045-52. [PMID: 26344404 PMCID: PMC4676274 DOI: 10.1038/nbt.3319] [Citation(s) in RCA: 157] [Impact Index Per Article: 17.4] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2014] [Accepted: 07/16/2015] [Indexed: 01/21/2023]
Abstract
An important fraction of microbial diversity is harbored in strain individuality, so identification of conspecific bacterial strains is imperative for improved understanding of microbial community functions. Limitations in bioinformatics and sequencing technologies have to date precluded strain identification owing to difficulties in phasing short reads to faithfully recover the original strain-level genotypes, which have highly similar sequences. We present ConStrains, an open-source algorithm that identifies conspecific strains from metagenomic sequence data and reconstructs the phylogeny of these strains in microbial communities. The algorithm uses single-nucleotide polymorphism (SNP) patterns in a set of universal genes to infer within-species structures that represent strains. Applying ConStrains to simulated and host-derived datasets provides insights into microbial community dynamics.
Collapse
|
18
|
Ye Y, Tang H. Utilizing de Bruijn graph of metagenome assembly for metatranscriptome analysis. Bioinformatics 2015; 32:1001-8. [PMID: 26319390 PMCID: PMC4896364 DOI: 10.1093/bioinformatics/btv510] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2015] [Accepted: 08/24/2015] [Indexed: 11/26/2022] Open
Abstract
Motivation: Metagenomics research has accelerated the studies of microbial organisms, providing insights into the composition and potential functionality of various microbial communities. Metatranscriptomics (studies of the transcripts from a mixture of microbial species) and other meta-omics approaches hold even greater promise for providing additional insights into functional and regulatory characteristics of the microbial communities. Current metatranscriptomics projects are often carried out without matched metagenomic datasets (of the same microbial communities). For the projects that produce both metatranscriptomic and metagenomic datasets, their analyses are often not integrated. Metagenome assemblies are far from perfect, partially explaining why metagenome assemblies are not used for the analysis of metatranscriptomic datasets. Results: Here, we report a reads mapping algorithm for mapping of short reads onto a de Bruijn graph of assemblies. A hash table of junction k-mers (k-mers spanning branching structures in the de Bruijn graph) is used to facilitate fast mapping of reads to the graph. We developed an application of this mapping algorithm: a reference-based approach to metatranscriptome assembly using graphs of metagenome assembly as the reference. Our results show that this new approach (called TAG) helps to assemble substantially more transcripts that otherwise would have been missed or truncated because of the fragmented nature of the reference metagenome. Availability and implementation: TAG was implemented in C++ and has been tested extensively on the Linux platform. It is available for download as open source at http://omics.informatics.indiana.edu/TAG. Contact:yye@indiana.edu
Collapse
Affiliation(s)
- Yuzhen Ye
- School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
| | - Haixu Tang
- School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
| |
Collapse
|
19
|
Abstract
The current genomic revolution was made possible by joint advances in genome sequencing technologies and computational approaches for analyzing sequence data. The close interaction between biologists and computational scientists is perhaps most apparent in the development of approaches for sequencing entire genomes, a feat that would not be possible without sophisticated computational tools called genome assemblers (short for genome sequence assemblers). Here, we survey the key developments in algorithms for assembling genome sequences since the development of the first DNA sequencing methods more than 35 years ago.
Collapse
Affiliation(s)
- Jared T Simpson
- Ontario Institute for Cancer Research, Toronto, Ontario M5G 0A3, Canada;
| | | |
Collapse
|
20
|
One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol 2014; 23:110-20. [PMID: 25461581 DOI: 10.1016/j.mib.2014.11.014] [Citation(s) in RCA: 265] [Impact Index Per Article: 26.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2014] [Revised: 11/17/2014] [Accepted: 11/18/2014] [Indexed: 11/20/2022]
Abstract
Like a jigsaw puzzle with large pieces, a genome sequenced with long reads is easier to assemble. However, recent sequencing technologies have favored lowering per-base cost at the expense of read length. This has dramatically reduced sequencing cost, but resulted in fragmented assemblies, which negatively affect downstream analyses and hinder the creation of finished (gapless, high-quality) genomes. In contrast, emerging long-read sequencing technologies can now produce reads tens of kilobases in length, enabling the automated finishing of microbial genomes for under $1000. This promises to improve the quality of reference databases and facilitate new studies of chromosomal structure and variation. We present an overview of these new technologies and the methods used to assemble long reads into complete genomes.
Collapse
|
21
|
Abstract
Reference-free SNP detection, that is identifying SNPs between samples directly from comparison of primary sequencing data with other primary sequencing data and not to a pre-assembled reference genome is an emergent and potentially disruptive technology that is beginning to open up new vistas in variant identification that reveals new applications in non-model organisms and metagenomics. The modern, effcient data structures these tools use enables researchers with a reference sequence to sample many more individuals with lower computing storage and processing overhead. In this article we will discuss the technologies and tools implementing reference-free SNP detection and the potential impact on studies of genetic variation in model and non-model organisms, metagenomics and personal genomics and medicine.
Collapse
|
22
|
Morgan XC, Huttenhower C. Meta'omic analytic techniques for studying the intestinal microbiome. Gastroenterology 2014; 146:1437-1448.e1. [PMID: 24486053 DOI: 10.1053/j.gastro.2014.01.049] [Citation(s) in RCA: 194] [Impact Index Per Article: 19.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/07/2013] [Revised: 01/03/2014] [Accepted: 01/24/2014] [Indexed: 12/16/2022]
Abstract
Nucleotide sequencing has become increasingly common and affordable, and is now a vital tool for studies of the human microbiome. Comprehensive microbial community surveys such as MetaHit and the Human Microbiome Project have described the composition and molecular functional profile of the healthy (normal) intestinal microbiome. This knowledge will increase our ability to analyze host and microbial DNA (genome) and RNA (transcriptome) sequences. Bioinformatic and statistical tools then can be used to identify dysbioses that might cause disease, and potential treatments. Analyses that identify perturbations in specific molecules can leverage thousands of culture-based isolate genomes to contextualize culture-independent sequences, or may integrate sequence data with whole-community functional assays such as metaproteomic or metabolomic analyses. We review the state of available systems-level models for studies of the intestinal microbiome, along with analytic techniques and tools that can be used to determine its functional capabilities in healthy and unhealthy individuals.
Collapse
Affiliation(s)
- Xochitl C Morgan
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts; The Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, Massachusetts
| | - Curtis Huttenhower
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts; The Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, Massachusetts.
| |
Collapse
|