1
|
Aplakidou E, Vergoulidis N, Chasapi M, Venetsianou NK, Kokoli M, Panagiotopoulou E, Iliopoulos I, Karatzas E, Pafilis E, Georgakopoulos-Soares I, Kyrpides NC, Pavlopoulos GA, Baltoumas FA. Visualizing metagenomic and metatranscriptomic data: A comprehensive review. Comput Struct Biotechnol J 2024; 23:2011-2033. [PMID: 38765606 PMCID: PMC11101950 DOI: 10.1016/j.csbj.2024.04.060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Revised: 04/25/2024] [Accepted: 04/25/2024] [Indexed: 05/22/2024] Open
Abstract
The fields of Metagenomics and Metatranscriptomics involve the examination of complete nucleotide sequences, gene identification, and analysis of potential biological functions within diverse organisms or environmental samples. Despite the vast opportunities for discovery in metagenomics, the sheer volume and complexity of sequence data often present challenges in processing, analysis, and visualization. This article highlights the critical role of advanced visualization tools in enabling effective exploration, querying, and analysis of these complex datasets. Emphasizing the importance of accessibility, the article categorizes various visualizers based on their intended applications and highlights their utility in empowering bioinformaticians and non-bioinformaticians to interpret and derive insights from meta-omics data effectively.
Affiliation(s)
- Eleni Aplakidou
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
  - Department of Informatics and Telecommunications, Data Science and Information Technologies program, University of Athens, 15784 Athens, Greece
- Nikolaos Vergoulidis
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
- Maria Chasapi
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
  - Department of Informatics and Telecommunications, Data Science and Information Technologies program, University of Athens, 15784 Athens, Greece
- Nefeli K. Venetsianou
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
- Maria Kokoli
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
- Eleni Panagiotopoulou
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
  - Department of Informatics and Telecommunications, Data Science and Information Technologies program, University of Athens, 15784 Athens, Greece
- Ioannis Iliopoulos
  - Department of Basic Sciences, School of Medicine, University of Crete, 71003 Heraklion, Greece
- Evangelos Karatzas
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, UK
- Evangelos Pafilis
  - Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Greece
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikos C. Kyrpides
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Georgios A. Pavlopoulos
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Center of New Biotechnologies & Precision Medicine, Department of Medicine, School of Health Sciences, National and Kapodistrian University of Athens, Greece
  - Hellenic Army Academy, 16673 Vari, Greece
- Fotis A. Baltoumas
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
2
|
Azizpour A, Balaji A, Treangen TJ, Segarra S. Graph-based self-supervised learning for repeat detection in metagenomic assembly. Genome Res 2024; 34:1468-1476. [PMID: 39029947 DOI: 10.1101/gr.279136.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 07/15/2024] [Indexed: 07/21/2024]
Abstract
Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment. This is particularly true for metagenomic data, in which genome dynamics such as horizontal gene transfer, gene duplication, and gene loss/gain complicate accurate genome assembly from metagenomic communities. Detecting repeats is a crucial first step in overcoming these challenges. To address this issue, we propose GraSSRep, a novel approach that leverages the assembly graph's structure through graph neural networks (GNNs) within a self-supervised learning framework to classify DNA sequences into repetitive and nonrepetitive categories. Specifically, we frame this problem as a node classification task within a metagenomic assembly graph. In a self-supervised fashion, we rely on a high-precision (but low-recall) heuristic to generate pseudolabels for a small proportion of the nodes. We then use those pseudolabels to train a GNN embedding and a random forest classifier to propagate the labels to the remaining nodes. In this way, GraSSRep combines sequencing features with predefined and learned graph features to achieve state-of-the-art performance in repeat detection. We evaluate our method using simulated and synthetic metagenomic data sets. The results on the simulated data highlight GraSSRep's robustness to repeat attributes, demonstrating its effectiveness in handling the complexity of repeated sequences. Additionally, experiments with synthetic metagenomic data sets reveal that incorporating the graph structure and the GNN enhances the detection performance. Finally, in comparative analyses, GraSSRep outperforms existing repeat detection tools with respect to precision and recall.
Affiliation(s)
- Ali Azizpour
  - Department of Electrical and Computer Engineering, Rice University, Houston, Texas 77005, USA
- Advait Balaji
  - Department of Computer Science, Rice University, Houston, Texas 77005, USA
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, Texas 77005, USA
  - Ken Kennedy Institute, Rice University, Houston, Texas 77005, USA
- Santiago Segarra
  - Department of Electrical and Computer Engineering, Rice University, Houston, Texas 77005, USA
  - Ken Kennedy Institute, Rice University, Houston, Texas 77005, USA
3
|
Kazantseva E, Donmez A, Frolova M, Pop M, Kolmogorov M. Strainy: phasing and assembly of strain haplotypes from long-read metagenome sequencing. Nat Methods 2024:10.1038/s41592-024-02424-1. [PMID: 39327484 DOI: 10.1038/s41592-024-02424-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Accepted: 08/22/2024] [Indexed: 09/28/2024]
Abstract
Bacterial species in microbial communities are often represented by mixtures of strains, distinguished by small variations in their genomes. Short-read approaches can be used to detect small-scale variation between strains but fail to phase these variants into contiguous haplotypes. Long-read metagenome assemblers can generate contiguous bacterial chromosomes but often suppress strain-level variation in favor of species-level consensus. Here we present Strainy, an algorithm for strain-level metagenome assembly and phasing from Nanopore and PacBio reads. Strainy takes a de novo metagenomic assembly as input and identifies strain variants, which are then phased and assembled into contiguous haplotypes. Using simulated and mock Nanopore and PacBio metagenome data, we show that Strainy assembles accurate and complete strain haplotypes, outperforming current Nanopore-based methods and comparable with PacBio-based algorithms in completeness and accuracy. We then use Strainy to assemble strain haplotypes of a complex environmental metagenome, revealing distinct strain distribution and mutational patterns in bacterial species.
Affiliation(s)
- Ekaterina Kazantseva
  - Bioinformatics and Systems Biology Program, ITMO University, St. Petersburg, Russia
- Ataberk Donmez
  - Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Maria Frolova
  - Functional Genomics of Prokaryotes Laboratory, Institute of Cell Biophysics, RAS, Pushchino, Russia
- Mihai Pop
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Mikhail Kolmogorov
  - Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
4
|
Curry KD, Yu FB, Vance SE, Segarra S, Bhaya D, Chikhi R, Rocha EPC, Treangen TJ. Reference-free structural variant detection in microbiomes via long-read co-assembly graphs. Bioinformatics 2024; 40:i58-i67. [PMID: 38940156 PMCID: PMC11211843 DOI: 10.1093/bioinformatics/btae224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by combining all metagenomic samples in a series (time or other metric) into a single co-assembly graph. The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining. RESULTS We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between successive time and temperature samples, suggesting host advantage. Our approach leverages previous work in assembly graph structural and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial gene flux. AVAILABILITY AND IMPLEMENTATION rhea is open source and available at: https://github.com/treangenlab/rhea.
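The coverage comparison described in this abstract can be illustrated with a minimal sketch. It is not rhea's implementation: the function names, the base-2 logarithm, the pseudocount, and the calling threshold are all assumptions made for illustration. Only the idea of a log fold change in per-node graph coverage between successive samples comes from the abstract.

```python
import math


def log_fold_changes(coverage, pseudocount=1.0):
    """Map each graph node to its log2 fold changes between successive samples.

    coverage: {node_id: [coverage in sample 0, coverage in sample 1, ...]}
    pseudocount: assumed smoothing constant so that zero coverage stays finite.
    """
    return {
        node: [
            math.log2((later + pseudocount) / (earlier + pseudocount))
            for earlier, later in zip(series, series[1:])
        ]
        for node, series in coverage.items()
    }


def label_changes(fold_changes, threshold=1.0):
    """Label each step as thriving, declining, or stable (threshold is assumed)."""
    def label(value):
        if value >= threshold:
            return "thriving"
        if value <= -threshold:
            return "declining"
        return "stable"

    return {node: [label(v) for v in values] for node, values in fold_changes.items()}


# Hypothetical per-node coverage across three successive samples.
coverage = {"node_1": [3.0, 7.0, 7.0], "node_2": [15.0, 3.0, 0.0]}
fold_changes = log_fold_changes(coverage)
labels = label_changes(fold_changes)
```

With these made-up values, `node_1` doubles its smoothed coverage in the first step and then holds steady, while `node_2` drops fourfold at each step.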
Affiliation(s)
- Kristen D Curry
  - Department of Computer Science, Rice University, 6100 Main St., Houston, TX 77005, United States
  - Department of Genomes and Genetics, Microbial Evolutionary Genomics, Institut Pasteur, Université Paris Cité, CNRS, UMR3525, Paris 75015, France
- Summer E Vance
  - Department of Environmental Science, Policy, and Management, University of California, Berkeley, CA 94720, United States
- Santiago Segarra
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, United States
- Devaki Bhaya
  - Carnegie Institution for Science, Department of Plant Biology, Stanford, CA 94305, United States
- Rayan Chikhi
  - Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris 75015, France
- Eduardo P C Rocha
  - Department of Genomes and Genetics, Microbial Evolutionary Genomics, Institut Pasteur, Université Paris Cité, CNRS, UMR3525, Paris 75015, France
- Todd J Treangen
  - Department of Computer Science, Rice University, 6100 Main St., Houston, TX 77005, United States
5
|
Agustinho DP, Fu Y, Menon VK, Metcalf GA, Treangen TJ, Sedlazeck FJ. Unveiling microbial diversity: harnessing long-read sequencing technology. Nat Methods 2024; 21:954-966. [PMID: 38689099 DOI: 10.1038/s41592-024-02262-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 03/29/2024] [Indexed: 05/02/2024]
Abstract
Long-read sequencing has recently transformed metagenomics, enhancing strain-level pathogen characterization, enabling accurate and complete metagenome-assembled genomes, and improving microbiome taxonomic classification and profiling. These advancements are due not only to improvements in sequencing accuracy but also to rapidly changing analysis methods. In this Review, we explore long-read sequencing's profound impact on metagenomics, focusing on computational pipelines for genome assembly, taxonomic characterization and variant detection, to summarize recent advancements in the field and provide an overview of available analytical methods to fully leverage long reads. We provide insights into the advantages and disadvantages of long reads over short reads and their evolution from the early days of long-read sequencing to their recent impact on metagenomics and clinical diagnostics. We further point out remaining challenges for the field such as the integration of methylation signals in sub-strain analysis and the lack of benchmarks.
Affiliation(s)
- Daniel P Agustinho
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Yilei Fu
  - Department of Computer Science, Rice University, Houston, TX, USA
- Vipin K Menon
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
  - Senior research project manager, Human Genetics, Genentech, South San Francisco, CA, USA
- Ginger A Metcalf
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, TX, USA
  - Department of Bioengineering, Rice University, Houston, TX, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
  - Department of Computer Science, Rice University, Houston, TX, USA
6
|
Curry KD, Yu FB, Vance SE, Segarra S, Bhaya D, Chikhi R, Rocha EP, Treangen TJ. Reference-free Structural Variant Detection in Microbiomes via Long-read Coassembly Graphs. bioRxiv 2024:2024.01.25.577285. [PMID: 38352454 PMCID: PMC10862772 DOI: 10.1101/2024.01.25.577285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Bacterial genome dynamics are vital for understanding the mechanisms underlying microbial adaptation, growth, and their broader impact on host phenotype. Structural variants (SVs), genomic alterations of 10 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by using a single metagenome coassembly graph constructed from all samples in a series. The log fold change in graph coverage between subsequent samples is then calculated to call SVs that are thriving or declining throughout the series. We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, which is particularly noticeable as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between subsequent time and temperature samples, suggesting host advantage. Our innovative approach leverages raw read patterns rather than references or MAGs to include all sequencing reads in analysis, and thus provides versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial genome dynamics.
Affiliation(s)
- Kristen D. Curry
  - Rice University, Department of Computer Science, Houston, TX 77005, United States
  - Institut Pasteur, Université Paris Cité, CNRS, UMR3525, Microbial Evolutionary Genomics, 75015 Paris, France
- Summer E. Vance
  - University of California, Berkeley, Department of Environmental Science, Policy, and Management, Berkeley, CA 94720, United States
- Santiago Segarra
  - Rice University, Department of Electrical and Computer Engineering, Houston, TX 77005, United States
- Devaki Bhaya
  - Carnegie Institution for Science, Department of Plant Biology, Stanford, CA 94305, United States
- Rayan Chikhi
  - Institut Pasteur, Université Paris Cité, Sequence Bioinformatics unit, 75015 Paris, France
- Eduardo P.C. Rocha
  - Institut Pasteur, Université Paris Cité, CNRS, UMR3525, Microbial Evolutionary Genomics, 75015 Paris, France
- Todd J. Treangen
  - Rice University, Department of Computer Science, Houston, TX 77005, United States
7
|
Sapoval N, Tanevski M, Treangen TJ. KombOver: Efficient k-core and K-truss based characterization of perturbations within the human gut microbiome. Pac Symp Biocomput 2024; 29:506-520. [PMID: 38160303 PMCID: PMC10764071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]
Abstract
The microbes present in the human gastrointestinal tract are regularly linked to human health and disease outcomes. Thanks to technological and methodological advances in recent years, metagenomic sequencing data, and computational methods designed to analyze metagenomic data, have contributed to improved understanding of the link between the human gut microbiome and disease. However, while numerous methods have been recently developed to extract quantitative and qualitative results from host-associated microbiome data, improved computational tools are still needed to track microbiome dynamics with short-read sequencing data. Previously we have proposed KOMB as a de novo tool for identifying copy number variations in metagenomes for characterizing microbial genome dynamics in response to perturbations. In this work, we present KombOver (KO), which includes four key contributions with respect to our previous work: (i) it scales to large microbiome study cohorts, (ii) it includes both k-core and K-truss based analysis, (iii) we provide the foundation of a theoretical understanding of the relation between various graph-based metagenome representations, and (iv) we provide an improved user experience with easier-to-run code and more descriptive outputs/results. To highlight the aforementioned benefits, we applied KO to nearly 1000 human microbiome samples, requiring less than 10 minutes and 10 GB RAM per sample to process these data. Furthermore, we highlight how graph-based approaches such as k-core and K-truss can be informative for pinpointing microbial community dynamics within a myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) cohort. KO is open source and available for download/use at: https://github.com/treangenlab/komb.
Affiliation(s)
- Nicolae Sapoval
  - Department of Computer Science, Rice University, Houston, TX 77005, USA
8
|
Shen K, Din AU, Sinha B, Zhou Y, Qian F, Shen B. Translational informatics for human microbiota: data resources, models and applications. Brief Bioinform 2023; 24:7152256. [PMID: 37141135 DOI: 10.1093/bib/bbad168] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/26/2022] [Revised: 04/07/2023] [Accepted: 04/11/2023] [Indexed: 05/05/2023] Open
Abstract
With the rapid development of human intestinal microbiology and diverse microbiome-related studies and investigations, a large amount of data have been generated and accumulated. Meanwhile, different computational and bioinformatics models have been developed for pattern recognition and knowledge discovery using these data. Given the heterogeneity of these resources and models, we aimed to provide a landscape of the data resources, a comparison of the computational models and a summary of the translational informatics applied to microbiota data. We first review the existing databases, knowledge bases, knowledge graphs and standardizations of microbiome data. Then, the high-throughput sequencing techniques for the microbiome and the informatics tools for their analyses are compared. Finally, translational informatics for the microbiome, including biomarker discovery, personalized treatment and smart healthcare for complex diseases, are discussed.
Affiliation(s)
- Ke Shen
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Ahmad Ud Din
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Baivab Sinha
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Yi Zhou
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Fuliang Qian
  - Center for Systems Biology, Suzhou Medical College of Soochow University, Suzhou 215123, China
  - Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Suzhou 215123, China
- Bairong Shen
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
9
|
Javkar K, Rand H, Strain E, Pop M. PRAWNS: compact pan-genomic features for whole-genome population genomics. Bioinformatics 2022; 39:6965020. [PMID: 36579850 PMCID: PMC9825322 DOI: 10.1093/bioinformatics/btac844] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2022] [Revised: 11/09/2022] [Accepted: 12/28/2022] [Indexed: 12/30/2022] Open
Abstract
MOTIVATION Scientists seeking to understand the genomic basis of bacterial phenotypes, such as antibiotic resistance, today have access to an unprecedented number of complete and nearly complete genomes. Making sense of these data requires computational tools able to perform multiple-genome comparisons efficiently, yet currently available tools cannot scale beyond several tens of genomes. RESULTS We describe PRAWNS, an efficient and scalable tool for multiple-genome analysis. PRAWNS defines a concise set of genomic features (metablocks), as well as pairwise relationships between them, which can be used as a basis for large-scale genotype-phenotype association studies. We demonstrate the effectiveness of PRAWNS by identifying genomic regions associated with antibiotic resistance in Acinetobacter baumannii. AVAILABILITY AND IMPLEMENTATION PRAWNS is implemented in C++ and Python3, licensed under the GPLv3 license, and freely downloadable from GitHub (https://github.com/KiranJavkar/PRAWNS.git). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kiran Javkar
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA
  - Joint Institute for Food Safety and Applied Nutrition, University of Maryland, College Park, MD 20740, USA
- Hugh Rand
  - Center for Food Safety and Applied Nutrition, United States Food and Drug Administration, College Park, MD 20740, USA
- Errol Strain
  - Center for Veterinary Medicine, United States Food and Drug Administration, Laurel, MD 20708, USA
- Mihai Pop
  - To whom correspondence should be addressed.
10
|
Balaji A, Sapoval N, Seto C, Leo Elworth R, Fu Y, Nute MG, Savidge T, Segarra S, Treangen TJ. KOMB: K-core based de novo characterization of copy number variation in microbiomes. Comput Struct Biotechnol J 2022; 20:3208-3222. [PMID: 35832621 PMCID: PMC9249589 DOI: 10.1016/j.csbj.2022.06.019] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/31/2022] [Revised: 06/08/2022] [Accepted: 06/09/2022] [Indexed: 11/29/2022] Open
Abstract
Characterizing metagenomes via k-mer-based, database-dependent taxonomic classification has yielded key insights into underlying microbiome dynamics. However, novel approaches are needed to track community dynamics and genomic flux within metagenomes, particularly in response to perturbations. We describe KOMB, a novel method for tracking genome level dynamics within microbiomes. KOMB utilizes K-core decomposition to identify structural variations (SVs), specifically, population-level copy number variation (CNV) within microbiomes. K-core decomposition partitions the graph into shells containing nodes of induced degree at least K, yielding reduced computational complexity compared to prior approaches. Through validation on a synthetic community, we show that KOMB recovers and profiles repetitive genomic regions in the sample. KOMB is shown to identify functionally-important regions in Human Microbiome Project datasets, and was used to analyze longitudinal data and identify keystone taxa in Fecal Microbiota Transplantation (FMT) samples. In summary, KOMB represents a novel graph-based, taxonomy-oblivious, and reference-free approach for tracking CNV within microbiomes. KOMB is open source and available for download at https://gitlab.com/treangenlab/komb.
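The K-core decomposition mentioned in this abstract can be sketched on a generic undirected graph. The code below is a simple peeling procedure written for illustration, not KOMB's implementation: the function names and the toy edge list are hypothetical, and the quadratic node selection is chosen for readability rather than speed.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: core number} for an undirected graph given as edge pairs.

    Repeatedly removes a node of minimum remaining degree; a node's core
    number is the largest minimum degree seen up to the point it is removed.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)

    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    remaining = set(degree)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbor in adjacency[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core


def shells(core):
    """Group nodes into shells keyed by core number."""
    grouped = defaultdict(set)
    for node, k in core.items():
        grouped[k].add(node)
    return dict(grouped)


# Toy graph: a triangle (a, b, c) with one pendant node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_core = core_numbers(toy_edges)
toy_shells = shells(toy_core)
```

In the toy graph the triangle forms the 2-shell and the pendant node falls into the 1-shell, which is the kind of partition the abstract refers to.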
Affiliation(s)
- Advait Balaji
  - Department of Computer Science, Rice University, Houston, TX, USA
- Nicolae Sapoval
  - Department of Computer Science, Rice University, Houston, TX, USA
- Charlie Seto
  - Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA
- R.A. Leo Elworth
  - Department of Computer Science, Rice University, Houston, TX, USA
- Yilei Fu
  - Department of Computer Science, Rice University, Houston, TX, USA
- Michael G. Nute
  - Department of Computer Science, Rice University, Houston, TX, USA
- Tor Savidge
  - Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, USA
- Santiago Segarra
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
  - Corresponding author.
- Todd J. Treangen
  - Department of Computer Science, Rice University, Houston, TX, USA
  - Corresponding author.
11
|
Javkar K, Rand H, Hoffmann M, Luo Y, Sarria S, Thirunavukkarasu N, Pillai CA, McGann P, Johnson JK, Strain E, Pop M. Whole-Genome Assessment of Clinical Acinetobacter baumannii Isolates Uncovers Potentially Novel Factors Influencing Carbapenem Resistance. Front Microbiol 2021; 12:714284. [PMID: 34659144 PMCID: PMC8518998 DOI: 10.3389/fmicb.2021.714284] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/25/2021] [Accepted: 09/01/2021] [Indexed: 12/30/2022] Open
Abstract
Carbapenems, one of the important last-line antibiotics for the treatment of gram-negative infections, are becoming ineffective for treating Acinetobacter baumannii infections. Studies have identified multiple genes (and mechanisms) responsible for carbapenem resistance. In some A. baumannii strains, the presence/absence of putative resistance genes is not consistent with their resistance phenotype, indicating that the genomic factors underlying carbapenem resistance in A. baumannii are not fully understood. Here, we describe a large-scale whole-genome genotype-phenotype association study with 349 A. baumannii isolates that extends beyond the presence/absence of individual antimicrobial resistance genes and includes the genomic positions and pairwise interactions of genes. Ten known resistance genes exhibited statistically significant associations with resistance to imipenem, a type of carbapenem: blaOXA-23, qacEdelta1, sul1, mphE, msrE, ant(3")-II, aacC1, yafP, aphA6, and xerD. A review of the strains without any of these 10 genes uncovered a clade of isolates with diverse imipenem resistance phenotypes. Finer resolution evaluation of this clade revealed the presence of a 38.6 kbp conserved chromosomal region found exclusively in imipenem-susceptible isolates. This region appears to host several HTH-type DNA binding transcriptional regulators and transporter genes. Imipenem-susceptible isolates from this clade also carried two mutually exclusive plasmids that contain genes previously known to be specific to imipenem-susceptible isolates. Our analysis demonstrates the utility of using whole genomes for genotype-phenotype correlations in the context of antibiotic resistance and provides several new hypotheses for future research.
Collapse
Affiliation(s)
- Kiran Javkar
- Department of Computer Science, University of Maryland, College Park, MD, United States.,Joint Institute for Food Safety and Applied Nutrition, University of Maryland, College Park, MD, United States
| | - Hugh Rand
- Center for Food Safety and Applied Nutrition, United States Food and Drug Administration, Department of Health and Human Services, College Park, MD, United States
| | - Maria Hoffmann
- Center for Food Safety and Applied Nutrition, United States Food and Drug Administration, Department of Health and Human Services, College Park, MD, United States
| | - Yan Luo
- Center for Food Safety and Applied Nutrition, United States Food and Drug Administration, Department of Health and Human Services, College Park, MD, United States
| | - Saul Sarria
- Center for Veterinary Medicine, United States Food and Drug Administration, Department of Health and Human Services, Laurel, MD, United States
| | - Nagarajan Thirunavukkarasu
- Center for Food Safety and Applied Nutrition, United States Food and Drug Administration, Department of Health and Human Services, College Park, MD, United States
| | - Christine A Pillai
- Center for Food Safety and Applied Nutrition, United States Food and Drug Administration, Department of Health and Human Services, College Park, MD, United States
| | - Patrick McGann
- Multidrug Resistant Organism Repository and Surveillance Network, Walter Reed Army Institute of Research, Silver Spring, MD, United States
| | - J Kristie Johnson
- Department of Pathology, University of Maryland School of Medicine, Baltimore, MD, United States
| | - Errol Strain
- Center for Veterinary Medicine, United States Food and Drug Administration, Department of Health and Human Services, Laurel, MD, United States
| | - Mihai Pop
- Department of Computer Science, University of Maryland, College Park, MD, United States
| |
|
12
|
Commichaux S, Javkar K, Ramachandran P, Nagarajan N, Bertrand D, Chen Y, Reed E, Gonzalez-Escalona N, Strain E, Rand H, Pop M, Ottesen A. Evaluating the accuracy of Listeria monocytogenes assemblies from quasimetagenomic samples using long and short reads. BMC Genomics 2021; 22:389. [PMID: 34039264 PMCID: PMC8157722 DOI: 10.1186/s12864-021-07702-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/23/2020] [Accepted: 05/11/2021] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND: Whole genome sequencing of cultured pathogens is the state-of-the-art public health response for the bioinformatic source tracking of illness outbreaks. Quasimetagenomics can substantially reduce the amount of culturing needed before a high-quality genome can be recovered. Highly accurate short read data is analyzed for single nucleotide polymorphisms and multi-locus sequence types to differentiate strains but cannot span many genomic repeats, resulting in highly fragmented assemblies. Long reads can span repeats, resulting in much more contiguous assemblies, but have lower accuracy than short reads. RESULTS: We evaluated the accuracy of Listeria monocytogenes assemblies from enrichments (quasimetagenomes) of naturally-contaminated ice cream using long read (Oxford Nanopore) and short read (Illumina) sequencing data. Accuracy of ten assembly approaches, over a range of sequencing depths, was evaluated by comparing sequence similarity of genes in assemblies to a complete reference genome. Long read assemblies reconstructed a circularized genome as well as a 71 kbp plasmid after 24 h of enrichment; however, high error rates prevented high-fidelity gene assembly, even at 150X depth of coverage. Short read assemblies accurately reconstructed the core genes after 28 h of enrichment but produced highly fragmented genomes. Hybrid approaches demonstrated promising results but had biases based upon the initial assembly strategy. Short read assemblies scaffolded with long reads accurately assembled the core genes after just 24 h of enrichment, but were highly fragmented. Long read assemblies polished with short reads reconstructed a circularized genome and plasmid and assembled all the genes after 24 h enrichment but with less fidelity for the core genes than the short read assemblies. CONCLUSION: The integration of long and short read sequencing of quasimetagenomes expedited the reconstruction of a high-quality pathogen genome compared to either platform alone. A new and more complete level of information about genome structure, gene order and mobile elements can be added to the public health response by incorporating long read analyses with the standard short read WGS outbreak response.
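The sequencing depths quoted in this abstract (for example 150X) follow the usual definition of depth of coverage: total sequenced bases divided by genome length. The sketch below only illustrates that arithmetic. The function names are hypothetical, and the genome size of roughly 2.9 Mbp for L. monocytogenes is an assumed approximate figure that does not come from the abstract.

```python
def depth_of_coverage(read_lengths, genome_size):
    """Average depth: total sequenced bases divided by genome length (bp)."""
    return sum(read_lengths) / genome_size

def bases_needed(target_depth, genome_size):
    """Total bases required to reach a target average depth."""
    return target_depth * genome_size

# Assumption: approximate L. monocytogenes genome size, not taken from the abstract.
GENOME_SIZE = 2_900_000

# Hypothetical long-read run: 50,000 reads of 8 kbp each.
reads = [8_000] * 50_000
print(f"depth: {depth_of_coverage(reads, GENOME_SIZE):.1f}X")
print(f"bases for 150X: {bases_needed(150, GENOME_SIZE):,}")
```

In a quasimetagenome only a fraction of the reads belong to the target organism, so the read lengths passed in should be those assigned to the pathogen rather than the whole run.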
Affiliation(s)
- Seth Commichaux
- Center for Food Safety and Applied Nutrition, Food and Drug Administration, Laurel, MD, USA.
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
- Biological Science Graduate Program, University of Maryland, College Park, MD, USA.
| | - Kiran Javkar
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Joint Institute for Food Safety and Applied Nutrition, University of Maryland, College Park, MD, USA
| | - Padmini Ramachandran
- Center for Food Safety and Nutrition, Food and Drug Administration, College Park, MD, USA
| | - Niranjan Nagarajan
- Computational and Systems Biology, Genome Institute of Singapore, Singapore, 13862, Singapore
| | - Denis Bertrand
- Computational and Systems Biology, Genome Institute of Singapore, Singapore, 13862, Singapore
| | - Yi Chen
- Center for Food Safety and Nutrition, Food and Drug Administration, College Park, MD, USA
| | - Elizabeth Reed
- Center for Food Safety and Nutrition, Food and Drug Administration, College Park, MD, USA
| | | | - Errol Strain
- Center for Food Safety and Applied Nutrition, Food and Drug Administration, Laurel, MD, USA
| | - Hugh Rand
- Center for Food Safety and Nutrition, Food and Drug Administration, College Park, MD, USA
| | - Mihai Pop
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Andrea Ottesen
- Center for Veterinary Medicine, Food and Drug Administration, Laurel, MD, USA
| |
|
13
|
Muralidharan HS, Shah N, Meisel JS, Pop M. Binnacle: Using Scaffolds to Improve the Contiguity and Quality of Metagenomic Bins. Front Microbiol 2021; 12:638561. [PMID: 33717033 PMCID: PMC7945042 DOI: 10.3389/fmicb.2021.638561] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/07/2020] [Accepted: 02/04/2021] [Indexed: 01/03/2023] Open
Abstract
High-throughput sequencing has revolutionized the field of microbiology; however, reconstructing complete genomes of organisms from whole metagenomic shotgun sequencing data remains a challenge. Recovered genomes are often highly fragmented, due to uneven abundances of organisms, repeats within and across genomes, sequencing errors, and strain-level variation. To address the fragmented nature of metagenomic assemblies, scientists rely on a process called binning, which clusters together contigs inferred to originate from the same organism. Existing binning algorithms use oligonucleotide frequencies and contig abundance (coverage) within and across samples to group together contigs from the same organism. However, these algorithms often miss short contigs and contigs from regions with unusual coverage or DNA composition characteristics, such as mobile elements. Here, we propose that information from assembly graphs can assist current strategies for metagenomic binning. We use MetaCarvel, a metagenomic scaffolding tool, to construct assembly graphs where contigs are nodes and edges are inferred based on paired-end reads. We developed a tool, Binnacle, that extracts information from the assembly graphs and clusters scaffolds into comprehensive bins. Binnacle also provides wrapper scripts to integrate with existing binning methods. The Binnacle pipeline can be found on GitHub (https://github.com/marbl/binnacle). We show that binning graph-based scaffolds, rather than contigs, improves the contiguity and quality of the resulting bins, and captures a broader set of the genes of the organisms being reconstructed.
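The graph described in this abstract (contigs as nodes, edges inferred from paired-end reads) can be illustrated with a toy example. The sketch below is not the algorithm used by Binnacle or MetaCarvel. It simply groups contigs into connected components after discarding weakly supported links, and the function name, the `min_support` threshold and the contig identifiers are all hypothetical.

```python
from collections import Counter, defaultdict

def scaffold_groups(mate_links, min_support=3):
    """Group contigs joined by at least `min_support` read pairs.

    mate_links: iterable of (contig_a, contig_b) tuples, one per read pair
    whose two mates map to different contigs.
    Contigs without any well-supported link are not returned.
    """
    # Count the read pairs supporting each unordered contig pair.
    support = Counter(tuple(sorted(pair)) for pair in mate_links
                      if pair[0] != pair[1])

    # Keep only well-supported links as undirected edges.
    graph = defaultdict(set)
    for (a, b), n in support.items():
        if n >= min_support:
            graph[a].add(b)
            graph[b].add(a)

    # Connected components via iterative depth-first search.
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups

# Hypothetical links: c3-c4 has a single supporting pair and is filtered out.
links = ([("c1", "c2")] * 4 + [("c2", "c3")] * 3
         + [("c3", "c4")] * 1 + [("c5", "c6")] * 5)
print(scaffold_groups(links))
```

A real scaffolder also uses read orientation, insert size and repeat detection to order and orient contigs, none of which is modeled here.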
Affiliation(s)
- Harihara Subrahmaniam Muralidharan
- Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
| | - Nidhi Shah
- Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
| | - Jacquelyn S Meisel
- Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
| | - Mihai Pop
- Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States
| |
|
14
|
Kolmogorov M, Bickhart DM, Behsaz B, Gurevich A, Rayko M, Shin SB, Kuhn K, Yuan J, Polevikov E, Smith TPL, Pevzner PA. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods 2020; 17:1103-1110. [PMID: 33020656 PMCID: PMC10699202 DOI: 10.1038/s41592-020-00971-x] [Citation(s) in RCA: 356] [Impact Index Per Article: 89.0] [Received: 07/16/2020] [Revised: 08/22/2020] [Accepted: 09/07/2020] [Indexed: 02/06/2023]
Abstract
Long-read sequencing technologies have substantially improved the assemblies of many isolate bacterial genomes as compared to fragmented short-read assemblies. However, assembling complex metagenomic datasets remains difficult even for state-of-the-art long-read assemblers. Here we present metaFlye, which addresses important long-read metagenomic assembly challenges, such as uneven bacterial composition and intra-species heterogeneity. First, we benchmarked metaFlye using simulated and mock bacterial communities and show that it consistently produces assemblies with better completeness and contiguity than state-of-the-art long-read assemblers. Second, we performed long-read sequencing of the sheep microbiome and applied metaFlye to reconstruct 63 complete or nearly complete bacterial genomes within single contigs. Finally, we show that long-read assembly of human microbiomes enables the discovery of full-length biosynthetic gene clusters that encode biomedically important natural products.
Affiliation(s)
- Mikhail Kolmogorov
- Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Derek M Bickhart
- Cell Wall Biology and Utilization Laboratory, Dairy Forage Research Center, USDA, Madison, WI, USA
| | - Bahar Behsaz
- Graduate Program in Bioinformatics and System Biology, University of California, San Diego, CA, USA
| | - Alexey Gurevich
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
| | - Mikhail Rayko
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
| | - Sung Bong Shin
- USDA-ARS US Meat Animal Research Center, Clay Center, NE, USA
| | - Kristen Kuhn
- USDA-ARS US Meat Animal Research Center, Clay Center, NE, USA
| | - Jeffrey Yuan
- Graduate Program in Bioinformatics and System Biology, University of California, San Diego, CA, USA
| | - Evgeny Polevikov
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Bioinformatics Institute, St. Petersburg, Russia
| | | | - Pavel A Pevzner
- Department of Computer Science and Engineering, University of California, San Diego, CA, USA.
- Center for Microbiome Innovation, University of California, San Diego, CA, USA.
| |
|
15
|
Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res 2020; 30:1291-1305. [PMID: 32801147 PMCID: PMC7545148 DOI: 10.1101/gr.263566.120] [Citation(s) in RCA: 350] [Impact Index Per Article: 87.5] [Received: 03/14/2020] [Accepted: 08/04/2020] [Indexed: 12/14/2022]
Abstract
Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.
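Two ideas in this abstract lend themselves to a short illustration: homopolymer compression, which collapses every run of identical bases to a single base, and the Phred-scaled quality value, under which a per-base accuracy of 99.999% corresponds to QV50. The sketch below shows the general ideas only and is not HiCanu's implementation. Both function names are hypothetical.

```python
import math
from itertools import groupby

def homopolymer_compress(seq: str) -> str:
    """Collapse each run of identical bases to a single base."""
    return "".join(base for base, _run in groupby(seq))

def phred_qv(accuracy: float) -> float:
    """Phred-scaled quality: QV = -10 * log10(error rate)."""
    return -10 * math.log10(1 - accuracy)

print(homopolymer_compress("GATTTTACAAAGG"))   # GATACAG
print(round(phred_qv(0.99999), 3))             # 50.0
```

Comparing reads in compressed space makes overlaps insensitive to errors in homopolymer run length, which is the motivation for the step named in the abstract.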
Affiliation(s)
- Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
| | - Robert Grothe
- Pacific Biosciences, Menlo Park, California 94025, USA
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95064, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA
| |
|
16
|
Knight R, Ley RE, Raes J, Grice EA. Expanding the scope and scale of microbiome research. Genome Biol 2019; 20:191. [PMID: 31488207 PMCID: PMC6729039 DOI: 10.1186/s13059-019-1804-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/27/2019] [Accepted: 08/27/2019] [Indexed: 11/10/2022] Open
Affiliation(s)
- Rob Knight
- Department of Pediatrics, University of California, Gilman Drive, La Jolla, San Diego, CA, 92093, USA
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California, Gilman Drive, La Jolla, San Diego, CA, 92093-0436, USA
- Department of Computer Science and Engineering, University of California, Gilman Drive, La Jolla, San Diego, CA, 92093-0404, USA
- Department of Bioengineering, University of California, La Jolla, San Diego, CA, 92093-0412, USA
| | - Ruth E Ley
- Department of Microbiome Science, Max Planck Institute for Developmental Biology, Max Planck Ring, 72076, Tübingen, Germany
| | - Jeroen Raes
- Laboratory of Molecular Bacteriology, Department of Microbiology and Immunology, Rega Institute, KU Leuven, Herestraat, 3000, Leuven, Belgium
- VIB-KU Leuven Center for Microbiology, Campus Gasthuisberg, Rega Instituut, Herestraat, 3000, Leuven, Belgium
| | - Elizabeth A Grice
- Department of Dermatology and Microbiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
| |
|
17
|
Ghurye J, Treangen T, Fedarko M, Hervey WJ, Pop M. MetaCarvel: linking assembly graph motifs to biological variants. Genome Biol 2019; 20:174. [PMID: 31451112 PMCID: PMC6710874 DOI: 10.1186/s13059-019-1791-3] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 01/05/2019] [Accepted: 08/13/2019] [Indexed: 01/01/2023] Open
Abstract
Reconstructing genomic segments from metagenomics data is a highly complex task. In addition to general challenges, such as repeats and sequencing errors, metagenomic assembly needs to tolerate the uneven depth of coverage among organisms in a community and differences between nearly identical strains. Previous methods have addressed these issues by smoothing genomic variants. We present a variant-aware metagenomic scaffolder called MetaCarvel, which combines new strategies for repeat detection with graph analytics for the discovery of variants. We show that MetaCarvel can accurately reconstruct genomic segments from complex microbial mixtures and correctly identify and characterize several classes of common genomic variants.
Affiliation(s)
- Jay Ghurye
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
| | - Todd Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Marcus Fedarko
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
| | - W Judson Hervey
- Center for Bio/Molecular Science & Engineering, United States Naval Research Laboratory, Washington, DC, USA
| | - Mihai Pop
- Department of Computer Science, University of Maryland, College Park, MD, USA.
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
| |
|