1
Voolstra CR, Alderdice R, Colin L, Staab S, Apprill A, Raina JB. Standardized Methods to Assess the Impacts of Thermal Stress on Coral Reef Marine Life. Annual Review of Marine Science 2025; 17:193-226. [PMID: 39116436 DOI: 10.1146/annurev-marine-032223-024511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/10/2024]
Abstract
The Earth's oceans have absorbed more than 90% of the excess, climate change-induced atmospheric heat. The resulting rise in oceanic temperatures affects all species and can lead to the collapse of marine ecosystems, including coral reefs. Here, we review the range of methods used to measure thermal stress impacts on reef-building corals, highlighting current standardization practices and necessary refinements to fast-track discoveries and improve interstudy comparisons. We also present technological developments that will undoubtedly enhance our ability to record and analyze standardized data. Although we use corals as an example, the methods described are widely employed in marine sciences, and our recommendations therefore apply to all species and ecosystems. Enhancing collaborative data collection efforts, implementing field-wide standardized protocols, and ensuring data availability through dedicated, openly accessible databases will enable large-scale analysis and monitoring of ecosystem changes, improving our predictive capacities and informing active intervention to mitigate climate change effects on marine life.
Affiliation(s)
- Rachel Alderdice
  - Department of Biology, University of Konstanz, Konstanz, Germany
- Luigi Colin
  - Department of Biology, University of Konstanz, Konstanz, Germany
- Sebastian Staab
  - Department of Biology, University of Konstanz, Konstanz, Germany
- Amy Apprill
  - Department of Marine Chemistry and Geochemistry, Woods Hole Oceanographic Institution, Woods Hole, Massachusetts, USA
- Jean-Baptiste Raina
  - Climate Change Cluster, University of Technology Sydney, Ultimo, New South Wales, Australia

2
Zhang Y, Zheng X, Yan W, Wang D, Chen X, Wang Y, Zhang T. Method evaluation for viruses in activated sludge: Concentration, sequencing, and identification. The Science of the Total Environment 2024; 955:176886. [PMID: 39419205 DOI: 10.1016/j.scitotenv.2024.176886] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2024] [Revised: 10/09/2024] [Accepted: 10/10/2024] [Indexed: 10/19/2024]
Abstract
Activated sludge (AS) in wastewater treatment plants is one of the largest artificial microbial ecosystems on Earth and makes enormous contributions to human societies. Viruses are an important, highly abundant component of AS. However, their communities and functions have not been as widely explored as those of other microorganisms, such as bacteria. This gap is mainly due to technical challenges in effective viral concentration, extraction, and sequencing. In this study, we compared four concentration methods, two sequencing approaches, and four bioinformatic identification tools to evaluate the whole analysis workflow for viruses in AS. Results showed that flocculation, filtration, and resuspension (FFR) gave the longest DNA lengths and ultracentrifugation obtained the highest DNA yields for viruses in AS. Based on the results of the present study, FFR and tangential flow filtration with a membrane pore size of 100 kDa are the most recommended methods for concentrating viruses from large-volume AS samples. In addition, different concentration methods recovered different viral catalogs, so multiple methods should be combined to get the whole picture of viruses in the system. geNomad was the most recommended identification tool for viruses in the present study, and long-read sequencing improved the assembly statistics of viruses compared with short-read sequencing. Of the 8192 viral operational taxonomic units in this study, 95.1% were phages belonging to the same order-level lineage, Caudovirales. Virulent phages dominated the AS system, and Pseudomonadota were the main hosts. Taken together, this study provides new insights into method selection for virus research in AS.
Affiliation(s)
- Yulin Zhang
  - Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
- Xiawan Zheng
  - Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
- Weifu Yan
  - Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
- Dou Wang
  - Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
- Xi Chen
  - Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
- Yulin Wang
  - Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
- Tong Zhang
  - Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
  - School of Public Health, The University of Hong Kong, Pokfulam Road, Hong Kong, China
  - Macau Institute of Applied Research in Medicine and Health, Macau University of Science and Technology, Macao

3
Weber CC. Disentangling cobionts and contamination in long-read genomic data using sequence composition. G3 (Bethesda, Md.) 2024; 14:jkae187. [PMID: 39148415 PMCID: PMC11540323 DOI: 10.1093/g3journal/jkae187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2024] [Revised: 08/02/2024] [Accepted: 08/02/2024] [Indexed: 08/17/2024]
Abstract
The recent acceleration in genome sequencing targeting previously unexplored parts of the tree of life presents computational challenges. Samples collected from the wild often contain sequences from several organisms, including the target, its cobionts, and contaminants. Effective methods are therefore needed to separate sequences. Though advances in sequencing technology make this task easier, it remains difficult to taxonomically assign sequences from eukaryotic taxa that are not well represented in databases. Therefore, reference-based methods alone are insufficient. Here, I examine how we can take advantage of differences in sequence composition between organisms to identify symbionts, parasites, and contaminants in samples, with minimal reliance on reference data. To this end, I explore data from the Darwin Tree of Life project, including hundreds of high-quality HiFi read sets from insects. Visualizing two-dimensional representations of read tetranucleotide composition learned by a variational autoencoder can reveal distinct components of a sample. Annotating the embeddings with additional information, such as coding density, estimated coverage, or taxonomic labels allows rapid assessment of the contents of a dataset. The approach scales to millions of sequences, making it possible to explore unassembled read sets, even for large genomes. Combined with interactive visualization tools, it allows a large fraction of cobionts reported by reference-based screening to be identified. Crucially, it also facilitates retrieving genomes for which suitable reference data are absent.
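The composition signal described in this abstract starts from per-read tetranucleotide frequencies. The following is a minimal sketch of that featurization step only, assuming plain overlapping 4-mer counts normalized to frequencies. The function name is hypothetical, reverse complements are not merged, and the variational autoencoder that the paper trains on such vectors is not shown.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over the DNA alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return a dict mapping each of the 256 tetranucleotides to its frequency.

    Assumption for illustration: windows containing symbols other than
    A, C, G, or T (for example N) are ignored, and reverse complements
    are counted separately.
    """
    seq = seq.upper()
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in TETRAMERS)
    if total == 0:
        # Sequence too short or fully ambiguous: return an all-zero vector.
        return {k: 0.0 for k in TETRAMERS}
    return {k: counts[k] / total for k in TETRAMERS}


if __name__ == "__main__":
    freqs = tetranucleotide_frequencies("ACGTACGT")
    print(len(freqs), freqs["ACGT"])
```

Each read then becomes a fixed-length 256-dimensional vector, which is the kind of input a dimensionality-reduction model can embed in two dimensions.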
Affiliation(s)
- Claudia C Weber
  - Tree of Life, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK

4
Hegarty B, Riddell V J, Bastien E, Langenfeld K, Lindback M, Saini JS, Wing A, Zhang J, Duhaime M. Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods. mSystems 2024; 9:e0110523. [PMID: 38376167 PMCID: PMC10949488 DOI: 10.1128/msystems.01105-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Accepted: 01/24/2024] [Indexed: 02/21/2024]
Abstract
Understanding the ecological impacts of viruses on natural and engineered ecosystems relies on the accurate identification of viral sequences from community sequencing data. To maximize viral recovery from metagenomes, researchers frequently combine viral identification tools. However, the effectiveness of this strategy is unknown. Here, we benchmarked combinations of six widely used informatics tools for viral identification and analysis (VirSorter, VirSorter2, VIBRANT, DeepVirFinder, CheckV, and Kaiju), called "rulesets." Rulesets were tested against mock metagenomes composed of taxonomically diverse sequence types and diverse aquatic metagenomes to assess the effects of the degree of viral enrichment and habitat on tool performance. We found that six rulesets achieved equivalent accuracy [Matthews Correlation Coefficient (MCC) = 0.77, Padj ≥ 0.05]. Each contained VirSorter2, and five used our "tuning removal" rule designed to remove non-viral contamination. While DeepVirFinder, VIBRANT, and VirSorter were each found once in these high-accuracy rulesets, they were not found in combination with each other: combining tools does not lead to optimal performance. Our validation suggests that the MCC plateau at 0.77 is partly caused by inaccurate labeling within reference sequence databases. In aquatic metagenomes, our highest MCC ruleset identified more viral sequences in virus-enriched (44%-46%) than in cellular metagenomes (7%-19%). While improved algorithms may lead to more accurate viral identification tools, this should be done in tandem with careful curation of sequence databases. We recommend using the VirSorter2 ruleset and our empirically derived tuning removal rule. Our analysis provides insight into methods for in silico viral identification and will enable more robust viral identification from metagenomic data sets. 
IMPORTANCE The identification of viruses from environmental metagenomes using informatics tools has offered critical insights in microbial ecology. However, it remains difficult for researchers to know which tools optimize viral recovery for their specific study. In an attempt to recover more viruses, studies are increasingly combining the outputs from multiple tools without validating this approach. After benchmarking combinations of six viral identification tools against mock metagenomes and environmental samples, we found that these tools should only be combined cautiously. Two to four tool combinations maximized viral recovery and minimized non-viral contamination compared with either the single-tool or the five- to six-tool ones. By providing a rigorous overview of the behavior of in silico viral identification strategies and a pipeline to replicate our process, our findings guide the use of existing viral identification tools and offer a blueprint for feature engineering of new tools that will lead to higher-confidence viral discovery in microbiome studies.
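The accuracy measure reported in this abstract, the Matthews Correlation Coefficient, can be computed directly from a confusion matrix. The sketch below uses the standard formula; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    tp/fp/tn/fn are counts of sequences that a ruleset labeled viral or
    non-viral, compared against the known labels of a mock metagenome.
    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the call.
    print(round(mcc(tp=90, fp=10, tn=80, fn=20), 3))
```

Because the numerator rewards true positives and true negatives jointly, the score stays informative when viral sequences are a small minority of the input, which is the usual situation in cellular metagenomes.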
Affiliation(s)
- Bridget Hegarty
  - Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio, USA
- James Riddell V
  - Department of Microbiology, The Ohio State University, Columbus, Ohio, USA
- Eric Bastien
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
- Kathryn Langenfeld
  - Department of Civil and Environmental Engineering, Stanford University, Palo Alto, California, USA
- Morgan Lindback
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
- Jaspreet S. Saini
  - Laboratory for Environmental Biotechnology, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Anthony Wing
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
- Jessica Zhang
  - Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, Michigan, USA
- Melissa Duhaime
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA

5
Roach MJ, Beecroft SJ, Mihindukulasuriya KA, Wang L, Paredes A, Cárdenas LAC, Henry-Cocks K, Lima LFO, Dinsdale EA, Edwards RA, Handley SA. Hecatomb: an integrated software platform for viral metagenomics. Gigascience 2024; 13:giae020. [PMID: 38832467 PMCID: PMC11148595 DOI: 10.1093/gigascience/giae020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Revised: 01/18/2024] [Accepted: 04/08/2024] [Indexed: 06/05/2024]
Abstract
BACKGROUND Modern sequencing technologies offer extraordinary opportunities for virus discovery and virome analysis. Annotation of viral sequences from metagenomic data requires a complex series of steps to ensure accurate annotation of individual reads and assembled contigs. In addition, varying study designs will require project-specific statistical analyses. FINDINGS Here we introduce Hecatomb, a bioinformatic platform coordinating commonly used tasks required for virome analysis. Hecatomb means "a great sacrifice." In this setting, Hecatomb is "sacrificing" false-positive viral annotations using extensive quality control and tiered-database searches. Hecatomb processes metagenomic data obtained from both short- and long-read sequencing technologies, providing annotations to individual sequences and assembled contigs. Results are provided in commonly used data formats useful for downstream analysis. Here we demonstrate the functionality of Hecatomb through the reanalysis of a primate enteric and a novel coral reef virome. CONCLUSION Hecatomb provides an integrated platform to manage many commonly used steps for virome characterization, including rigorous quality control, host removal, and both read- and contig-based analysis. Each step is managed using the Snakemake workflow manager with dependency management using Conda. Hecatomb outputs several tables properly formatted for immediate use within popular data analysis and visualization tools, enabling effective data interpretation for a variety of study designs. Hecatomb is hosted on GitHub (github.com/shandley/hecatomb) and is available for installation from Bioconda and PyPI.
Affiliation(s)
- Michael J Roach
  - Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia
  - Adelaide Centre for Epigenetics, University of Adelaide, Adelaide, SA, 5005, Australia
  - South Australian Immunogenomics Cancer Institute, University of Adelaide, Adelaide, SA, 5005, Australia
- Sarah J Beecroft
  - Harry Perkins Institute of Medical Research, Perth, WA, 6009, Australia
- Kathie A Mihindukulasuriya
  - Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA
- Leran Wang
  - Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA
- Anne Paredes
  - Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA
- Luis Alberto Chica Cárdenas
  - Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA
- Kara Henry-Cocks
  - Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia
- Elizabeth A Dinsdale
  - Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia
- Robert A Edwards
  - Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia
- Scott A Handley
  - Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA

6
Vik D, Bolduc B, Roux S, Sun CL, Pratama AA, Krupovic M, Sullivan MB. MArVD2: a machine learning enhanced tool to discriminate between archaeal and bacterial viruses in viral datasets. ISME Communications 2023; 3:87. [PMID: 37620369 PMCID: PMC10449787 DOI: 10.1038/s43705-023-00295-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/04/2022] [Revised: 08/04/2023] [Accepted: 08/09/2023] [Indexed: 08/26/2023]
Abstract
Our knowledge of viral sequence space has exploded with advancing sequencing technologies and large-scale sampling and analytical efforts. Though archaea are important and abundant prokaryotes in many systems, our knowledge of archaeal viruses outside of extreme environments is limited. This largely stems from the lack of a robust, high-throughput, and systematic way to distinguish between bacterial and archaeal viruses in datasets of curated viruses. Here we upgrade our prior text-based tool (MArVD) via training and testing a random forest machine learning algorithm against a newly curated dataset of archaeal viruses. After optimization, MArVD2 presented a significant improvement over its predecessor in terms of scalability, usability, and flexibility, and will allow user-defined custom training datasets as archaeal virus discovery progresses. Benchmarking showed that a model trained with viral sequences from the hypersaline, marine, and hot spring environments correctly classified 85% of the archaeal viruses with a false detection rate below 2% using a random forest prediction threshold of 80% in a separate benchmarking dataset from the same habitats.
Affiliation(s)
- Dean Vik
  - Department of Microbiology, The Ohio State University, Columbus, OH, 43210, USA
  - Center of Microbiome Science, The Ohio State University, Columbus, OH, USA
- Benjamin Bolduc
  - Department of Microbiology, The Ohio State University, Columbus, OH, 43210, USA
  - Center of Microbiome Science, The Ohio State University, Columbus, OH, USA
- Simon Roux
  - DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Christine L Sun
  - Department of Microbiology, The Ohio State University, Columbus, OH, 43210, USA
  - Center of Microbiome Science, The Ohio State University, Columbus, OH, USA
- Akbar Adjie Pratama
  - Department of Microbiology, The Ohio State University, Columbus, OH, 43210, USA
  - Center of Microbiome Science, The Ohio State University, Columbus, OH, USA
- Mart Krupovic
  - Archaeal Virology Unit, Institut Pasteur, Université Paris Cité, CNRS UMR6047, Paris, France
- Matthew B Sullivan
  - Department of Microbiology, The Ohio State University, Columbus, OH, 43210, USA
  - Center of Microbiome Science, The Ohio State University, Columbus, OH, USA
  - Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, Columbus, OH, USA

7
Doss RK, Palmer M, Mead DA, Hedlund BP. Functional biology and biotechnology of thermophilic viruses. Essays Biochem 2023; 67:671-684. [PMID: 37222046 PMCID: PMC10423840 DOI: 10.1042/ebc20220209] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/31/2023] [Revised: 04/28/2023] [Accepted: 05/09/2023] [Indexed: 05/25/2023]
Abstract
Viruses have developed sophisticated biochemical and genetic mechanisms to manipulate and exploit their hosts. Enzymes derived from viruses have been essential research tools since the first days of molecular biology. However, most viral enzymes that have been commercialized are derived from a small number of cultivated viruses, which is remarkable considering the extraordinary diversity and abundance of viruses revealed by metagenomic analysis. Given the explosion of new enzymatic reagents derived from thermophilic prokaryotes over the past 40 years, those obtained from thermophilic viruses should be equally potent tools. This review discusses the still-limited state of the art regarding the functional biology and biotechnology of thermophilic viruses with a focus on DNA polymerases, ligases, endolysins, and coat proteins. Functional analysis of DNA polymerases and primase-polymerases from phages infecting Thermus, Aquificaceae, and Nitratiruptor has revealed new clades of enzymes with strong proofreading and reverse transcriptase capabilities. Thermophilic RNA ligase 1 homologs have been characterized from Rhodothermus and Thermus phages, with both commercialized for circularization of single-stranded templates. Endolysins from phages infecting Thermus, Meiothermus, and Geobacillus have shown high stability and unusually broad lytic activity against Gram-negative and Gram-positive bacteria, making them targets for commercialization as antimicrobials. Coat proteins from thermophilic viruses infecting Sulfolobales and Thermus strains have been characterized, with diverse potential applications as molecular shuttles. To gauge the scale of untapped resources for these proteins, we also document over 20,000 genes encoded by uncultivated viral genomes from high-temperature environments that encode DNA polymerase, ligase, endolysin, or coat protein domains.
Affiliation(s)
- Ryan K Doss
  - School of Life Sciences, University of Nevada, Las Vegas, Las Vegas, Nevada, U.S.A
- Marike Palmer
  - School of Life Sciences, University of Nevada, Las Vegas, Las Vegas, Nevada, U.S.A
- Brian P Hedlund
  - School of Life Sciences, University of Nevada, Las Vegas, Las Vegas, Nevada, U.S.A
  - Nevada Institute of Personalized Medicine, Las Vegas, Nevada, U.S.A

8
Ho SFS, Wheeler NE, Millard AD, van Schaik W. Gauge your phage: benchmarking of bacteriophage identification tools in metagenomic sequencing data. Microbiome 2023; 11:84. [PMID: 37085924 PMCID: PMC10120246 DOI: 10.1186/s40168-023-01533-x] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/11/2022] [Accepted: 03/22/2023] [Indexed: 05/03/2023]
Abstract
BACKGROUND The prediction of bacteriophage sequences in metagenomic datasets has become a topic of considerable interest, leading to the development of many novel bioinformatic tools. A comparative analysis of ten state-of-the-art phage identification tools was performed to inform their usage in microbiome research. METHODS Artificial contigs generated from complete RefSeq genomes representing phages, plasmids, and chromosomes, and a previously sequenced mock community containing four phage species, were used to evaluate the precision, recall, and F1 scores of the tools. We also generated a dataset of randomly shuffled sequences to quantify false-positive calls. In addition, a set of previously simulated viromes was used to assess diversity bias in each tool's output. RESULTS VIBRANT and VirSorter2 achieved the highest F1 scores (0.93) in the RefSeq artificial contigs dataset, with several other tools also performing well. Kraken2 had the highest F1 score (0.86) in the mock community benchmark by a large margin (0.3 higher than DeepVirFinder in second place), mainly due to its high precision (0.96). Generally, k-mer-based tools performed better than reference similarity tools and gene-based methods. Several tools, most notably PPR-Meta, called a high number of false positives in the randomly shuffled sequences. When analysing the diversity of the genomes that each tool predicted from a virome set, most tools produced a viral genome set that had similar alpha- and beta-diversity patterns to the original population, with Seeker being a notable exception. CONCLUSIONS This study provides key metrics used to assess performance of phage detection tools, offers a framework for further comparison of additional viral discovery tools, and discusses optimal strategies for using these tools. We highlight that the choice of tool for identification of phages in metagenomic datasets, as well as their parameters, can bias the results and provide pointers for different use case scenarios. We have also made our benchmarking dataset available for download in order to facilitate future comparisons of phage identification tools.
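The precision, recall, and F1 scores named in this abstract follow the standard definitions. A minimal sketch, assuming contig-level counts of true positives, false positives, and false negatives; the function name and the example counts are hypothetical and do not reproduce any result from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from contig-level counts.

    tp: phage contigs correctly called phage
    fp: non-phage contigs (e.g. plasmid or chromosome) called phage
    fn: phage contigs the tool missed
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Since F1 is a harmonic mean, a tool with very high precision can still score modestly if its recall is low, which is why the abstract reports precision alongside F1.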
Affiliation(s)
- Siu Fung Stanley Ho
  - Institute of Microbiology and Infection, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Nicole E. Wheeler
  - Institute of Microbiology and Infection, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Andrew D. Millard
  - Department of Genetics and Genome Biology, University of Leicester, Leicester, UK
- Willem van Schaik
  - Institute of Microbiology and Infection, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK

9
Schackart KE, Graham JB, Ponsero AJ, Hurwitz BL. Evaluation of computational phage detection tools for metagenomic datasets. Front Microbiol 2023; 14:1078760. [PMID: 36760501 PMCID: PMC9902911 DOI: 10.3389/fmicb.2023.1078760] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/24/2022] [Accepted: 01/09/2023] [Indexed: 01/25/2023]
Abstract
Introduction: As new computational tools for detecting phage in metagenomes are being rapidly developed, a critical need has emerged to develop systematic benchmarks. Methods: In this study, we surveyed 19 metagenomic phage detection tools, 9 of which could be installed and run at scale. Those 9 tools were assessed on several benchmark challenges. Fragmented reference genomes are used to assess the effects of fragment length, low viral content, phage taxonomy, robustness to eukaryotic contamination, and computational resource usage. Simulated metagenomes are used to assess the effects of sequencing and assembly quality on tool performance. Finally, real human gut metagenomes and viromes are used to assess the differences and similarities in the phage communities predicted by the tools. Results: We find that the various tools yield strikingly different results. Generally, tools that use a homology approach (VirSorter, MARVEL, viralVerify, VIBRANT, and VirSorter2) demonstrate low false positive rates and robustness to eukaryotic contamination. Conversely, tools that use a sequence composition approach (VirFinder, DeepVirFinder, Seeker), and MetaPhinder, have higher sensitivity, including to phages with less representation in reference databases. These differences led to widely differing predicted phage communities in human gut metagenomes, with nearly 80% of contigs being marked as phage by at least one tool and a maximum overlap of 38.8% between any two tools. While the results were more consistent among the tools on viromes, the differences in results were still significant, with a maximum overlap of 60.65%. Discussion: Importantly, the benchmark datasets developed in this study are publicly available and reusable to enable the future comparability of new tools developed.
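The two agreement figures in this abstract (the share of contigs called phage by at least one tool, and the maximum overlap between any two tools) can be illustrated with set operations. The abstract does not define "overlap", so the sketch below assumes the Jaccard index (intersection over union); the function names, tool labels, and contig identifiers are all hypothetical.

```python
from itertools import combinations


def pairwise_jaccard(predictions):
    """Jaccard overlap for every pair of tools.

    predictions maps a tool name to the set of contig IDs it called phage.
    Jaccard is an assumed definition of overlap, used for illustration.
    """
    overlaps = {}
    for a, b in combinations(sorted(predictions), 2):
        union = predictions[a] | predictions[b]
        shared = predictions[a] & predictions[b]
        overlaps[(a, b)] = len(shared) / len(union) if union else 0.0
    return overlaps


def fraction_called_by_any(predictions, n_contigs):
    """Fraction of all contigs that at least one tool called phage."""
    called = set().union(*predictions.values()) if predictions else set()
    return len(called) / n_contigs if n_contigs else 0.0


if __name__ == "__main__":
    # Hypothetical predictions on a metagenome of 8 contigs.
    preds = {
        "tool_a": {"c1", "c2", "c3"},
        "tool_b": {"c2", "c3", "c4"},
        "tool_c": set(),
    }
    print(max(pairwise_jaccard(preds).values()))
    print(fraction_called_by_any(preds, n_contigs=8))
```

A high "called by any" fraction combined with low pairwise overlap is the pattern the abstract describes: the tools collectively flag most contigs while agreeing on few of them.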
Affiliation(s)
- Kenneth E. Schackart
  - Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
- Jessica B. Graham
  - BIO5 Institute, The University of Arizona, Tucson, AZ, United States
- Alise J. Ponsero
  - Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
  - BIO5 Institute, The University of Arizona, Tucson, AZ, United States
  - Human Microbiome Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Bonnie L. Hurwitz
  - Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
  - BIO5 Institute, The University of Arizona, Tucson, AZ, United States

10
Camargo AP, Nayfach S, Chen IMA, Palaniappan K, Ratner A, Chu K, Ritter S, Reddy TBK, Mukherjee S, Schulz F, Call L, Neches R, Woyke T, Ivanova N, Eloe-Fadrosh E, Kyrpides N, Roux S. IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Res 2023; 51:D733-D743. [PMID: 36399502 PMCID: PMC9825611 DOI: 10.1093/nar/gkac1037] [Citation(s) in RCA: 126] [Impact Index Per Article: 63.0] [Received: 09/21/2022] [Revised: 10/15/2022] [Accepted: 10/25/2022] [Indexed: 11/19/2022]
Abstract
Viruses are widely recognized as critical members of all microbiomes. Metagenomics enables large-scale exploration of the global virosphere, progressively revealing the extensive genomic diversity of viruses on Earth and highlighting the myriad ways in which viruses impact biological processes. IMG/VR provides access to the largest collection of viral sequences obtained from (meta)genomes, along with functional annotation and rich metadata. A web interface enables users to efficiently browse and search viruses based on genome features and/or sequence similarity. Here, we present the fourth version of IMG/VR, composed of >15 million virus genomes and genome fragments, a ≈6-fold increase in size compared to the previous version. These clustered into 8.7 million viral operational taxonomic units, including 231 408 with at least one high-quality representative. Viral sequences in IMG/VR are now systematically identified from genomes, metagenomes, and metatranscriptomes using a new detection approach (geNomad), and IMG standard annotations are complemented with genome quality estimation using CheckV, taxonomic classification reflecting the latest taxonomic standards, and microbial host taxonomy prediction. IMG/VR v4 is available at https://img.jgi.doe.gov/vr, and the underlying data are available to download at https://genome.jgi.doe.gov/portal/IMG_VR.
Collapse
Affiliation(s)
- Antonio Pedro Camargo
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Stephen Nayfach
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - I-Min A Chen
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | | | - Anna Ratner
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Ken Chu
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Stephan J Ritter
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - T B K Reddy
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Supratim Mukherjee
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Frederik Schulz
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Lee Call
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Russell Y Neches
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Tanja Woyke
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Natalia N Ivanova
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Emiley A Eloe-Fadrosh
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Nikos C Kyrpides
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Simon Roux
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
|
11
|
Liu Q, Liu F, Miao Y, He J, Dong T, Hou T, Liu Y. Virsearcher: Identifying Bacteriophages from Metagenomes by Combining Convolutional Neural Network and Gene Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:763-774. [PMID: 35316191 DOI: 10.1109/tcbb.2022.3161135] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 05/04/2023]
Abstract
Metagenome sequencing provides an unprecedented opportunity for the discovery of unknown microbes and viruses. A large number of phages and prokaryotes are mixed together in metagenomes. To study the influence of phages on human bodies and environments, it is of great significance to isolate phages from metagenomes. However, it is difficult to identify novel phages because of the diversity of their sequences and the frequent presence of short contigs in metagenomes. Here, virSearcher is developed to identify phages from metagenomes by combining a convolutional neural network (CNN) with the gene information of input sequences. First, an input sequence is encoded according to the different functions of its coding and non-coding regions and is then converted into a word embedding code by a word embedding layer placed before a convolutional layer. Meanwhile, the hit ratio of the virus genes is combined with the output of the CNN to further improve the performance of the network. The genes used by virSearcher consist of complete and incomplete genes. Experiments on several metagenomes have shown that, compared with other tools, virSearcher can significantly improve the identification of short sequences while maintaining performance on long ones. The source code of virSearcher is freely available from http://github.com/DrJackson18/virSearcher.
|
12
|
Andrade-Martínez JS, Camelo Valera LC, Chica Cárdenas LA, Forero-Junco L, López-Leal G, Moreno-Gallego JL, Rangel-Pineros G, Reyes A. Computational Tools for the Analysis of Uncultivated Phage Genomes. Microbiol Mol Biol Rev 2022; 86:e0000421. [PMID: 35311574 PMCID: PMC9199400 DOI: 10.1128/mmbr.00004-21] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 12/15/2022]
Abstract
Over a century of bacteriophage research has uncovered a plethora of fundamental aspects of their biology, ecology, and evolution. Furthermore, the introduction of community-level studies through metagenomics has revealed unprecedented insights on the impact that phages have on a range of ecological and physiological processes. It was not until the introduction of viral metagenomics that we began to grasp the astonishing breadth of genetic diversity encompassed by phage genomes. Novel phage genomes have been reported from a diverse range of biomes at an increasing rate, which has prompted the development of computational tools that support the multilevel characterization of these novel phages based solely on their genome sequences. The impact of these technologies has been so large that, together with MAGs (Metagenomic Assembled Genomes), we now have UViGs (Uncultivated Viral Genomes), which are now officially recognized by the International Committee for the Taxonomy of Viruses (ICTV), and new taxonomic groups can now be created based exclusively on genomic sequence information. Even though the available tools have immensely contributed to our knowledge of phage diversity and ecology, the ongoing surge in software programs makes it challenging to keep up with them and the purpose each one is designed for. Therefore, in this review, we describe a comprehensive set of currently available computational tools designed for the characterization of phage genome sequences, focusing on five specific analyses: (i) assembly and identification of phage and prophage sequences, (ii) phage genome annotation, (iii) phage taxonomic classification, (iv) phage-host interaction analysis, and (v) phage microdiversity.
Affiliation(s)
- Juan Sebastián Andrade-Martínez
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
| | - Laura Carolina Camelo Valera
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
| | - Luis Alberto Chica Cárdenas
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
| | - Laura Forero-Junco
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Department of Plant and Environmental Science, University of Copenhagen, Frederiksberg, Denmark
| | - Gamaliel López-Leal
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
| | - J. Leonardo Moreno-Gallego
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Department of Microbiome Science, Max Planck Institute for Developmental Biology, Tübingen, Germany
| | - Guillermo Rangel-Pineros
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- The GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Alejandro Reyes
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri, USA
|
13
|
Kieft K, Anantharaman K. Virus genomics: what is being overlooked? Curr Opin Virol 2022; 53:101200. [DOI: 10.1016/j.coviro.2022.101200] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/17/2021] [Revised: 12/21/2021] [Accepted: 01/03/2022] [Indexed: 01/05/2023]
|
14
|
Ponsero AJ, Hurwitz BL, Magain N, Miadlikowska J, Lutzoni F, U'Ren JM. Cyanolichen microbiome contains novel viruses that encode genes to promote microbial metabolism. ISME COMMUNICATIONS 2021; 1:56. [PMID: 37938275 PMCID: PMC9723557 DOI: 10.1038/s43705-021-00060-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2021] [Revised: 09/23/2021] [Accepted: 09/27/2021] [Indexed: 11/09/2023]
Abstract
Lichen thalli are formed through the symbiotic association of a filamentous fungus and photosynthetic green alga and/or cyanobacterium. Recent studies have revealed lichens also host highly diverse communities of secondary fungal and bacterial symbionts, yet few studies have examined the viral component within these complex symbioses. Here, we describe viral biodiversity and functions in cyanolichens collected from across North America and Europe. As current machine-learning viral-detection tools are not trained on complex eukaryotic metagenomes, we first developed efficient methods to remove eukaryotic reads prior to viral detection and a custom pipeline to validate viral contigs predicted with three machine-learning methods. Our resulting high-quality viral data illustrate that every cyanolichen thallus contains diverse viruses that are distinct from viruses in other terrestrial ecosystems. In addition to cyanobacteria, predicted viral hosts include other lichen-associated bacterial lineages and algae, although a large fraction of viral contigs had no host prediction. Functional annotation of cyanolichen viral sequences predicts numerous viral-encoded auxiliary metabolic genes (AMGs) involved in amino acid, nucleotide, and carbohydrate metabolism, including AMGs for secondary metabolism (antibiotics and antimicrobials) and fatty acid biosynthesis. Overall, the diversity of cyanolichen AMGs suggests that viruses may alter microbial interactions within these complex symbiotic assemblages.
Affiliation(s)
- Alise J Ponsero
- BIO5 Institute and Department of Biosystems Engineering, University of Arizona, Tucson, AZ, 85721, USA
- Department of Medicine, University of Helsinki, Helsinki, Finland
| | - Bonnie L Hurwitz
- BIO5 Institute and Department of Biosystems Engineering, University of Arizona, Tucson, AZ, 85721, USA
| | - Nicolas Magain
- Department of Biology, Duke University, Durham, NC, 27708, USA
- Evolution and Conservation Biology, InBioS, University of Liège, Liège, Belgium
| | | | | | - Jana M U'Ren
- BIO5 Institute and Department of Biosystems Engineering, University of Arizona, Tucson, AZ, 85721, USA
|
15
|
Glickman C, Hendrix J, Strong M. Simulation study and comparative evaluation of viral contiguous sequence identification tools. BMC Bioinformatics 2021; 22:329. [PMID: 34130621 PMCID: PMC8207588 DOI: 10.1186/s12859-021-04242-0] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 03/01/2021] [Accepted: 06/04/2021] [Indexed: 01/21/2023]
Abstract
BACKGROUND Viruses, including bacteriophages, are important components of environmental and human-associated microbial communities. Viruses can act as extracellular reservoirs of bacterial genes, can mediate microbiome dynamics, and can influence the virulence of clinical pathogens. Various targeted metagenomic analysis techniques detect viral sequences, but these methods often exclude large and genome-integrated viruses. In this study, we evaluate and compare the ability of nine state-of-the-art bioinformatic tools, including Vibrant, VirSorter, VirSorter2, VirFinder, DeepVirFinder, MetaPhinder, Kraken2, Phybrid, and a BLAST search using identified proteins from the Earth Virome Pipeline, to identify viral contiguous sequences (contigs) across simulated metagenomes with different read distributions, taxonomic compositions, and complexities. RESULTS Of the tools tested in this study, VirSorter achieved the best F1 score, while Vibrant had the highest average F1 score at predicting integrated prophages. Though less balanced in its precision and recall, Kraken2 had the highest average precision by a substantial margin. We introduced the machine learning tool Phybrid, which demonstrated an improvement in average F1 score over tools such as MetaPhinder. The tool utilizes machine learning with both gene content and nucleotide features. The addition of nucleotide features improves the precision and recall compared to the gene content features alone. Viral identification by all tools was not impacted by underlying read distribution but did improve with contig length. Tool performance was inversely related to taxonomic complexity and varied by the phage host. For instance, Rhizobium and Enterococcus phages were identified consistently by the tools, whereas Neisseria prophage sequences were commonly missed in this study.
CONCLUSION This study benchmarked the performance of nine state-of-the-art bioinformatic tools to identify viral contigs across different simulation conditions. This study explored the ability of the tools to identify integrated prophage elements traditionally excluded from targeted sequencing approaches. Our comprehensive analysis of viral identification tools to assess their performance in a variety of situations provides valuable insights to viral researchers looking to mine viral elements from publicly available metagenomic data.
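The precision, recall, and F1 score reported in this benchmark can be illustrated with a minimal Python sketch. The function name and the toy labels below are hypothetical and are not taken from the study; the sketch only assumes that each contig carries a true viral/non-viral label and a tool's prediction.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Return (precision, recall, F1) for binary viral / non-viral contig calls.

    Both arguments are equal-length sequences of booleans, where True means
    "viral". This helper is illustrative, not part of any benchmarked tool.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for truth, pred in pairs if truth and pred)        # viral, called viral
    fp = sum(1 for truth, pred in pairs if not truth and pred)    # non-viral, called viral
    fn = sum(1 for truth, pred in pairs if truth and not pred)    # viral, missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy example (made-up labels): 4 viral and 4 non-viral contigs.
truth = [True, True, True, False, False, False, True, False]
calls = [True, True, False, False, True, False, True, False]

p, r, f1 = precision_recall_f1(truth, calls)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# prints: precision=0.75 recall=0.75 F1=0.75
```

A tool with very high precision but low recall, as described for Kraken2 above, would score a lower F1 than a tool with balanced values, because the harmonic mean is pulled toward the smaller of the two.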
Affiliation(s)
- Cody Glickman
- Center for Genes, Environment, and Health, National Jewish Health, 1400 Jackson Street, Denver, CO, 80206, USA
- Computational Bioscience, University of Colorado Anschutz, 12801 E 17th Avenue, Aurora, CO, 80045, USA
| | - Jo Hendrix
- Center for Genes, Environment, and Health, National Jewish Health, 1400 Jackson Street, Denver, CO, 80206, USA
- Computational Bioscience, University of Colorado Anschutz, 12801 E 17th Avenue, Aurora, CO, 80045, USA
| | - Michael Strong
- Center for Genes, Environment, and Health, National Jewish Health, 1400 Jackson Street, Denver, CO, 80206, USA
- Computational Bioscience, University of Colorado Anschutz, 12801 E 17th Avenue, Aurora, CO, 80045, USA
|
16
|
Pratama AA, Bolduc B, Zayed AA, Zhong ZP, Guo J, Vik DR, Gazitúa MC, Wainaina JM, Roux S, Sullivan MB. Expanding standards in viromics: in silico evaluation of dsDNA viral genome identification, classification, and auxiliary metabolic gene curation. PeerJ 2021; 9:e11447. [PMID: 34178438 PMCID: PMC8210812 DOI: 10.7717/peerj.11447] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 11/27/2020] [Accepted: 04/22/2021] [Indexed: 12/18/2022]
Abstract
BACKGROUND Viruses influence global patterns of microbial diversity and nutrient cycles. Though viral metagenomics (viromics), specifically targeting dsDNA viruses, has been critical for revealing viral roles across diverse ecosystems, its analyses differ in many ways from those used for microbes. To date, viromics benchmarking has covered read pre-processing, assembly, relative abundance, read mapping thresholds and diversity estimation, but other steps would benefit from benchmarking and standardization. Here we use in silico-generated datasets and an extensive literature survey to evaluate and highlight how dataset composition (i.e., viromes vs bulk metagenomes) and assembly fragmentation impact (i) viral contig identification tools, (ii) virus taxonomic classification, and (iii) identification and curation of auxiliary metabolic genes (AMGs). RESULTS The in silico benchmarking of five commonly used virus identification tools shows that gene-content-based tools consistently performed well for long (≥3 kbp) contigs, while k-mer- and blast-based tools were uniquely able to detect viruses from short (≤3 kbp) contigs. Notably, however, the performance increase of k-mer- and blast-based tools for short contigs was obtained at the cost of increased false positives (sometimes up to ∼5% for virome and ∼75% for bulk samples), particularly when eukaryotic or mobile genetic element sequences were included in the test datasets. For viral classification, variously sized genome fragments were assessed using gene-sharing network analytics to quantify drop-offs in taxonomic assignments, which revealed correct assignments ranging from ∼95% (whole genomes) down to ∼80% (3 kbp sized genome fragments). A similar trend was also observed for other viral classification tools such as VPF-class, ViPTree and VIRIDIC, suggesting that caution is warranted when classifying short genome fragments rather than full genomes.
Finally, we highlight how fragmented assemblies can lead to erroneous identification of AMGs and outline a best-practices workflow to curate candidate AMGs in viral genomes assembled from metagenomes. CONCLUSION Together, these benchmarking experiments and annotation guidelines should aid researchers seeking to best detect, classify, and characterize the myriad viruses 'hidden' in diverse sequence datasets.
Affiliation(s)
- Akbar Adjie Pratama
- Department of Microbiology, Ohio State University, Columbus, OH, United States of America
- Center of Microbiome Science, Ohio State University, Columbus, OH, United States of America
| | - Benjamin Bolduc
- Department of Microbiology, Ohio State University, Columbus, OH, United States of America
- Center of Microbiome Science, Ohio State University, Columbus, OH, United States of America
| | - Ahmed A. Zayed
- Department of Microbiology, Ohio State University, Columbus, OH, United States of America
- Center of Microbiome Science, Ohio State University, Columbus, OH, United States of America
| | - Zhi-Ping Zhong
- Department of Microbiology, Ohio State University, Columbus, OH, United States of America
- Center of Microbiome Science, Ohio State University, Columbus, OH, United States of America
- Byrd Polar and Climate Research Center, Ohio State University, Columbus, OH, United States of America
| | - Jiarong Guo
- Department of Microbiology, Ohio State University, Columbus, OH, United States of America
- Center of Microbiome Science, Ohio State University, Columbus, OH, United States of America
| | - Dean R. Vik
- Department of Microbiology, Ohio State University, Columbus, OH, United States of America
- Center of Microbiome Science, Ohio State University, Columbus, OH, United States of America
| | | | - James M. Wainaina
- Department of Microbiology, Ohio State University, Columbus, OH, United States of America
- Center of Microbiome Science, Ohio State University, Columbus, OH, United States of America
- Infectious Diseases Institute at The Ohio State University, Ohio State University, Columbus, OH, United States of America
| | - Simon Roux
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
| | - Matthew B. Sullivan
- Department of Microbiology, Ohio State University, Columbus, OH, United States of America
- Center of Microbiome Science, Ohio State University, Columbus, OH, United States of America
- Department of Civil, Environmental and Geodetic Engineering, Ohio State University, Columbus, OH, United States of America
|
17
|
Song K. Reads Binning Improves the Assembly of Viral Genome Sequences From Metagenomic Samples. Front Microbiol 2021; 12:664560. [PMID: 34093479 PMCID: PMC8175635 DOI: 10.3389/fmicb.2021.664560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2021] [Accepted: 04/27/2021] [Indexed: 11/13/2022]
Abstract
Metagenomes can be considered as mixtures of viral, bacterial, and eukaryotic DNA sequences. Mining viral sequences from metagenomes could provide insight into virus-host relationships and expand viral databases. Current alignment-based methods are unsuitable for identifying viral sequences from metagenome sequences because most assembled metagenomic contigs are short and possess few or no predicted genes, and most metagenomic viral genes are dissimilar to known viral genes. In this study, I developed a Markov model-based method, VirMC, to identify viral sequences from metagenomic data. VirMC uses Markov chains to model sequence signatures and constructs a scoring model using a likelihood test to distinguish viral and bacterial sequences. Compared with two other state-of-the-art viral sequence-prediction methods, VirFinder and PPR-Meta, my proposed method outperformed VirFinder and performed similarly to PPR-Meta for short contigs with length less than 400 bp. VirMC outperformed VirFinder and PPR-Meta for identifying viral sequences in metagenomic samples contaminated with eukaryotic sequences. VirMC showed better performance in assembling viral-genome sequences from metagenomic data (based on filtering potential bacterial reads). Applying VirMC to human gut metagenomes from healthy subjects and patients with type-2 diabetes (T2D) revealed that viral contigs could help classify healthy and diseased statuses. This alignment-free method complements gene-based alignment approaches and will significantly improve the precision of viral sequence identification.
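The general idea of scoring a sequence with Markov chains and a likelihood comparison can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not VirMC's implementation: the chain order, the pseudocount, the function names, and the toy training sequences are all hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_markov_chain(sequences, order=2, pseudocount=1.0):
    """Estimate log P(next base | previous `order` bases) from training sequences.

    `order` and `pseudocount` are illustrative defaults, not published settings.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous characters such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1

    model = {}
    for context in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[context].values()) + pseudocount * len(ALPHABET)
        model[context] = {
            base: math.log((counts[context][base] + pseudocount) / total)
            for base in ALPHABET
        }
    return model


def log_likelihood(seq, model, order=2):
    """Sum the log transition probabilities of `seq` under `model`."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        if context in model and base in ALPHABET:
            total += model[context][base]
    return total


def viral_score(seq, viral_model, bacterial_model, order=2):
    """Log-likelihood ratio; positive values favor the viral model."""
    return (log_likelihood(seq, viral_model, order)
            - log_likelihood(seq, bacterial_model, order))


# Toy training data (made up purely so the example runs).
viral_model = train_markov_chain(["ATATATATATATATATATAT"])
bacterial_model = train_markov_chain(["GCGCGCGCGCGCGCGCGCGC"])

print(viral_score("ATATATAT", viral_model, bacterial_model) > 0)
print(viral_score("GCGCGCGC", viral_model, bacterial_model) < 0)
```

Because the score depends only on short-range nucleotide composition, it needs no gene prediction or alignment, which is why this family of methods remains applicable to very short contigs.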
Affiliation(s)
- Kai Song
- School of Mathematics and Statistics, Qingdao University, Qingdao, China
|
18
|
Guo J, Bolduc B, Zayed AA, Varsani A, Dominguez-Huerta G, Delmont TO, Pratama AA, Gazitúa MC, Vik D, Sullivan MB, Roux S. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. MICROBIOME 2021; 9:37. [PMID: 33522966 PMCID: PMC7852108 DOI: 10.1186/s40168-020-00990-y] [Citation(s) in RCA: 518] [Impact Index Per Article: 129.5] [Received: 07/04/2020] [Accepted: 12/29/2020] [Indexed: 05/19/2023]
Abstract
BACKGROUND Viruses are significant players in many biosphere and human ecosystems, but most signals remain "hidden" in metagenomic/metatranscriptomic sequence datasets due to the lack of universal gene markers and database representatives and to insufficiently advanced identification tools. RESULTS Here, we introduce VirSorter2, a DNA and RNA virus identification tool that leverages genome-informed database advances across a collection of customized automatic classifiers to improve the accuracy and range of virus sequence detection. When benchmarked against genomes from both isolated and uncultivated viruses, VirSorter2 uniquely performed consistently with high accuracy (F1-score > 0.8) across viral diversity, while all other tools under-detected viruses outside of the group most represented in reference databases (i.e., those in the order Caudovirales). Among the tools evaluated, VirSorter2 was also uniquely able to minimize errors associated with atypical cellular sequences including eukaryotic genomes and plasmids. Finally, as the virosphere exploration unravels novel viral sequences, VirSorter2's modular design makes it inherently able to expand to new types of viruses via the design of new classifiers to maintain maximal sensitivity and specificity. CONCLUSION With multi-classifier and modular design, VirSorter2 demonstrates higher overall accuracy across major viral groups and will advance our knowledge of virus evolution, diversity, and virus-microbe interaction in various ecosystems. Source code of VirSorter2 is freely available (https://bitbucket.org/MAVERICLab/virsorter2), and VirSorter2 is also available both on bioconda and as an iVirus app on CyVerse (https://de.cyverse.org/de).
Affiliation(s)
- Jiarong Guo
- Department of Microbiology, Ohio State University, Columbus, OH, 43210, USA
| | - Ben Bolduc
- Department of Microbiology, Ohio State University, Columbus, OH, 43210, USA
| | - Ahmed A Zayed
- Department of Microbiology, Ohio State University, Columbus, OH, 43210, USA
| | - Arvind Varsani
- The Biodesign Center for Fundamental and Applied Microbiomics, Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, 85287, USA
- Structural Biology Research Unit, Department of Integrative Biomedical Sciences, University of Cape Town, Observatory, Cape Town, 7701, South Africa
| | | | - Tom O Delmont
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France
| | | | | | - Dean Vik
- Department of Microbiology, Ohio State University, Columbus, OH, 43210, USA
| | - Matthew B Sullivan
- Department of Microbiology, Ohio State University, Columbus, OH, 43210, USA
- Civil, Environmental and Geodetic Engineering, Ohio State University, Columbus, OH, 43210, USA
- Center of Microbiome Science, Ohio State University, Columbus, OH, 43210, USA
| | - Simon Roux
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
|
19
|
Roux S, Páez-Espino D, Chen IMA, Palaniappan K, Ratner A, Chu K, Reddy TBK, Nayfach S, Schulz F, Call L, Neches RY, Woyke T, Ivanova NN, Eloe-Fadrosh EA, Kyrpides NC. IMG/VR v3: an integrated ecological and evolutionary framework for interrogating genomes of uncultivated viruses. Nucleic Acids Res 2021; 49:D764-D775. [PMID: 33137183 PMCID: PMC7778971 DOI: 10.1093/nar/gkaa946] [Citation(s) in RCA: 215] [Impact Index Per Article: 53.8] [Received: 09/14/2020] [Revised: 10/02/2020] [Accepted: 10/09/2020] [Indexed: 12/28/2022]
Abstract
Viruses are integral components of all ecosystems and microbiomes on Earth. Through pervasive infections of their cellular hosts, viruses can reshape microbial community structure and drive global nutrient cycling. Over the past decade, viral sequences identified from genomes and metagenomes have provided an unprecedented view of viral genome diversity in nature. Since 2016, the IMG/VR database has provided access to the largest collection of viral sequences obtained from (meta)genomes. Here, we present the third version of IMG/VR, composed of 18 373 cultivated and 2 314 329 uncultivated viral genomes (UViGs), nearly tripling the total number of sequences compared to the previous version. These clustered into 935 362 viral Operational Taxonomic Units (vOTUs), including 188 930 with two or more members. UViGs in IMG/VR are now reported as single viral contigs, integrated proviruses or genome bins, and are annotated with a new standardized pipeline including genome quality estimation using CheckV, taxonomic classification reflecting the latest ICTV update, and expanded host taxonomy prediction. The new IMG/VR interface enables users to efficiently browse, search, and select UViGs based on genome features and/or sequence similarity. IMG/VR v3 is available at https://img.jgi.doe.gov/vr, and the underlying data are available to download at https://genome.jgi.doe.gov/portal/IMG_VR.
Affiliation(s)
- Simon Roux
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - David Páez-Espino
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - I-Min A Chen
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Krishna Palaniappan
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Anna Ratner
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Ken Chu
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - T B K Reddy
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Stephen Nayfach
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Frederik Schulz
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Lee Call
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Russell Y Neches
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Tanja Woyke
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Natalia N Ivanova
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Emiley A Eloe-Fadrosh
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Nikos C Kyrpides
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
|
20
|
Abstract
Viruses are extremely diverse and modulate important biological and ecological processes globally. However, much of viral diversity remains uncultured and yet to be discovered. Several powerful culture-independent tools, in particular metagenomics, have substantially advanced virus discovery. Among those tools is single-virus genomics, which yields sequenced reference genomes from individual sorted virus particles without the need for cultivation. This new method complements virus culturing and metagenomic approaches and its advantages include targeted investigation of specific virus groups and investigation of genomic microdiversity within viral populations. In this Review, we provide a brief history of single-virus genomics, outline how this emergent method has facilitated advances in virus ecology and discuss its current limitations and future potential. Finally, we address how this method may synergistically intersect with other single-virus and single-cell approaches.
|
21
|
Berbers B, Ceyssens PJ, Bogaerts P, Vanneste K, Roosens NHC, Marchal K, De Keersmaecker SCJ. Development of an NGS-Based Workflow for Improved Monitoring of Circulating Plasmids in Support of Risk Assessment of Antimicrobial Resistance Gene Dissemination. Antibiotics (Basel) 2020; 9:E503. [PMID: 32796589 PMCID: PMC7460218 DOI: 10.3390/antibiotics9080503] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 07/17/2020] [Revised: 08/07/2020] [Accepted: 08/08/2020] [Indexed: 11/29/2022]
Abstract
Antimicrobial resistance (AMR) is one of the most prominent public health threats. AMR genes localized on plasmids can be easily transferred between bacterial isolates by horizontal gene transfer, thereby contributing to the spread of AMR. Next-generation sequencing (NGS) technologies are ideal for the detection of AMR genes; however, reliable reconstruction of plasmids is still a challenge due to large repetitive regions. This study proposes a workflow to reconstruct plasmids with NGS data in view of AMR gene localization, i.e., chromosomal or on a plasmid. Whole-genome and plasmid DNA extraction methods were compared, as were assemblies consisting of short reads (Illumina MiSeq), long reads (Oxford Nanopore Technologies) and a combination of both (hybrid). Furthermore, the added value of conjugation of a plasmid to a known host was evaluated. As a case study, an isolate harboring a large, low-copy mcr-1-carrying plasmid (>200 kb) was used. Hybrid assemblies of NGS data obtained from whole-genome DNA extractions of the original isolates resulted in the most complete reconstruction of plasmids. The optimal workflow was successfully applied to multidrug-resistant Salmonella Kentucky isolates, where the transfer of an ESBL-gene-containing fragment from a plasmid to the chromosome was detected. This study highlights a strategy including wet and dry lab parameters that allows accurate plasmid reconstruction, which will contribute to an improved monitoring of circulating plasmids and the assessment of their risk of transfer.
Affiliation(s)
- Bas Berbers
- Transversal Activities in Applied Genomics, Sciensano, 1050 Brussels, Belgium
- Department of Information Technology, IDLab, Ghent University, IMEC, 9052 Ghent, Belgium
| | | | - Pierre Bogaerts
- National Reference Center for Antimicrobial Resistance in Gram-Negative Bacteria, CHU UCL Namur, 5530 Yvoir, Belgium;
| | - Kevin Vanneste
- Transversal Activities in Applied Genomics, Sciensano, 1050 Brussels, Belgium; (B.B.); (K.V.); (N.H.C.R.)
| | - Nancy H. C. Roosens
- Transversal Activities in Applied Genomics, Sciensano, 1050 Brussels, Belgium; (B.B.); (K.V.); (N.H.C.R.)
| | - Kathleen Marchal
- Department of Information Technology, IDLab, Ghent University, IMEC, 9052 Ghent, Belgium;
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Department of Genetics, University of Pretoria, Pretoria 0083, South Africa
| | | |
|
22
|
Khot V, Strous M, Hawley AK. Computational approaches in viral ecology. Comput Struct Biotechnol J 2020; 18:1605-1612. [PMID: 32670501 PMCID: PMC7334295 DOI: 10.1016/j.csbj.2020.06.019] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 02/28/2020] [Revised: 06/09/2020] [Accepted: 06/10/2020] [Indexed: 01/21/2023] Open
Abstract
Dynamic virus-host interactions play a critical role in regulating microbial community structure and function. Yet for decades prior to the genomics era, viruses were largely overlooked in microbial ecology research, as only low-throughput culture-based methods of discovering viruses were available. With the advent of metagenomics, culture-independent techniques have provided exciting opportunities to discover and study new viruses. Here, we review recently developed computational methods for identifying viral sequences, exploring viral diversity in environmental samples, and predicting hosts from metagenomic sequence data. Methods to analyze viruses in silico utilize unconventional approaches to tackle challenges unique to viruses, such as vast diversity, mosaic viral genomes, and the lack of universal marker genes. As the field of viral ecology expands exponentially, computational advances have become increasingly important to gain insight into the role of viruses in diverse habitats.
Affiliation(s)
- Varada Khot
  - Department of Geoscience, University of Calgary, Calgary, AB T2N 1N4, Canada
- Marc Strous
  - Department of Geoscience, University of Calgary, Calgary, AB T2N 1N4, Canada
- Alyse K. Hawley
  - Department of Geoscience, University of Calgary, Calgary, AB T2N 1N4, Canada
|
23
|
Kieft K, Zhou Z, Anantharaman K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. MICROBIOME 2020; 8:90. [PMID: 32522236 PMCID: PMC7288430 DOI: 10.1186/s40168-020-00867-0] [Citation(s) in RCA: 497] [Impact Index Per Article: 99.4] [Received: 02/27/2020] [Accepted: 05/13/2020] [Indexed: 05/08/2023]
Abstract
BACKGROUND Viruses are central to microbial community structure in all environments. The ability to generate large metagenomic assemblies of mixed microbial and viral sequences provides the opportunity to tease apart complex microbiome dynamics, but these analyses are currently limited by the tools available for analyses of viral genomes and assessing their metabolic impacts on microbiomes. DESIGN Here we present VIBRANT, the first method to utilize a hybrid machine learning and protein similarity approach that is not reliant on sequence features for automated recovery and annotation of viruses, determination of genome quality and completeness, and characterization of viral community function from metagenomic assemblies. VIBRANT uses neural networks of protein signatures and a newly developed v-score metric that circumvents traditional boundaries to maximize identification of lytic viral genomes and integrated proviruses, including highly diverse viruses. VIBRANT highlights viral auxiliary metabolic genes and metabolic pathways, thereby serving as a user-friendly platform for evaluating viral community function. VIBRANT was trained and validated on reference virus datasets as well as microbiome and virome data. RESULTS VIBRANT showed superior performance in recovering higher quality viruses and concurrently reduced the false identification of non-viral genome fragments in comparison to other virus identification programs, specifically VirSorter, VirFinder, and MARVEL. When applied to 120,834 metagenome-derived viral sequences representing several human and natural environments, VIBRANT recovered an average of 94% of the viruses, whereas VirFinder, VirSorter, and MARVEL achieved less powerful performance, averaging 48%, 87%, and 71%, respectively. Similarly, VIBRANT identified more total viral sequence and proteins when applied to real metagenomes. 
When compared to PHASTER, Prophage Hunter, and VirSorter for the ability to extract integrated provirus regions from host scaffolds, VIBRANT performed comparably and even identified proviruses that the other programs did not. To demonstrate applications of VIBRANT, we studied viromes associated with Crohn's disease to show that specific viral groups, namely Enterobacteriales-like viruses, as well as putative dysbiosis associated viral proteins are more abundant compared to healthy individuals, providing a possible viral link to maintenance of diseased states. CONCLUSIONS The ability to accurately recover viruses and explore viral impacts on microbial community metabolism will greatly advance our understanding of microbiomes, host-microbe interactions, and ecosystem dynamics. Video Abstract.
Affiliation(s)
- Kristopher Kieft
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Zhichao Zhou
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Karthik Anantharaman
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, USA
|
24
|
Coming-of-Age Characterization of Soil Viruses: A User’s Guide to Virus Isolation, Detection within Metagenomes, and Viromics. SOIL SYSTEMS 2020. [DOI: 10.3390/soilsystems4020023] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Indexed: 01/02/2023]
Abstract
The study of soil viruses, though not new, has languished relative to the study of marine viruses. This is particularly due to challenges associated with separating virions from harboring soils. Generally, three approaches to analyzing soil viruses have been employed: (1) Isolation, to characterize virus genotypes and phenotypes, the primary method used prior to the start of the 21st century. (2) Metagenomics, which has revealed a vast diversity of viruses while also allowing insights into viral community ecology, although with limitations due to DNA from cellular organisms obscuring viral DNA. (3) Viromics (targeted metagenomics of virus-like-particles), which has provided a more focused development of ‘virus-sequence-to-ecology’ pipelines, a result of separation of presumptive virions from cellular organisms prior to DNA extraction. This separation permits greater sequencing emphasis on virus DNA and thereby more targeted molecular and ecological characterization of viruses. Employing viromics to characterize soil systems presents new challenges, however, ones that are only recently being addressed. Here we provide a guide to implementing these three approaches to studying environmental viruses, highlighting benefits, difficulties, and potential contamination, all toward fostering greater focus on viruses in the study of soil ecology.
|
25
|
Ren J, Song K, Deng C, Ahlgren NA, Fuhrman JA, Li Y, Xie X, Poplin R, Sun F. Identifying viruses from metagenomic data using deep learning. QUANTITATIVE BIOLOGY 2020; 8:64-77. [PMID: 34084563 PMCID: PMC8172088 DOI: 10.1007/s40484-019-0187-4] [Citation(s) in RCA: 277] [Impact Index Per Article: 55.4] [Received: 08/19/2019] [Revised: 10/08/2019] [Accepted: 10/14/2019] [Indexed: 01/08/2023]
Abstract
BACKGROUND The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data. METHODS Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning. RESULTS Trained based on sequences from viral RefSeq discovered before May 2015, and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences respectively. Enlarging the training data with additional millions of purified viral sequences from metavirome samples further improved the accuracy for identifying virus groups that are under-represented. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found associated with the cancer status, suggesting viruses may play important roles in CRC. CONCLUSIONS Powered by deep learning and high throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
Affiliation(s)
- Jie Ren
  - Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
- Kai Song
  - School of Mathematics and Statistics, Qingdao University, Qingdao 266071, China
- Chao Deng
  - Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
- Jed A. Fuhrman
  - Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Yi Li
  - Department of Computer Science, University of California, Irvine, CA 92697, USA
- Xiaohui Xie
  - Department of Computer Science, University of California, Irvine, CA 92697, USA
- Fengzhu Sun
  - Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
|