1
Zhao J, Both JP, Rodriguez-R LM, Konstantinidis KT. GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs. Nucleic Acids Res 2024; 52:e74. [PMID: 39011878 PMCID: PMC11381346 DOI: 10.1093/nar/gkae609] [Received: 11/15/2023] [Revised: 06/20/2024] [Accepted: 06/27/2024] [Indexed: 07/17/2024]
Abstract
Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.
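The k-mer hashing idea in the abstract above can be illustrated with a minimal sketch. It uses classic MinHash rather than the ProbMinHash, SuperMinHash, Densified MinHash or SetSketch variants the abstract names, and it omits the HNSW graph search entirely. The function names (`kmers`, `minhash_sketch`, `jaccard_estimate`) and the default parameters are hypothetical and are not taken from GSearch.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer under a given seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=4, num_hashes=64):
    """Classic MinHash: keep the minimum hash value per seeded hash function."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def jaccard_estimate(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching sketch slots."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)
```

Two genomes with similar k-mer content give sketches that agree in many slots, so the estimate approximates the Jaccard similarity of their k-mer sets without comparing the full sets. A tool of this kind would then index such sketches in a nearest-neighbor structure rather than comparing every pair.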
Affiliation(s)
- Jianshu Zhao
  - Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
- Luis M Rodriguez-R
  - School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA
  - Department of Microbiology, University of Innsbruck, Innsbruck, Austria
  - Digital Science Center (DiSC), University of Innsbruck, Innsbruck, Austria
- Konstantinos T Konstantinidis
  - Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
  - School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA
2
Ginnan NA, Custódio V, Gopaulchan D, Ford N, Salas-González I, Jones DD, Wells DM, Moreno Â, Castrillo G, Wagner MR. Persistent legacy effects on soil metagenomes facilitate plant adaptive responses to drought. bioRxiv 2024:2024.08.26.609769. [PMID: 39253412 PMCID: PMC11383273 DOI: 10.1101/2024.08.26.609769] [Indexed: 09/11/2024]
Abstract
Both chronic and acute drought alter the composition and physiology of soil microbiomes, with implications for globally important processes including carbon cycling and plant productivity. When water is scarce, selection favors microbes with thicker peptidoglycan cell walls, sporulation ability, and constitutive osmolyte production (Schimel, Balser, and Wallenstein 2007), but also the ability to degrade complex plant-derived polysaccharides, suggesting that the success of plants and that of microbes during drought are inextricably linked. However, communities vary enormously in their drought responses and subsequent interactions with plants. Hypothesized causes of this variation in drought resilience include soil texture, soil chemistry, and historical precipitation patterns that shaped the starting communities and their constituent species (Evans, Allison, and Hawkes 2022). Currently, the physiological and molecular mechanisms of microbial drought responses and microbe-dependent plant drought responses in diverse natural soils are largely unknown (de Vries et al. 2023). Here, we identify numerous microbial taxa, genes, and functions that characterize soil microbiomes with legacies of chronic water limitation. Soil microbiota from historically dry climates buffered plants from the negative effects of subsequent acute drought, but only for a wild grass species native to the same region, and not for domesticated maize. In particular, microbiota with a legacy of chronic water limitation altered the expression of a small subset of host genes in crown roots, which mediated the effect of acute drought on transpiration and intrinsic water use efficiency. Our results reveal how long-term exposure to water stress alters soil microbial communities at the metagenomic level, and demonstrate the resulting "legacy effects" on neighboring plants in unprecedented molecular and physiological detail.
3
Osburn ED, McBride SG, Bahram M, Strickland MS. Global patterns in the growth potential of soil bacterial communities. Nat Commun 2024; 15:6881. [PMID: 39128916 PMCID: PMC11317499 DOI: 10.1038/s41467-024-50382-1] [Received: 01/26/2024] [Accepted: 07/09/2024] [Indexed: 08/13/2024]
Abstract
Despite the growing catalogue of studies detailing the taxonomic and functional composition of soil bacterial communities, the life history traits of those communities remain largely unknown. This study analyzes a global dataset of soil metagenomes to explore environmental drivers of growth potential, a fundamental aspect of bacterial life history. We find that growth potential, estimated from codon usage statistics, was highest in forested biomes and lowest in arid latitudes. This indicates that bacterial productivity generally reflects ecosystem productivity globally. Accordingly, the strongest environmental predictors of growth potential were productivity indicators, such as distance to the equator, and soil properties that vary along productivity gradients, such as pH and carbon to nitrogen ratios. We also observe that growth potential was negatively correlated with the relative abundances of genes involved in carbohydrate metabolism, demonstrating tradeoffs between growth and resource acquisition in soil bacteria. Overall, we identify macroecological patterns in bacterial growth potential and link growth rates to soil carbon cycling.
Affiliation(s)
- Ernest D Osburn
  - Department of Plant and Soil Sciences, University of Kentucky, Lexington, KY, USA
  - Department of Soil and Water Systems, University of Idaho, Moscow, ID, USA
- Mohammad Bahram
  - Department of Ecology, Swedish University of Agricultural Sciences, Uppsala, Sweden
  - Institute of Ecology and Earth Sciences, University of Tartu, Tartu, Estonia
  - Department of Agroecology, Aarhus University, Slagelse, Denmark
4
Fromm A, Hevroni G, Vincent F, Schatz D, Martinez-Gutierrez CA, Aylward FO, Vardi A. Single-cell RNA-seq of the rare virosphere reveals the native hosts of giant viruses in the marine environment. Nat Microbiol 2024; 9:1619-1629. [PMID: 38605173 PMCID: PMC11265207 DOI: 10.1038/s41564-024-01669-y] [Received: 07/14/2023] [Accepted: 03/07/2024] [Indexed: 04/13/2024]
Abstract
Giant viruses (phylum Nucleocytoviricota) are globally distributed in aquatic ecosystems. They play fundamental roles as evolutionary drivers of eukaryotic plankton and regulators of global biogeochemical cycles. However, we lack knowledge about their native hosts, hindering our understanding of their life cycle and ecological importance. In the present study, we applied a single-cell RNA sequencing (scRNA-seq) approach to samples collected during an induced algal bloom, which enabled pairing active giant viruses with their native protist hosts. We detected hundreds of single cells from multiple host lineages infected by diverse giant viruses. These host cells included members of the algal groups Chrysophyceae and Prymnesiophyceae, as well as heterotrophic flagellates in the class Katablepharidaceae. Katablepharids were infected with a rare Imitervirales-07 giant virus lineage expressing a large repertoire of cell-fate regulation genes. Analysis of the temporal dynamics of these host-virus interactions revealed an important role for Imitervirales-07 in controlling the size of the host Katablepharid population. Our results demonstrate that scRNA-seq can be used to identify previously undescribed host-virus interactions and study their ecological importance and impact.
Affiliation(s)
- Amir Fromm
  - Department of Plant and Environmental Sciences, Weizmann Institute of Science, Rehovot, Israel
- Gur Hevroni
  - Department of Plant and Environmental Sciences, Weizmann Institute of Science, Rehovot, Israel
  - Google Geo, Tel Aviv, Israel
- Flora Vincent
  - Department of Plant and Environmental Sciences, Weizmann Institute of Science, Rehovot, Israel
  - Developmental Biology Unit, European Molecular Biological Laboratory, Heidelberg, Germany
- Daniella Schatz
  - Department of Plant and Environmental Sciences, Weizmann Institute of Science, Rehovot, Israel
- Frank O Aylward
  - Department of Biological Sciences, Virginia Tech, Blacksburg, VA, USA
  - Center for Emerging, Zoonotic, and Arthropod-Borne Pathogens, Virginia Tech, Blacksburg, VA, USA
- Assaf Vardi
  - Department of Plant and Environmental Sciences, Weizmann Institute of Science, Rehovot, Israel
5
Figueroa JL, Redinbo A, Panyala A, Colby S, Friesen ML, Tiemann L, White RA. MerCat2: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from omics data. Bioinformatics Advances 2024; 4:vbae061. [PMID: 38745763 PMCID: PMC11090762 DOI: 10.1093/bioadv/vbae061] [Received: 09/01/2023] [Revised: 02/28/2024] [Accepted: 04/22/2024] [Indexed: 05/16/2024]
Abstract
Motivation: MerCat2 ("Mer-Catenate2") is a versatile, parallel, scalable and modular property software package for robustly analyzing features in omics data. Using massively parallel sequencing raw reads, assembled contigs, and protein sequences from any platform as input, MerCat2 performs k-mer counting of any length k, resulting in feature abundance counts tables, quality control reports, protein feature metrics, and graphical representation (i.e. principal component analysis (PCA)).
Results: MerCat2 allows for direct analysis of data properties in a database-independent manner that initializes all data, which other profilers and assembly-based methods cannot perform. MerCat2 represents an integrated tool to illuminate omics data within a sample for rapid cross-examination and comparisons.
Availability and implementation: MerCat2 is written in Python and distributed under a BSD-3 license. The source code of MerCat2 is freely available at https://github.com/raw-lab/mercat2. MerCat2 is compatible with Python 3 on Mac OS X and Linux. MerCat2 can also be easily installed using bioconda: mamba create -n mercat2 -c conda-forge -c bioconda mercat2.
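As a rough illustration of the k-mer counting step described in the abstract above, the following minimal sketch tallies every overlapping k-mer in a single sequence. It is not MerCat2's implementation, and the function name `count_kmers` is hypothetical. The sketch assumes a plain in-memory string and ignores parallelism, file parsing, and quality control.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping length-k substring of seq (case-insensitive)."""
    seq = seq.upper()
    # A sequence shorter than k yields an empty range and thus an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    counts = count_kmers("ATATAT", 2)
    for kmer, n in sorted(counts.items()):
        print(kmer, n)
```

Counts of this kind, collected per sample, form the rows of a feature abundance table that can then be compared across samples.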
Affiliation(s)
- Jose L Figueroa
  - Department of Bioinformatics and Genomics, North Carolina Research Center (NCRC), The University of North Carolina at Charlotte, Kannapolis, NC 28081, United States
  - Department of Bioinformatics and Genomics, Computational Intelligence to Predict Health and Environmental Risks (CIPHER), The University of North Carolina at Charlotte, Charlotte, NC 28223, United States
- Andrew Redinbo
  - Department of Bioinformatics and Genomics, North Carolina Research Center (NCRC), The University of North Carolina at Charlotte, Kannapolis, NC 28081, United States
  - Department of Bioinformatics and Genomics, Computational Intelligence to Predict Health and Environmental Risks (CIPHER), The University of North Carolina at Charlotte, Charlotte, NC 28223, United States
- Ajay Panyala
  - High Performance Computing (HPC) Group, Pacific Northwest National Laboratory, Richland, WA 99352, United States
- Sean Colby
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA 99352, United States
- Maren L Friesen
  - Department of Plant Pathology, Washington State University, Pullman, WA 99163, United States
- Lisa Tiemann
  - Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI 48824, United States
- Richard Allen White
  - Department of Bioinformatics and Genomics, North Carolina Research Center (NCRC), The University of North Carolina at Charlotte, Kannapolis, NC 28081, United States
  - Department of Bioinformatics and Genomics, Computational Intelligence to Predict Health and Environmental Risks (CIPHER), The University of North Carolina at Charlotte, Charlotte, NC 28223, United States
6
Figueroa III JL, Dhungel E, Bellanger M, Brouwer CR, White III RA. MetaCerberus: distributed highly parallelized HMM-based processing for robust functional annotation across the tree of life. Bioinformatics 2024; 40:btae119. [PMID: 38426351 PMCID: PMC10955254 DOI: 10.1093/bioinformatics/btae119] [Received: 09/07/2023] [Revised: 01/22/2024] [Accepted: 02/27/2024] [Indexed: 03/02/2024]
Abstract
Motivation: MetaCerberus is a massively parallel, fast, low-memory, scalable annotation tool for inferring gene function across genomes to metacommunities. MetaCerberus provides an elusive HMM/HMMER-based tool at a rapid scale with low memory. It offers scalable gene elucidation to major public databases, including KEGG (KO), COGs, CAZy, FOAM, and specific databases for viruses, including VOGs and PHROGs, from single genomes to metacommunities.
Results: MetaCerberus is 1.3× as fast as eggNOG-mapper v2 on a single node while using 5× less memory in an exclusively HMM/HMMER mode. In a direct comparison, MetaCerberus provides better annotation of viruses, phages, and archaeal viruses than DRAM, Prokka, or InterProScan. MetaCerberus annotates more KOs across domains when compared to DRAM, with a 186× smaller database, and with 63× less memory. MetaCerberus is fully integrated for automatic analysis of statistics and pathways using differential statistic tools (i.e. DESeq2 and edgeR), pathway enrichment (GAGE R), and pathview R. MetaCerberus provides a novel tool for unlocking the biosphere across the tree of life at scale.
Availability and implementation: MetaCerberus is written in Python and distributed under a BSD-3 license. The source code of MetaCerberus is freely available at https://github.com/raw-lab/metacerberus. It is compatible with Python 3 and works on both Mac OS X and Linux. MetaCerberus can also be easily installed using bioconda: mamba create -n metacerberus -c bioconda -c conda-forge metacerberus.
Affiliation(s)
- Jose L Figueroa III
  - North Carolina Research Campus (NCRC), Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Kannapolis, NC 28081, United States
  - Computational Intelligence to Predict Health and Environmental Risks (CIPHER) Research Center, Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC 28223, United States
- Eliza Dhungel
  - North Carolina Research Campus (NCRC), Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Kannapolis, NC 28081, United States
- Madeline Bellanger
  - North Carolina Research Campus (NCRC), Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Kannapolis, NC 28081, United States
  - Computational Intelligence to Predict Health and Environmental Risks (CIPHER) Research Center, Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC 28223, United States
- Cory R Brouwer
  - North Carolina Research Campus (NCRC), Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Kannapolis, NC 28081, United States
- Richard Allen White III
  - North Carolina Research Campus (NCRC), Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Kannapolis, NC 28081, United States
  - Computational Intelligence to Predict Health and Environmental Risks (CIPHER) Research Center, Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC 28223, United States