1
Jang CS, Kim H, Kim D, Han B. MicroPredict: predicting species-level taxonomic abundance of whole-shotgun metagenomic data using only 16S amplicon sequencing data. Genes Genomics 2024; 46:701-712. [PMID: 38700829 PMCID: PMC11102407 DOI: 10.1007/s13258-024-01514-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2024] [Accepted: 03/26/2024] [Indexed: 05/19/2024]
Abstract
BACKGROUND: The importance of the human microbiome in the analysis of various diseases is emerging. The two main methods used to profile the human microbiome are 16S rRNA gene sequencing (16S sequencing) and whole-genome shotgun sequencing (WGS). Owing to the full coverage of the genome in sequencing, WGS has multiple advantages over 16S sequencing, including higher taxonomic profiling resolution at the species level and functional profiling analysis. However, 16S sequencing remains widely used because of its relatively low cost. Although WGS is the standard method for obtaining accurate species-level data, we found that 16S sequencing data contained rich information to predict high-resolution species-level abundances with reasonable accuracy.
OBJECTIVE: In this study, we proposed MicroPredict, a method for accurately predicting WGS-comparable species-level abundance data using 16S taxonomic profile data.
METHODS: We employed a mixed model using two key strategies: (1) modeling both sample- and species-specific information for predicting WGS abundances, and (2) accounting for the possible correlations among different species.
RESULTS: We found that MicroPredict outperformed the other machine learning methods.
CONCLUSION: We expect that our approach will help researchers accurately approximate the species-level abundances of microbiome profiles in datasets for which only cost-effective 16S sequencing has been applied.
Affiliation(s)
- Chloe Soohyun Jang
- Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, South Korea
- Hakin Kim
- Interdisciplinary Program for Bioengineering, Seoul National University, Seoul, South Korea
- Donghyun Kim
- Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, South Korea
- Buhm Han
- Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, South Korea.
- Interdisciplinary Program for Bioengineering, Seoul National University, Seoul, South Korea.
2
Tian Q, Zhang P, Zhai Y, Wang Y, Zou Q. Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data. Genome Biol Evol 2024; 16:evae102. [PMID: 38748485 PMCID: PMC11135637 DOI: 10.1093/gbe/evae102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/12/2024] [Indexed: 05/30/2024] Open
Abstract
The advent of high-throughput sequencing technologies has not only revolutionized the field of bioinformatics but has also heightened the demand for efficient taxonomic classification. Despite technological advancements, efficiently processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches primarily fall into two categories, database-based methods and machine learning methods, each presenting its own set of challenges and advantages. On this basis, the aim of our study was to conduct a comparative analysis between these two methods while also investigating the merits of integrating multiple database-based methods. Through an in-depth comparative study, we evaluated the performance of both methodological categories in taxonomic classification by utilizing simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Conversely, while machine learning methods show superior performance in scenarios where reference sequences are sparse or lacking, they generally show inferior performance compared with database methods under most conditions. Moreover, our study confirms that integrating multiple database-based methods does, in fact, enhance classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. For those interested in further exploring our methods, the source code of this study is publicly available on https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. Additionally, a dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/.
Affiliation(s)
- Qinzhong Tian
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
- Pinglu Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
- Yixiao Zhai
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
- Yansu Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
3
Walsh LH, Coakley M, Walsh AM, O'Toole PW, Cotter PD. Bioinformatic approaches for studying the microbiome of fermented food. Crit Rev Microbiol 2023; 49:693-725. [PMID: 36287644 DOI: 10.1080/1040841x.2022.2132850] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 05/11/2022] [Revised: 08/11/2022] [Accepted: 09/28/2022] [Indexed: 11/03/2022]
Abstract
High-throughput DNA sequencing-based approaches continue to revolutionise our understanding of microbial ecosystems, including those associated with fermented foods. Metagenomic and metatranscriptomic approaches are state-of-the-art biological profiling methods and are employed to investigate a wide variety of characteristics of microbial communities, such as taxonomic membership, gene content and the range and level at which these genes are expressed. Individual groups and consortia of researchers are utilising these approaches to produce increasingly large and complex datasets, representing vast populations of microorganisms. There is a corresponding requirement for the development and application of appropriate bioinformatic tools and pipelines to interpret this data. This review critically analyses the tools and pipelines that have been used or that could be applied to the analysis of metagenomic and metatranscriptomic data from fermented foods. In addition, we critically analyse a number of studies of fermented foods in which these tools have previously been applied, to highlight the insights that these approaches can provide.
Affiliation(s)
- Liam H Walsh
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
- School of Microbiology, University College Cork, Ireland
- Mairéad Coakley
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
- Aaron M Walsh
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
- Paul W O'Toole
- School of Microbiology, University College Cork, Ireland
- APC Microbiome Ireland, University College Cork, Ireland
- Paul D Cotter
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
- APC Microbiome Ireland, University College Cork, Ireland
- VistaMilk SFI Research Centre, Teagasc, Moorepark, Fermoy, Cork, Ireland
4
Budiš J, Krampl W, Kucharík M, Hekel R, Goga A, Sitarčík J, Lichvár M, Smol’ak D, Böhmer M, Baláž A, Ďuriš F, Gazdarica J, Šoltys K, Turňa J, Radvánszky J, Szemes T. SnakeLines: integrated set of computational pipelines for sequencing reads. J Integr Bioinform 2023; 20:jib-2022-0059. [PMID: 37602733 PMCID: PMC10757078 DOI: 10.1515/jib-2022-0059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2022] [Accepted: 03/21/2023] [Indexed: 08/22/2023] Open
Abstract
With the rapid growth of massively parallel sequencing technologies, ever more laboratories are utilising sequenced DNA fragments for genomic analyses. Interpretation of sequencing data is, however, strongly dependent on bioinformatics processing, which is often too demanding for clinicians and researchers without a computational background. Another problem is the reproducibility of computational analyses across separate computational centres with inconsistent versions of installed libraries and bioinformatics tools. We propose an easily extensible set of computational pipelines, called SnakeLines, for processing sequencing reads, including mapping, assembly, variant calling, viral identification, transcriptomics, and metagenomics analysis. Individual steps of an analysis, along with methods and their parameters, can be readily modified in a single configuration file. The provided pipelines are embedded in virtual environments that ensure isolation of required resources from the host operating system, rapid deployment, and reproducibility of analysis across different Unix-based platforms. SnakeLines is a powerful framework for the automation of bioinformatics analyses, with emphasis on simple set-up, modification, extensibility, and reproducibility. The framework is already routinely used in various research projects and their applications, especially in the Slovak national surveillance of SARS-CoV-2.
Affiliation(s)
- Jaroslav Budiš
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Slovak Centre of Scientific and Technical Information, 811 04 Bratislava, Slovakia
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Werner Krampl
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, 841 04 Bratislava, Slovakia
- Marcel Kucharík
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Rastislav Hekel
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Slovak Centre of Scientific and Technical Information, 811 04 Bratislava, Slovakia
- Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, 841 04 Bratislava, Slovakia
- Adrián Goga
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, 841 04 Bratislava, Slovakia
- Jozef Sitarčík
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Slovak Centre of Scientific and Technical Information, 811 04 Bratislava, Slovakia
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Michal Lichvár
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Dávid Smol’ak
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, 841 04 Bratislava, Slovakia
- Miroslav Böhmer
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, 841 04 Bratislava, Slovakia
- Andrej Baláž
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University, 841 04 Bratislava, Slovakia
- František Ďuriš
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Slovak Centre of Scientific and Technical Information, 811 04 Bratislava, Slovakia
- Juraj Gazdarica
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Slovak Centre of Scientific and Technical Information, 811 04 Bratislava, Slovakia
- Katarína Šoltys
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, 841 04 Bratislava, Slovakia
- Ján Turňa
- Slovak Centre of Scientific and Technical Information, 811 04 Bratislava, Slovakia
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, 841 04 Bratislava, Slovakia
- Ján Radvánszky
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Institute of Clinical and Translational Research, Biomedical Research Center, Slovak Academy of Sciences, 845 05 Bratislava, Slovakia
- Tomáš Szemes
- Geneton Ltd., 841 04 Bratislava, Slovakia
- Comenius University Science Park, 841 04 Bratislava, Slovakia
- Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, 841 04 Bratislava, Slovakia
5
Shen K, Din AU, Sinha B, Zhou Y, Qian F, Shen B. Translational informatics for human microbiota: data resources, models and applications. Brief Bioinform 2023; 24:7152256. [PMID: 37141135 DOI: 10.1093/bib/bbad168] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/26/2022] [Revised: 04/07/2023] [Accepted: 04/11/2023] [Indexed: 05/05/2023] Open
Abstract
With the rapid development of human intestinal microbiology and diverse microbiome-related studies and investigations, a large amount of data have been generated and accumulated. Meanwhile, different computational and bioinformatics models have been developed for pattern recognition and knowledge discovery using these data. Given the heterogeneity of these resources and models, we aimed to provide a landscape of the data resources, a comparison of the computational models and a summary of the translational informatics applied to microbiota data. We first review the existing databases, knowledge bases, knowledge graphs and standardizations of microbiome data. Then, the high-throughput sequencing techniques for the microbiome and the informatics tools for their analyses are compared. Finally, translational informatics for the microbiome, including biomarker discovery, personalized treatment and smart healthcare for complex diseases, are discussed.
Affiliation(s)
- Ke Shen
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Ahmad Ud Din
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Baivab Sinha
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Yi Zhou
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Fuliang Qian
- Center for Systems Biology, Suzhou Medical College of Soochow University, Suzhou 215123, China
- Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Suzhou 215123, China
- Bairong Shen
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
6
Abstract
Experiments involving metagenomics data are becoming increasingly commonplace. Processing such data requires a unique set of considerations. Quality control of metagenomics data is critical to extracting pertinent insights. In this chapter, we outline some considerations in terms of study design and other confounding factors that can often only be realized at the point of data analysis. We also outline some basic principles of quality control in metagenomics, including overall reproducibility and some good practices to follow. The general quality control of sequencing data is then outlined, and we introduce ways to process this data by using bash scripts and developing pipelines in Snakemake (Python). A significant part of quality control in metagenomics is in analyzing the data to ensure you can spot relationships between variables and to identify when they might be confounded. This chapter provides a walkthrough of analyzing some microbiome data (in the R statistical language) and demonstrates a few ways to identify overall differences and similarities in microbiome data. The chapter concludes with remarks on considering taxonomic results in the context of the study and interrogating sequence alignments using the command line.
Affiliation(s)
- Abraham Gihawi
- Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK
- Ryan Cardenas
- Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK
- Rachel Hurst
- Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK
- Daniel S Brewer
- Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK.
- Earlham Institute, Norwich Research Park, Norwich, UK.
7
Garrido-Sanz L, Àngel Senar M, Piñol J. Drastic reduction of false positive species in samples of insects by intersecting the default output of two popular metagenomic classifiers. PLoS One 2022; 17:e0275790. [PMID: 36282811 PMCID: PMC9595558 DOI: 10.1371/journal.pone.0275790] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/10/2022] [Accepted: 09/15/2022] [Indexed: 11/19/2022] Open
Abstract
The use of high-throughput sequencing to recover short DNA reads of many species has been widely applied in biodiversity studies, either as amplicon metabarcoding or shotgun metagenomics. These reads are assigned to taxa using classifiers. However, for different reasons, the results often contain many false positives. Here we focus on the reduction of false positive species attributable to the classifiers. We benchmarked two popular classifiers, BLASTn followed by MEGAN6 (BM) and Kraken2 (K2), to analyse shotgun-sequenced artificial single-species samples of insects. To reduce the number of misclassified reads, we combined the output of the two classifiers in two different ways: (1) by keeping only the reads that were attributed to the same species by both classifiers (intersection approach); and (2) by keeping the reads assigned to some species by any classifier (union approach). In addition, we applied an analytical detection limit to further reduce the number of false positive species. As expected, both metagenomic classifiers used with default parameters generated an unacceptably high number of misidentified species (tens with BM, hundreds with K2). The false positive species were not necessarily phylogenetically close, as some of them belonged to different orders of insects. The union approach failed to reduce the number of false positives, but the intersection approach removed most of them. The addition of an analytical detection limit of 0.001 further reduced the number to ca. 0.5 false positive species per sample. The misidentification of species by most classifiers undermines confidence in DNA-based methods for assessing the biodiversity of biological samples. Our approach to alleviate the problem is straightforward and significantly reduced the number of reported false positive species.
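The intersection and union approaches described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, each classifier's output is assumed to have been parsed already into a dictionary mapping read IDs to species names, and the detection limit is assumed to be a minimum fraction of the retained reads (the abstract gives the value 0.001 but not its exact definition).

```python
from collections import Counter
from typing import Dict, Set


def intersect_calls(calls_a: Dict[str, str], calls_b: Dict[str, str]) -> Dict[str, str]:
    """Intersection approach: keep reads that both classifiers assign to the same species."""
    return {read: sp for read, sp in calls_a.items() if calls_b.get(read) == sp}


def union_species(calls_a: Dict[str, str], calls_b: Dict[str, str]) -> Set[str]:
    """Union approach: every species reported by at least one classifier."""
    return set(calls_a.values()) | set(calls_b.values())


def species_above_limit(calls: Dict[str, str], limit: float = 0.001) -> Set[str]:
    """Keep species whose share of the retained reads reaches the limit.

    Assumption: the detection limit is a fraction of retained reads.
    """
    counts = Counter(calls.values())
    total = sum(counts.values())
    if total == 0:
        return set()
    return {sp for sp, n in counts.items() if n / total >= limit}


# Toy per-read assignments (hypothetical data, not from the study).
bm_calls = {"r1": "A", "r2": "A", "r3": "B", "r4": "C"}
k2_calls = {"r1": "A", "r2": "A", "r3": "D", "r5": "E"}

agreed = intersect_calls(bm_calls, k2_calls)
print(sorted(species_above_limit(agreed)))  # prints ['A']
```

On this toy input the union reports five species while the intersection keeps only the one both classifiers agree on, which mirrors the direction of the effect the abstract reports.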
Affiliation(s)
- Lidia Garrido-Sanz
- Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Spain
- Josep Piñol
- Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Spain
- CREAF, Cerdanyola del Vallès, Spain
8
Bhattacharya C, Tierney BT, Ryon KA, Bhattacharyya M, Hastings JJA, Basu S, Bhattacharya B, Bagchi D, Mukherjee S, Wang L, Henaff EM, Mason CE. Supervised Machine Learning Enables Geospatial Microbial Provenance. Genes (Basel) 2022; 13:1914. [PMID: 36292799 PMCID: PMC9601318 DOI: 10.3390/genes13101914] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/16/2022] [Revised: 10/14/2022] [Accepted: 10/18/2022] [Indexed: 11/04/2022] Open
Abstract
The recent increase in publicly available metagenomic datasets with geospatial metadata has made it possible to determine location-specific microbial fingerprints from around the world. Such fingerprints can be useful for comparing microbial niches for environmental research, as well as for applications within forensic science and public health. To determine the regional specificity for environmental metagenomes, we examined 4305 shotgun-sequenced samples from the MetaSUB Consortium dataset, the most extensive public collection of urban microbiomes, spanning 60 different cities, 30 countries, and 6 continents. We were able to identify city-specific microbial fingerprints using supervised machine learning (SML) on the taxonomic classifications, and we also compared the performance of ten SML classifiers. We then further evaluated the five algorithms with the highest accuracy, with the city and continental accuracy ranging from 85-89% and 90-94%, respectively. Thereafter, we used these results to develop Cassandra, a random-forest-based classifier that identifies bioindicator species to aid in fingerprinting and can infer higher-order microbial interactions at each site. We further tested the Cassandra algorithm on the Tara Oceans dataset, the largest collection of marine-based microbial genomes, where it classified the oceanic sample locations with 83% accuracy. These results and code show the utility of SML methods and Cassandra to identify bioindicator species across both oceanic and urban environments, which can help guide ongoing efforts in biotracing, environmental monitoring, and microbial forensics (MF).
Affiliation(s)
- Chandrima Bhattacharya
- Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine, New York, NY 10065, USA
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065, USA
- Integrated Design and Media, Center for Urban Science and Progress, NYU Tandon School of Engineering, Brooklyn, New York, NY 11201, USA
- Braden T. Tierney
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065, USA
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Krista A. Ryon
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065, USA
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Malay Bhattacharyya
- Center for Artificial Intelligence and Machine Learning, Indian Statistical Institute, Kolkata 700108, India
- Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
- Jaden J. A. Hastings
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065, USA
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Srijani Basu
- Department of Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Bodhisatwa Bhattacharya
- Department of Electrical and Electronics Engineering, Birla Institute of Technology, Mesra, Ranchi 835215, India
- Debneel Bagchi
- Department of Metallurgy & Materials Engineering, Indian Institute of Engineering Science & Technology, Shibpur, Howrah 711103, India
- Somsubhro Mukherjee
- Department of Biological Sciences, National University of Singapore, Singapore 117558, Singapore
- Lu Wang
- Department of Biological Sciences, National University of Singapore, Singapore 117558, Singapore
- Elizabeth M. Henaff
- Integrated Design and Media, Center for Urban Science and Progress, NYU Tandon School of Engineering, Brooklyn, New York, NY 11201, USA
- Christopher E. Mason
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065, USA
- Integrated Design and Media, Center for Urban Science and Progress, NYU Tandon School of Engineering, Brooklyn, New York, NY 11201, USA
- WorldQuant Initiative for Quantitative Prediction, Weill Cornell Medicine, New York, NY 10065, USA
9
Bartoszewicz JM, Nasri F, Nowicka M, Renard BY. Detecting DNA of novel fungal pathogens using ResNets and a curated fungi-hosts data collection. Bioinformatics 2022; 38:ii168-ii174. [PMID: 36124807 DOI: 10.1093/bioinformatics/btac495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 07/08/2022] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND: Emerging pathogens are a growing threat, but large data collections and approaches for predicting the risk associated with novel agents are limited to bacteria and viruses. Pathogenic fungi, which also pose a constant threat to public health, remain understudied. Relevant data remain comparatively scarce and scattered among many different sources, hindering the development of sequencing-based detection workflows for novel fungal pathogens. No prediction method working for agents across all three groups is available, even though the cause of an infection is often difficult to identify from symptoms alone.
RESULTS: We present a curated collection of fungal host range data, comprising records on human, animal and plant pathogens, as well as other plant-associated fungi, linked to publicly available genomes. We show that it can be used to predict the pathogenic potential of novel fungal species directly from DNA sequences with either sequence homology or deep learning. We develop learned, numerical representations of the collected genomes and visualize the landscape of fungal pathogenicity. Finally, we train multi-class models predicting if next-generation sequencing reads originate from novel fungal, bacterial or viral threats.
CONCLUSIONS: The neural networks trained using our data collection enable accurate detection of novel fungal pathogens. A curated set of over 1400 genomes with host and pathogenicity metadata supports training of machine-learning models and sequence comparison, not limited to the pathogen detection task.
AVAILABILITY AND IMPLEMENTATION: The data, models and code are hosted at https://zenodo.org/record/5846345, https://zenodo.org/record/5711877 and https://gitlab.com/dacs-hpi/deepac.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jakub M Bartoszewicz
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany
- Ferdous Nasri
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany
- Melania Nowicka
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany
- Bernhard Y Renard
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
10
PathoLive—Real-Time Pathogen Identification from Metagenomic Illumina Datasets. Life (Basel) 2022; 12:life12091345. [PMID: 36143382 PMCID: PMC9505849 DOI: 10.3390/life12091345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Revised: 08/24/2022] [Accepted: 08/24/2022] [Indexed: 11/18/2022] Open
Abstract
Over the past years, NGS has become a crucial workhorse for open-view pathogen diagnostics. Yet, long turnaround times result from using massively parallel high-throughput technologies, as the analysis can only be performed after sequencing has finished. The interpretation of results can further be challenged by contaminations, clinically irrelevant sequences, and the sheer amount and complexity of the data. We implemented PathoLive, a real-time diagnostics pipeline for the detection of pathogens from clinical samples hours before sequencing has finished. Based on real-time alignment with HiLive2, mappings are scored with respect to common contaminations, low-entropy areas, and sequences of widespread, non-pathogenic organisms. The results are visualized using an interactive taxonomic tree that provides an easily interpretable overview of the relevance of hits. For a human plasma sample that was spiked in vitro with six pathogenic viruses, all agents were clearly detected after only 40 of 200 sequencing cycles. For a real-world sample from Sudan, the results correctly indicated the presence of Crimean-Congo hemorrhagic fever virus. In a second real-world dataset from the 2019 SARS-CoV-2 outbreak in Wuhan, we found the presence of a SARS coronavirus as the most relevant hit without the novel virus reference genome being included in the database. For all samples, clinically irrelevant hits were correctly de-emphasized. Our approach is valuable for obtaining fast and accurate NGS-based pathogen identifications and for correctly prioritizing and visualizing them based on their clinical significance. PathoLive is open source and available on GitLab and BioConda.
|
11
|
Crowdsourced benchmarking of taxonomic metagenome profilers: lessons learned from the sbv IMPROVER Microbiomics challenge. BMC Genomics 2022; 23:624. [PMID: 36042406 PMCID: PMC9429340 DOI: 10.1186/s12864-022-08803-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/16/2022] [Accepted: 07/25/2022] [Indexed: 11/10/2022]
Abstract
Background Selection of optimal computational strategies for analyzing metagenomics data is a decisive step in determining the microbial composition of a sample, and this procedure is complex because of the numerous tools currently available. The aim of this research was to summarize the results of the crowdsourced sbv IMPROVER Microbiomics Challenge, designed to evaluate the performance of off-the-shelf metagenomics software, and to investigate the robustness of these results through an extended post-challenge analysis. In total, 21 off-the-shelf taxonomic metagenome profiling pipelines were benchmarked for their capacity to identify the microbiome composition at various taxon levels across 104 shotgun metagenomics datasets of bacterial genomes (representative of various microbiome samples) from public databases. Performance was determined by comparing predicted taxonomy profiles with the gold standard. Results Most taxonomic profilers performed homogeneously well at the phylum level but generated intermediate and heterogeneous scores at the genus and species levels, respectively. kmer-based pipelines using Kraken with and without Bracken or using CLARK-S performed best overall, but they exhibited lower precision than the two marker-gene-based methods MetaPhlAn and mOTU. Filtering out the 1% least abundant species, which were not reliably predicted, helped increase the performance of most profilers by increasing precision, but at the cost of recall. However, the use of adaptive filtering thresholds determined from the sample’s Shannon index increased the performance of most kmer-based profilers while mitigating the tradeoff between precision and recall. Conclusions kmer-based metagenomic pipelines using Kraken/Bracken or CLARK-S performed most robustly across a large variety of microbiome datasets. Removing non-reliably predicted low-abundance species by using diversity-dependent adaptive filtering thresholds further enhanced the performance of these tools. This work demonstrates the applicability of computational pipelines for accurately determining taxonomic profiles in clinical and environmental contexts and exemplifies the power of crowdsourcing for unbiased evaluation. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08803-2.
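The idea of a diversity-dependent filtering threshold can be sketched in a few lines of Python. The abstract does not say how the threshold is derived from the Shannon index, so the rule used here (`cutoff = base_threshold / (1 + H)`), the function names, and the toy profile are all assumptions made purely for illustration, not the challenge's actual procedure.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero abundance."""
    total = sum(abundances.values())
    props = [a / total for a in abundances.values() if a > 0]
    return -sum(p * math.log(p) for p in props)


def adaptive_filter(abundances, base_threshold=0.01):
    """Drop taxa whose relative abundance is below a diversity-dependent cutoff.

    Hypothetical rule (not taken from the paper): the cutoff shrinks as the
    Shannon index H grows, cutoff = base_threshold / (1 + H).
    Returns the renormalized profile and the cutoff that was applied.
    """
    total = sum(abundances.values())
    cutoff = base_threshold / (1.0 + shannon_index(abundances))
    kept = {t: a for t, a in abundances.items() if a / total >= cutoff}
    kept_total = sum(kept.values())
    return {t: a / kept_total for t, a in kept.items()}, cutoff


# Toy read counts per species (made-up numbers).
profile = {"A": 600, "B": 390, "C": 8, "D": 2}
filtered, cutoff = adaptive_filter(profile)
```

With a fixed 1% cutoff both rare taxa in this toy profile would be removed; under the assumed adaptive rule only the rarest one is, which is the kind of precision/recall trade-off the abstract describes.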
|
12
|
Growth promotion and antibiotic induced metabolic shifts in the chicken gut microbiome. Commun Biol 2022; 5:293. [PMID: 35365748 PMCID: PMC8975857 DOI: 10.1038/s42003-022-03239-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 10/08/2021] [Accepted: 03/08/2022] [Indexed: 02/07/2023]
Abstract
Antimicrobial growth promoters (AGP) have played a decisive role in animal agriculture for over half a century. Despite mounting concerns about antimicrobial resistance and demand for antibiotic alternatives, a thorough understanding of how these compounds drive performance is missing. Here we investigate the functional footprint of microbial communities in the cecum of chickens fed four distinct AGP. We find relatively few taxa, metabolic or antimicrobial resistance genes similarly altered across treatments, with those changes often driven by the abundances of core microbiome members. Constraints-based modeling of 25 core bacterial genera associated increased performance with fewer metabolite demands for microbial growth, pointing to altered nitrogen utilization as a potential mechanism of narasin, the AGP with the largest performance increase in our study. Untargeted metabolomics of narasin treated birds aligned with model predictions, suggesting that the core cecum microbiome might be targeted for enhanced performance via its contribution to host-microbiota metabolic crosstalk. This study compares the functional profiles of the cecal microbiome among chickens fed four different antimicrobial growth promoters. Chickens receiving narasin exhibited the largest performance increase via apparent nitrogen recycling by the core cecal microbiome.
|
13
|
Abstract
MOTIVATION Nanopore sequencers allow targeted sequencing of interesting nucleotide sequences by rejecting other sequences from individual pores. This feature facilitates the enrichment of low-abundant sequences by depleting overrepresented ones in silico. Existing tools for adaptive sampling either apply signal alignment, which cannot handle human-sized reference sequences, or apply read mapping in sequence space, relying on fast graphics processing unit (GPU) base callers for real-time read rejection. Using nanopore long-read mapping tools is also not optimal when mapping shorter reads, as usually analyzed in adaptive sampling applications. RESULTS Here, we present a new approach for nanopore adaptive sampling that combines fast CPU and GPU base calling with read classification based on Interleaved Bloom Filters. ReadBouncer improves the potential enrichment of low abundance sequences by its high read classification sensitivity and specificity, outperforming existing tools in the field. It robustly removes even reads belonging to large reference sequences while running on commodity hardware without GPUs, making adaptive sampling accessible for in-field researchers. ReadBouncer also provides a user-friendly interface and installer files for end-users without a bioinformatics background. AVAILABILITY AND IMPLEMENTATION The C++ source code is available at https://gitlab.com/dacs-hpi/readbouncer. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
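To make the filter-based classification step more concrete, here is a minimal Python sketch of k-mer membership testing with a plain Bloom filter. It is not ReadBouncer's implementation: the tool uses Interleaved Bloom Filters in C++, whereas this sketch uses a single ordinary Bloom filter, ignores reverse complements, and picks `k`, the filter size, and the `min_fraction` threshold arbitrarily. All class and function names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (a simplification, not an interleaved one)."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several hash functions by salting BLAKE2b with a small seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield all overlapping substrings of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_filter(reference, k=5):
    """Insert every k-mer of the reference sequence into a Bloom filter."""
    bf = BloomFilter()
    for kmer in kmers(reference, k):
        bf.add(kmer)
    return bf


def matches_reference(read, bf, k=5, min_fraction=0.8):
    """Classify a read as matching if enough of its k-mers are found in the filter."""
    read_kmers = list(kmers(read, k))
    if not read_kmers:
        return False
    hits = sum(1 for kmer in read_kmers if kmer in bf)
    return hits / len(read_kmers) >= min_fraction


reference = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"  # toy sequence, made up
bf = build_filter(reference)
on_target = matches_reference(reference[4:20], bf)
off_target = matches_reference("GGGGGGGGGGGGGGGG", bf)
```

A Bloom filter never misses a k-mer that was inserted but can report false positives, which is why the sketch requires a fraction of matching k-mers rather than a single hit before a read would be depleted or kept.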
Affiliation(s)
- Ahmad Lutfi
  - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
  - Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
- Kilian Rutzen
  - Genome Sequencing Unit (MF2), Robert Koch Institute, 13353 Berlin, Germany
|
14
|
Voigt B, Fischer O, Krumnow C, Herta C, Dabrowski PW. NGS read classification using AI. PLoS One 2021; 16:e0261548. [PMID: 34936673 PMCID: PMC8694450 DOI: 10.1371/journal.pone.0261548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2021] [Accepted: 12/03/2021] [Indexed: 11/19/2022]
Abstract
Clinical metagenomics is a powerful diagnostic tool, as it offers an open view into all DNA in a patient's sample. This allows the detection of pathogens that would slip through the cracks of classical specific assays. However, due to this unspecific nature of metagenomic sequencing, a huge amount of unspecific data is generated during the sequencing itself, and the diagnosis only takes place at the data analysis stage, where relevant sequences are filtered out. Typically, this is done by comparison to reference databases. While this approach has been optimized over the past years and works well to detect pathogens that are represented in the databases used, a common challenge in analysing a metagenomic patient sample arises when no pathogen sequences are found: how to determine whether truly no evidence of a pathogen is present in the data, or whether the pathogen's genome is simply absent from the database and the sequences in the dataset could thus not be classified? Here, we present a novel approach to this problem of detecting novel pathogens in metagenomic datasets by classifying the (segments of) proteins encoded by the sequences in the datasets. We train a neural network on coding sequences, labeled by taxonomic domain, and use this neural network to predict the taxonomic classification of sequences that cannot be classified by comparison to a reference database, thus facilitating the detection of potential novel pathogens.
Affiliation(s)
- Benjamin Voigt
  - Center for Bio-Medical image and Information processing (CBMI), HTW University of Applied Sciences, Berlin, Germany
- Oliver Fischer
  - Center for Bio-Medical image and Information processing (CBMI), HTW University of Applied Sciences, Berlin, Germany
- Christian Krumnow
  - Center for Bio-Medical image and Information processing (CBMI), HTW University of Applied Sciences, Berlin, Germany
- Christian Herta
  - Center for Bio-Medical image and Information processing (CBMI), HTW University of Applied Sciences, Berlin, Germany
- Piotr Wojciech Dabrowski
  - Center for Bio-Medical image and Information processing (CBMI), HTW University of Applied Sciences, Berlin, Germany
|
15
|
Accessing Dietary Effects on the Rumen Microbiome: Different Sequencing Methods Tell Different Stories. Vet Sci 2021; 8:vetsci8070138. [PMID: 34357930 PMCID: PMC8310016 DOI: 10.3390/vetsci8070138] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/08/2021] [Revised: 07/02/2021] [Accepted: 07/14/2021] [Indexed: 12/29/2022]
Abstract
The current study employed both amplicon and shotgun sequencing to examine and compare the rumen microbiome in Angus bulls fed either a backgrounding diet (BCK) or a finishing diet (HG), to assess whether both methods produce comparable results. Rumen digesta samples from 16 bulls were subjected to microbial profiling. Distinctive microbial profiles were revealed by the two methods, indicating that choice of sequencing approach may be a critical facet in studies of the rumen microbiome. Shotgun-sequencing identified the presence of 303 bacterial genera and 171 archaeal species, several of which exhibited differential abundance. Amplicon-sequencing identified 48 bacterial genera, 4 archaeal species, and 9 protozoal species. Among them, 20 bacterial genera and 5 protozoal species were differentially abundant between the two diets. Overall, amplicon-sequencing showed a more drastic diet-derived effect on the ruminal microbial profile compared to shotgun-sequencing. While both methods detected dietary differences at various taxonomic levels, few consistent patterns were evident. Opposite results were seen for the phyla Firmicutes and Bacteroidetes, and the genus Selenomonas. This study showcases the importance of sequencing platform choice and suggests a need for integrative methods that allow robust comparisons of microbial data drawn from various omic approaches, allowing for comprehensive comparisons across studies.
|
16
|
Dall'Olio D, Curti N, Fonzi E, Sala C, Remondini D, Castellani G, Giampieri E. Impact of concurrency on the performance of a whole exome sequencing pipeline. BMC Bioinformatics 2021; 22:60. [PMID: 33563206 PMCID: PMC7874478 DOI: 10.1186/s12859-020-03780-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/28/2020] [Accepted: 09/24/2020] [Indexed: 11/12/2022]
Abstract
Background Current high-throughput technologies—i.e. whole genome sequencing, RNA-Seq, ChIP-Seq, etc.—generate huge amounts of data and their usage gets more widespread with each passing year. Complex analysis pipelines involving several computationally-intensive steps have to be applied on an increasing number of samples. Workflow management systems allow parallelization and a more efficient usage of computational power. Nevertheless, this mostly happens by assigning the available cores to a single or few samples’ pipeline at a time. We refer to this approach as naive parallel strategy (NPS). Here, we discuss an alternative approach, which we refer to as concurrent execution strategy (CES), which equally distributes the available processors across every sample’s pipeline. Results Theoretically, we show that the CES results, under loose conditions, in a substantial speedup, with an ideal gain range spanning from 1 to the number of samples. Also, we observe that the CES yields even faster executions since parallelly computable tasks scale sub-linearly. Practically, we tested both strategies on a whole exome sequencing pipeline applied to three publicly available matched tumour-normal sample pairs of gastrointestinal stromal tumour. The CES achieved speedups in latency up to 2–2.4 compared to the NPS. Conclusions Our results hint that if resources distribution is further tailored to fit specific situations, an even greater gain in performance of multiple samples pipelines execution could be achieved. For this to be feasible, a benchmarking of the tools included in the pipeline would be necessary. It is our opinion these benchmarks should be consistently performed by the tools’ developers. Finally, these results suggest that concurrent strategies might also lead to energy and cost savings by making feasible the usage of low power machine clusters.
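The stated gain range, from 1 up to the number of samples, can be reproduced with a small latency model. The sketch below assumes each sample's pipeline follows Amdahl's law with a parallelizable fraction `parallel_fraction`; that model, the function names, and the example numbers are assumptions chosen for illustration and are not the paper's own derivation.

```python
def task_time(serial_time, cores, parallel_fraction):
    """Amdahl-style run time of one sample's pipeline on `cores` cores."""
    return serial_time * ((1.0 - parallel_fraction) + parallel_fraction / cores)


def nps_latency(n_samples, total_cores, serial_time, parallel_fraction):
    """Naive parallel strategy: samples run one after another, each on all cores."""
    return n_samples * task_time(serial_time, total_cores, parallel_fraction)


def ces_latency(n_samples, total_cores, serial_time, parallel_fraction):
    """Concurrent execution strategy: all samples run at once on an equal core share."""
    return task_time(serial_time, total_cores / n_samples, parallel_fraction)


def ces_speedup(n_samples, total_cores, parallel_fraction, serial_time=1.0):
    """Latency ratio NPS / CES; values above 1 favor the concurrent strategy."""
    nps = nps_latency(n_samples, total_cores, serial_time, parallel_fraction)
    ces = ces_latency(n_samples, total_cores, serial_time, parallel_fraction)
    return nps / ces


# Illustrative setting: 6 samples on 24 cores (numbers are made up).
perfect_scaling = ces_speedup(6, 24, parallel_fraction=1.0)  # tasks scale linearly
no_scaling = ces_speedup(6, 24, parallel_fraction=0.0)       # tasks do not scale
partial_scaling = ces_speedup(6, 24, parallel_fraction=0.9)
```

Under this model the two extremes bracket the range quoted in the abstract: perfectly scalable tasks give no gain, tasks that do not scale at all give a gain equal to the number of samples, and anything sub-linear lands in between.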
Affiliation(s)
- Daniele Dall'Olio
  - Department of Physics and Astronomy, University of Bologna, 40127, Bologna, BO, Italy
- Nico Curti
  - Department of Experimental, Diagnostic and Specialty Medicine, University of Bologna, 40138, Bologna, BO, Italy
- Eugenio Fonzi
  - Istituto Scientifico Romagnolo per lo Studio e la Cura dei Tumori (IRST) IRCCS, 47014, Meldola, Italy
- Claudia Sala
  - Department of Physics and Astronomy, University of Bologna, 40127, Bologna, BO, Italy
- Daniel Remondini
  - Department of Physics and Astronomy, University of Bologna, 40127, Bologna, BO, Italy
- Gastone Castellani
  - Department of Experimental, Diagnostic and Specialty Medicine, University of Bologna, 40138, Bologna, BO, Italy
- Enrico Giampieri
  - Department of Experimental, Diagnostic and Specialty Medicine, University of Bologna, 40138, Bologna, BO, Italy
|
17
|
Prasanna A, Niranjan V. Clin-mNGS: Automated Pipeline for Pathogen Detection from Clinical Metagenomic Data. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200608130029] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022]
Abstract
Background:
Since bacteria are the earliest known organisms, there has been significant interest in their variety and biology, most certainly concerning human health. Recent advances in Metagenomics sequencing (mNGS), a culture-independent sequencing technology, have facilitated an accelerated development in clinical microbiology and our understanding of pathogens.
Objective:
For the implementation of mNGS in routine clinical practice to become feasible, a practical and scalable strategy for the study of mNGS data is essential. This study presents a robust automated pipeline to analyze clinical metagenomic data for pathogen identification and classification.
Method:
The proposed Clin-mNGS pipeline is an integrated, open-source, scalable, reproducible, and user-friendly framework scripted using the Snakemake workflow management software. The implementation avoids the hassle of manual installation and configuration of the multiple command-line tools and dependencies. The approach directly screens pathogens from clinical raw reads and generates consolidated reports for each sample.
Results:
The pipeline is demonstrated using publicly available data and is tested on a desktop Linux system and a high-performance cluster. The study compares variability in results from different tools and versions. The versions of the tools are made user modifiable. The pipeline results in quality check, filtered reads, host subtraction, assembled contigs, assembly metrics, relative abundances of bacterial species, antimicrobial resistance genes, plasmid finding, and virulence factors identification. The results obtained from the pipeline are evaluated based on sensitivity and positive predictive value.
Conclusion:
Clin-mNGS is an automated Snakemake pipeline validated for the analysis of microbial clinical metagenomics reads to perform taxonomic classification and antimicrobial resistance prediction.
Affiliation(s)
- Akshatha Prasanna
  - Department of Biotechnology, Rashtreeya Vidyalaya College of Engineering, Bengaluru, India
- Vidya Niranjan
  - Department of Biotechnology, Rashtreeya Vidyalaya College of Engineering, Bengaluru, India
|
18
|
Benavides A, Sanchez F, Alzate JF, Cabarcas F. DATMA: Distributed AuTomatic Metagenomic Assembly and annotation framework. PeerJ 2020; 8:e9762. [PMID: 32953263 PMCID: PMC7474881 DOI: 10.7717/peerj.9762] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2019] [Accepted: 07/28/2020] [Indexed: 11/20/2022]
Abstract
Background A prime objective in metagenomics is to classify DNA sequence fragments into taxonomic units. It usually requires several stages: read quality control, de novo assembly, contig annotation, gene prediction, etc. These stages need very efficient programs because of the number of reads from the projects. Furthermore, the complexity of metagenomes requires efficient and automatic tools that orchestrate the different stages. Method DATMA is a pipeline for fast metagenomic analysis that orchestrates the following: sequencing quality control, 16S rRNA-identification, reads binning, de novo assembly and evaluation, gene prediction, and taxonomic annotation. Its distributed computing model can use multiple computing resources to reduce the analysis time. Results We used a controlled experiment to show DATMA functionality and two pre-annotated metagenomes to compare its accuracy and speed against other metagenomic frameworks. Then, with DATMA we recovered a draft genome of a novel Anaerolineaceae from a biosolid metagenome. Conclusions DATMA is a bioinformatics tool that automatically analyzes complex metagenomes. It is faster than similar tools and, in some cases, it can extract genomes that the other tools do not. DATMA is freely available at https://github.com/andvides/DATMA.
Affiliation(s)
- Andres Benavides
  - Grupo GICEI, Facultad de Ingeniería Electrónica, Institución Universitaria Pascual Bravo, Medellín, Antioquia, Colombia
  - Grupo SISTEMIC, Ingeniería Electrónica, Facultad de Ingeniería, Universidad de Antioquia UdeA, Medellín, Colombia
- Friman Sanchez
  - Barcelona Supercomputing Center, currently at Smart Variable S.L., Barcelona, Spain
- Juan F. Alzate
  - Centro Nacional de Secuenciación Genómica-CNSG, Sede de Investigación Universitaria-SIU, Universidad de Antioquia UdeA, Medellín, Colombia
  - Departamento de Microbiología y Parasitología, Facultad de Medicina, Universidad de Antioquia UdeA, Medellín, Colombia
- Felipe Cabarcas
  - Centro Nacional de Secuenciación Genómica-CNSG, Sede de Investigación Universitaria-SIU, Universidad de Antioquia UdeA, Medellín, Colombia
  - Grupo SISTEMIC, Ingeniería Electrónica, Facultad de Ingeniería, Universidad de Antioquia UdeA, Medellín, Colombia
|
19
|
Muñoz-Benavent M, Hartkopf F, Van Den Bossche T, Piro VC, García-Ferris C, Latorre A, Renard BY, Muth T. gNOMO: a multi-omics pipeline for integrated host and microbiome analysis of non-model organisms. NAR Genom Bioinform 2020; 2:lqaa058. [PMID: 33575609 PMCID: PMC7671378 DOI: 10.1093/nargab/lqaa058] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/31/2020] [Revised: 06/19/2020] [Accepted: 08/03/2020] [Indexed: 01/14/2023]
Abstract
The study of bacterial symbioses has grown exponentially in the recent past. However, existing bioinformatic workflows for microbiome data analysis commonly do not integrate multiple meta-omics levels and are mainly geared toward human microbiomes. Microbiota are better understood when analyzed in their biological context; that is, together with their host or environment. Nevertheless, this is a limitation when studying non-model organisms, mainly due to the lack of well-annotated sequence references. Here, we present gNOMO, a bioinformatic pipeline that is specifically designed to process and analyze non-model organism samples of up to three meta-omics levels: metagenomics, metatranscriptomics and metaproteomics in an integrative manner. The pipeline has been developed using the workflow management framework Snakemake in order to obtain an automated and reproducible pipeline. Using experimental datasets of the German cockroach Blattella germanica, a non-model organism with a very complex gut microbiome, we show the capabilities of gNOMO with regard to meta-omics data integration, expression ratio comparison, taxonomic and functional analysis as well as intuitive output visualization. In conclusion, gNOMO is a bioinformatic pipeline that can easily be configured for integrating and analyzing multiple meta-omics data types and for producing output visualizations, specifically designed for integrating paired-end sequencing data with mass spectrometry from non-model organisms.
Affiliation(s)
- Maria Muñoz-Benavent
  - Institute for Integrative Systems Biology (I2SysBio), Universitat de València/CSIC, Paterna (València) 46980, Spain
- Felix Hartkopf
  - Bioinformatics Unit (MF 1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin 13353, Germany
- Vitor C Piro
  - Bioinformatics Unit (MF 1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin 13353, Germany
- Carlos García-Ferris
  - Institute for Integrative Systems Biology (I2SysBio), Universitat de València/CSIC, Paterna (València) 46980, Spain
- Amparo Latorre
  - Institute for Integrative Systems Biology (I2SysBio), Universitat de València/CSIC, Paterna (València) 46980, Spain
- Bernhard Y Renard
  - Bioinformatics Unit (MF 1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin 13353, Germany
- Thilo Muth
  - Bioinformatics Unit (MF 1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin 13353, Germany
|
20
|
Sim M, Lee J, Lee D, Kwon D, Kim J. TAMA: improved metagenomic sequence classification through meta-analysis. BMC Bioinformatics 2020; 21:185. [PMID: 32397982 PMCID: PMC7218625 DOI: 10.1186/s12859-020-3533-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/16/2020] [Accepted: 05/05/2020] [Indexed: 11/10/2022]
Abstract
BACKGROUND Microorganisms are important occupants of many different environments. Identifying the composition of microbes and estimating their abundance promote understanding of interactions of microbes in environmental samples. To understand their environments more deeply, the composition of microorganisms in environmental samples has been studied using metagenomes, which are the collections of genomes of the microorganisms. Although many tools have been developed for taxonomy analysis based on different algorithms, variability of analysis outputs of existing tools from the same input metagenome datasets is the main obstacle for many researchers in this field. RESULTS Here, we present a novel meta-analysis tool for metagenome taxonomy analysis, called TAMA, by intelligently integrating outputs from three different taxonomy analysis tools. Using an integrated reference database, TAMA performs taxonomy assignment for input metagenome reads based on a meta-score by integrating scores of taxonomy assignment from different taxonomy classification tools. TAMA outperformed existing tools when evaluated using various benchmark datasets. It was also successfully applied to obtain relative species abundance profiles and difference in composition of microorganisms in two types of cheese metagenome and human gut metagenome. CONCLUSION TAMA can be easily installed and used for metagenome read classification and the prediction of relative species abundance from multiple numbers and types of metagenome read samples. TAMA can be used to more accurately uncover the composition of microorganisms in metagenome samples collected from various environments, especially when the use of a single taxonomy analysis tool is unreliable. TAMA is an open source tool, and can be downloaded at https://github.com/jkimlab/TAMA.
Affiliation(s)
- Mikang Sim
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Jongin Lee
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Daehwan Lee
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Daehong Kwon
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
- Jaebum Kim
  - Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
|
21
|
Khachatryan L, de Leeuw RH, Kraakman MEM, Pappas N, Te Raa M, Mei H, de Knijff P, Laros JFJ. Taxonomic classification and abundance estimation using 16S and WGS-A comparison using controlled reference samples. Forensic Sci Int Genet 2020; 46:102257. [PMID: 32058299 DOI: 10.1016/j.fsigen.2020.102257] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 06/09/2019] [Revised: 12/30/2019] [Accepted: 01/27/2020] [Indexed: 12/30/2022]
Abstract
The assessment of microbiome biodiversity is the most common application of metagenomics. While 16S sequencing remains standard procedure for taxonomic profiling of metagenomic data, a growing number of studies have clearly demonstrated biases associated with this method. By using Whole Genome Shotgun sequencing (WGS) metagenomics, most of the known restrictions associated with 16S data are alleviated. However, due to the computationally intensive data analyses and higher sequencing costs, WGS-based metagenomics remains a less popular option. Selecting the experiment type that provides a comprehensive, yet manageable amount of information is a challenge encountered in many metagenomics studies. In this work, we created a series of artificial bacterial mixes, each with a different distribution of skin-associated microbial species. These mixes were used to estimate the resolution of two different metagenomic experiments, 16S and WGS, and to evaluate several different bioinformatics approaches for taxonomic read classification. In all test cases, WGS approaches provide much more accurate results, in terms of taxa prediction and abundance estimation, in comparison to those of 16S. Furthermore, we demonstrate that a 16S dataset, analysed using different state-of-the-art techniques and reference databases, can produce widely different results. In light of the fact that most forensic metagenomic analyses are still performed using 16S data, our results are especially important.
Affiliation(s)
- Lusine Khachatryan
  - Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands
- Rick H de Leeuw
  - Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands
- Margriet E M Kraakman
  - Department of Medical Microbiology, Leiden University Medical Center, Leiden, the Netherlands
- Nikos Pappas
  - Sequencing Analysis Support Core, Leiden University Medical Center, Leiden, the Netherlands
- Marije Te Raa
  - Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands
- Hailiang Mei
  - Sequencing Analysis Support Core, Leiden University Medical Center, Leiden, the Netherlands
- Peter de Knijff
  - Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands
- Jeroen F J Laros
  - Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands; Department of Clinical Genetics, Leiden University Medical Center, Leiden, the Netherlands
|
22
|
Cooper RO, Cressler CE. Characterization of key bacterial species in the Daphnia magna microbiota using shotgun metagenomics. Sci Rep 2020; 10:652. [PMID: 31959775 PMCID: PMC6971282 DOI: 10.1038/s41598-019-57367-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 06/20/2019] [Accepted: 12/24/2019] [Indexed: 12/28/2022]
Abstract
The keystone zooplankton Daphnia magna has recently been used as a model system for understanding host-microbiota interactions. However, the bacterial species present and functions associated with their genomes are not well understood. In order to understand potential functions of these species, we combined 16S rRNA sequencing and shotgun metagenomics to characterize the whole-organism microbiota of Daphnia magna. We assembled five potentially novel metagenome-assembled genomes (MAGs) of core bacteria in Daphnia magna. Genes involved in host colonization and immune system evasion were detected across the MAGs. Some metabolic pathways were specific to some MAGs, including sulfur oxidation, nitrate reduction, and flagellar assembly. Amino acid exporters were identified in MAGs identified as important for host fitness, and pathways for key vitamin biosynthesis and export were identified across MAGs. In total, our examination of functions in these MAGs shows a diversity of nutrient acquisition and metabolism pathways present that may benefit the host, as well as genomic signatures of host association and immune system evasion.
Affiliation(s)
- Reilly O Cooper
  - School of Biological Sciences, University of Nebraska, Lincoln, NE, USA
|
23
|
Gihawi A, Rallapalli G, Hurst R, Cooper CS, Leggett RM, Brewer DS. SEPATH: benchmarking the search for pathogens in human tissue whole genome sequence data leads to template pipelines. Genome Biol 2019; 20:208. [PMID: 31639030 PMCID: PMC6805339 DOI: 10.1186/s13059-019-1819-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/17/2019] [Accepted: 09/11/2019] [Indexed: 02/07/2023]
Abstract
BACKGROUND Human tissue is increasingly being whole genome sequenced as we transition into an era of genomic medicine. With this arises the potential to detect sequences originating from microorganisms, including pathogens amid the plethora of human sequencing reads. In cancer research, the tumorigenic ability of pathogens is being recognized, for example, Helicobacter pylori and human papillomavirus in the cases of gastric non-cardia and cervical carcinomas, respectively. As of yet, no benchmark has been carried out on the performance of computational approaches for bacterial and viral detection within host-dominated sequence data. RESULTS We present the results of benchmarking over 70 distinct combinations of tools and parameters on 100 simulated cancer datasets spiked with realistic proportions of bacteria. mOTUs2 and Kraken are the highest performing individual tools achieving median genus-level F1 scores of 0.90 and 0.91, respectively. mOTUs2 demonstrates a high performance in estimating bacterial proportions. Employing Kraken on unassembled sequencing reads produces a good but variable performance depending on post-classification filtering parameters. These approaches are investigated on a selection of cervical and gastric cancer whole genome sequences where Alphapapillomavirus and Helicobacter are detected in addition to a variety of other interesting genera. CONCLUSIONS We provide the top-performing pipelines from this benchmark in a unifying tool called SEPATH, which is amenable to high throughput sequencing studies across a range of high-performance computing clusters. SEPATH provides a benchmarked and convenient approach to detect pathogens in tissue sequence data helping to determine the relationship between metagenomics and disease.
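For readers unfamiliar with the metric, a genus-level F1 score like the ones quoted above can be computed from the set of genera a tool reports and the set actually present in the sample. The sketch below is a generic illustration with made-up genus lists; it is not SEPATH's evaluation code, and the function name is hypothetical.

```python
def precision_recall_f1(predicted, truth):
    """Compute precision, recall, and F1 for predicted vs. true taxon sets."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: genera present in a sample vs. genera reported by a classifier.
truth = {"Helicobacter", "Escherichia", "Fusobacterium", "Bacteroides"}
predicted = {"Helicobacter", "Escherichia", "Fusobacterium", "Pseudomonas"}
precision, recall, f1 = precision_recall_f1(predicted, truth)
```

Because F1 is the harmonic mean of precision and recall, a classifier cannot score well by reporting many genera indiscriminately or by reporting only a few high-confidence ones.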
Affiliation(s)
- Abraham Gihawi
  - Norwich Medical School, University of East Anglia, Bob Champion Research and Education Building, Norwich, NR4 7UQ UK
- Ghanasyam Rallapalli
  - Norwich Medical School, University of East Anglia, Bob Champion Research and Education Building, Norwich, NR4 7UQ UK
- Rachel Hurst
  - Norwich Medical School, University of East Anglia, Bob Champion Research and Education Building, Norwich, NR4 7UQ UK
- Colin S. Cooper
  - Norwich Medical School, University of East Anglia, Bob Champion Research and Education Building, Norwich, NR4 7UQ UK
  - Functional Crosscutting Genomics England Clinical Interpretation Partnership (GeCIP) Domain Lead, 100,000 Genomes Project, Genomics England, London, UK
- Daniel S. Brewer
  - Norwich Medical School, University of East Anglia, Bob Champion Research and Education Building, Norwich, NR4 7UQ UK
  - Norwich Research Park, Earlham Institute, Norwich, NR4 7UZ UK
|
24
|
Ye SH, Siddle KJ, Park DJ, Sabeti PC. Benchmarking Metagenomics Tools for Taxonomic Classification. Cell 2019; 178:779-794. [PMID: 31398336 PMCID: PMC6716367 DOI: 10.1016/j.cell.2019.07.010] [Citation(s) in RCA: 249] [Impact Index Per Article: 49.8] [Received: 04/15/2019] [Revised: 06/18/2019] [Accepted: 07/08/2019] [Indexed: 01/17/2023]
Abstract
Metagenomic sequencing is revolutionizing the detection and characterization of microbial species, and a wide variety of software tools are available to perform taxonomic classification of these data. The fast pace of development of these tools and the complexity of metagenomic data make it important that researchers are able to benchmark their performance. Here, we review current approaches for metagenomic analysis and evaluate the performance of 20 metagenomic classifiers using simulated and experimental datasets. We describe the key metrics used to assess performance, offer a framework for the comparison of additional classifiers, and discuss the future of metagenomic data analysis.
Affiliation(s)
- Simon H Ye
  - Harvard-MIT Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Katherine J Siddle
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Center for Systems Biology, Department of Organismal and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
- Daniel J Park
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Pardis C Sabeti
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Center for Systems Biology, Department of Organismal and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA; Department of Immunology and Infectious Disease, Harvard School of Public Health, Boston, MA 02115, USA; Howard Hughes Medical Institute (HHMI), Chevy Chase, MD 20815, USA
|
25
|
Seiler E, Trappe K, Renard BY. Where did you come from, where did you go: Refining metagenomic analysis tools for horizontal gene transfer characterisation. PLoS Comput Biol 2019; 15:e1007208. [PMID: 31335917 PMCID: PMC6677323 DOI: 10.1371/journal.pcbi.1007208] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/11/2019] [Revised: 08/02/2019] [Accepted: 06/24/2019] [Indexed: 12/22/2022]
Abstract
Horizontal gene transfer (HGT) has changed the way we regard evolution. Instead of waiting for the next generation to establish new traits, bacteria in particular are able to take a shortcut via HGT that enables them to pass on genes from one individual to another, even across species boundaries. The tool Daisy offers the first HGT detection approach based on read mapping that provides complementary evidence compared to existing methods. However, Daisy relies on the acceptor and donor organism involved in the HGT being known. We introduce DaisyGPS, a mapping-based pipeline that is able to identify acceptor and donor reference candidates of an HGT event based on sequencing reads. Acceptor and donor identification is akin to species identification in metagenomic samples based on sequencing reads, a problem addressed by metagenomic profiling tools. However, acceptor and donor references have certain properties such that these methods cannot be directly applied. DaisyGPS uses MicrobeGPS, a metagenomic profiling tool tailored towards estimating the genomic distance between organisms in the sample and the reference database. We enhance the underlying scoring system of MicrobeGPS to account for the sequence patterns in terms of mapping coverage of an acceptor and donor involved in an HGT event, and report a ranked list of reference candidates. These candidates can then be further evaluated by tools like Daisy to establish HGT regions. We successfully validated our approach on both simulated and real data, and show its benefits in an investigation of an outbreak involving Methicillin-resistant Staphylococcus aureus data.
Affiliation(s)
- Enrico Seiler
  - Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
  - Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, and Algorithmic Bioinformatics, Institute for Bioinformatics, Freie Universität Berlin, Berlin, Germany
- Kathrin Trappe
  - Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
- Bernhard Y. Renard
  - Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
|
26
|
Somerville V, Lutz S, Schmid M, Frei D, Moser A, Irmler S, Frey JE, Ahrens CH. Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system. BMC Microbiol 2019; 19:143. [PMID: 31238873 PMCID: PMC6593500 DOI: 10.1186/s12866-019-1500-0] [Citation(s) in RCA: 76] [Impact Index Per Article: 15.2] [Received: 12/20/2018] [Accepted: 05/31/2019] [Indexed: 01/18/2023]
Abstract
BACKGROUND Complete and contiguous genome assemblies greatly improve the quality of subsequent systems-wide functional profiling studies and the ability to gain novel biological insights. While a de novo genome assembly of an isolated bacterial strain is in most cases straightforward, more informative data about co-existing bacteria as well as synergistic and antagonistic effects can be obtained from a direct analysis of microbial communities. However, the complexity of metagenomic samples represents a major challenge. While third generation sequencing technologies have been suggested to enable finished metagenome-assembled genomes, to our knowledge, the complete genome assembly of all dominant strains in a microbiome sample has not been demonstrated. Natural whey starter cultures (NWCs) are used in cheese production and represent low-complexity microbiomes. Previous studies of Swiss Gruyère and selected Italian hard cheeses, mostly based on amplicon metagenomics, concurred that three species generally pre-dominate: Streptococcus thermophilus, Lactobacillus helveticus and Lactobacillus delbrueckii. RESULTS Two NWCs from Swiss Gruyère producers were subjected to whole metagenome shotgun sequencing using the Pacific Biosciences Sequel and Illumina MiSeq platforms. In addition, longer Oxford Nanopore Technologies MinION reads had to be generated for one to resolve repeat regions. Thereby, we achieved the complete assembly of all dominant bacterial genomes from these low-complexity NWCs, which was corroborated by a 16S rRNA amplicon survey. Moreover, two distinct L. helveticus strains were successfully co-assembled from the same sample. Besides bacterial chromosomes, we could also assemble several bacterial plasmids and phages and a corresponding prophage. 
Biologically relevant insights were uncovered by linking the plasmids and phages to their respective host genomes using DNA methylation motifs on the plasmids and by matching prokaryotic CRISPR spacers with the corresponding protospacers on the phages. These results could only be achieved by employing long-read sequencing data able to span intragenomic as well as intergenomic repeats. CONCLUSIONS Here, we demonstrate the feasibility of complete de novo genome assembly of all dominant strains from low-complexity NWCs based on whole metagenomics shotgun sequencing data. This allowed us to gain novel biological insights and is a fundamental basis for subsequent systems-wide omics analyses, functional profiling and phenotype to genotype analysis of specific microbial communities.
Affiliation(s)
- Vincent Somerville
  - Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
  - SIB Swiss Institute of Bioinformatics, CH-8820 Wädenswil, Switzerland
- Stefanie Lutz
  - Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
  - SIB Swiss Institute of Bioinformatics, CH-8820 Wädenswil, Switzerland
- Michael Schmid
  - Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
  - SIB Swiss Institute of Bioinformatics, CH-8820 Wädenswil, Switzerland
- Daniel Frei
  - Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
- Aline Moser
  - Agroscope, Research Group Biochemistry of Milk and Microorganisms, CH-3003 Bern, Switzerland
- Stefan Irmler
  - Agroscope, Research Group Biochemistry of Milk and Microorganisms, CH-3003 Bern, Switzerland
- Jürg E. Frey
  - Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
- Christian H. Ahrens
  - Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics, Schloss 1, CH-8820 Wädenswil, Switzerland
  - SIB Swiss Institute of Bioinformatics, CH-8820 Wädenswil, Switzerland
|
27
|
Mesophilic Sporeformers Identified in Whey Powder by Using Shotgun Metagenomic Sequencing. Appl Environ Microbiol 2018; 84:AEM.01305-18. [PMID: 30076196 DOI: 10.1128/aem.01305-18] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 05/29/2018] [Accepted: 07/31/2018] [Indexed: 01/19/2023]
Abstract
Spoilage and pathogenic spore-forming bacteria are a major cause of concern for producers of dairy products. Traditional agar-based detection methods employed by the dairy industry have limitations with respect to their sensitivity and specificity. The aim of this study was to identify low-abundance sporeformers in samples of a powdered dairy product, whey powder, produced monthly over 1 year, using novel culture-independent shotgun metagenomics-based approaches. Although mesophilic sporeformers were the main target of this study, in one instance thermophilic sporeformers were also targeted using this culture-independent approach. For comparative purposes, mesophilic and thermophilic sporeformers were also tested for within the same sample using culture-based approaches. Ultimately, the approaches taken highlighted differences in the taxa identified due to treatment and isolation methods. Despite this, low levels of transient, mesophilic, and in some cases potentially pathogenic sporeformers were consistently detected in powder samples. Although the specific sporeformers changed from one month to the next, it was apparent that 3 groups of mesophilic sporeformers, namely, Bacillus cereus, Bacillus licheniformis/Bacillus paralicheniformis, and a third, more heterogeneous group containing Brevibacillus brevis, dominated across the 12 samples. Total thermophilic sporeformer taxonomy was considerably different from mesophilic taxonomy, as well as from the culturable thermophilic taxonomy, in the one sample analyzed by all four approaches. Ultimately, through the application of shotgun metagenomic sequencing to dairy powders, the potential for this technology to facilitate the detection of undesirable bacteria present in these food ingredients is highlighted. IMPORTANCE The ability of sporeformers to remain dormant in a desiccated state is of concern from a safety and spoilage perspective in dairy powder.
Traditional culturing techniques are slow and provide little information without further investigation. We describe the identification of mesophilic sporeformers present in powders produced over 1 year, using novel shotgun metagenomic sequencing. This method allows detection and identification of possible pathogens and spoilage bacteria in parallel. Strain-level analysis and functional gene analysis, such as identification of toxin genes, were also performed. This approach has the potential to be of great value with respect to the detection of spore-forming bacteria and could allow a processor to make an informed decision surrounding process changes to reduce the risk of spore contamination.
|
28
|
Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 2018; 6:158. [PMID: 30219103 PMCID: PMC6138922 DOI: 10.1186/s40168-018-0541-1] [Citation(s) in RCA: 955] [Impact Index Per Article: 159.2] [Received: 03/06/2018] [Accepted: 08/29/2018] [Indexed: 05/18/2023]
Abstract
BACKGROUND The study of microbiomes using whole-metagenome shotgun sequencing enables the analysis of uncultivated microbial populations that may have important roles in their environments. Extracting individual draft genomes (bins) facilitates metagenomic analysis at the single genome level. Software and pipelines for such analysis have become diverse and sophisticated, resulting in a significant burden for biologists to access and use them. Furthermore, while bin extraction algorithms are rapidly improving, there is still a lack of tools for their evaluation and visualization. RESULTS To address these challenges, we present metaWRAP, a modular pipeline software for shotgun metagenomic data analysis. MetaWRAP deploys state-of-the-art software to handle metagenomic data processing starting from raw sequencing reads and ending in metagenomic bins and their analysis. MetaWRAP is flexible enough to give investigators control over the analysis, while still being easy-to-install and easy-to-use. It includes hybrid algorithms that leverage the strengths of a variety of software to extract and refine high-quality bins from metagenomic data through bin consolidation and reassembly. MetaWRAP's hybrid bin extraction algorithm outperforms individual binning approaches and other bin consolidation programs in both synthetic and real data sets. Finally, metaWRAP comes with numerous modules for the analysis of metagenomic bins, including taxonomy assignment, abundance estimation, functional annotation, and visualization. CONCLUSIONS MetaWRAP is an easy-to-use modular pipeline that automates the core tasks in metagenomic analysis, while contributing significant improvements to the extraction and interpretation of high-quality metagenomic bins. The bin refinement and reassembly modules of metaWRAP consistently outperform other binning approaches. 
Each module of metaWRAP is also a standalone component, making it a flexible and versatile tool for tackling metagenomic shotgun sequencing data. MetaWRAP is open-source software available at https://github.com/bxlab/metaWRAP .
Affiliation(s)
- Gherman V. Uritskiy
  - Department of Biology, Johns Hopkins University, 3400 N Charles St., Baltimore, MD 21218 USA
- Jocelyne DiRuggiero
  - Department of Biology, Johns Hopkins University, 3400 N Charles St., Baltimore, MD 21218 USA
- James Taylor
  - Department of Biology, Johns Hopkins University, 3400 N Charles St., Baltimore, MD 21218 USA
|
29
|
Walsh AM, Crispie F, O'Sullivan O, Finnegan L, Claesson MJ, Cotter PD. Species classifier choice is a key consideration when analysing low-complexity food microbiome data. Microbiome 2018; 6:50. [PMID: 29554948 PMCID: PMC5859664 DOI: 10.1186/s40168-018-0437-0] [Citation(s) in RCA: 46] [Impact Index Per Article: 7.7] [Received: 12/01/2017] [Accepted: 03/05/2018] [Indexed: 05/03/2023]
Abstract
BACKGROUND The use of shotgun metagenomics to analyse low-complexity microbial communities in foods has the potential to be of considerable fundamental and applied value. However, there is currently no consensus with respect to choice of species classification tool, platform, or sequencing depth. Here, we benchmarked the performances of three high-throughput short-read sequencing platforms, the Illumina MiSeq, NextSeq 500, and Ion Proton, for shotgun metagenomics of food microbiota. Briefly, we sequenced six kefir DNA samples and a mock community DNA sample, the latter constructed by evenly mixing genomic DNA from 13 food-related bacterial species. A variety of bioinformatic tools were used to analyse the data generated, and the effects of sequencing depth on these analyses were tested by randomly subsampling reads. RESULTS Compositional analysis results were consistent between the platforms at divergent sequencing depths. However, we observed pronounced differences in the predictions from species classification tools. Indeed, PERMANOVA indicated that there were no significant differences between the compositional results generated by the different sequencers (p = 0.693, R² = 0.011), but there was a significant difference between the results predicted by the species classifiers (p = 0.01, R² = 0.127). The relative abundances predicted by the classifiers, apart from MetaPhlAn2, were apparently biased by reference genome sizes. Additionally, we observed varying false-positive rates among the classifiers. MetaPhlAn2 had the lowest false-positive rate, whereas SLIMM had the greatest false-positive rate. Strain-level analysis results were also similar across platforms. Each platform correctly identified the strains present in the mock community, but accuracy was improved slightly with greater sequencing depth. Notably, PanPhlAn detected the dominant strains in each kefir sample above 500,000 reads per sample.
Again, the outputs from functional profiling analysis using SUPER-FOCUS were generally accordant between the platforms at different sequencing depths. Finally, and expectedly, metagenome assembly completeness was significantly lower on the MiSeq than either on the NextSeq (p = 0.03) or the Proton (p = 0.011), and it improved with increased sequencing depth. CONCLUSIONS Our results demonstrate a remarkable similarity in the results generated by the three sequencing platforms at different sequencing depths, and, in fact, the choice of bioinformatics methodology had a more evident impact on results than the choice of sequencer did.
Affiliation(s)
- Aaron M Walsh
  - Teagasc Food Research Centre, Moorepark, Fermoy, Co. Cork, Ireland
  - APC Microbiome Institute, University College Cork, Co. Cork, Ireland
  - Microbiology Department, University College Cork, Co. Cork, Ireland
- Fiona Crispie
  - Teagasc Food Research Centre, Moorepark, Fermoy, Co. Cork, Ireland
  - APC Microbiome Institute, University College Cork, Co. Cork, Ireland
- Orla O'Sullivan
  - Teagasc Food Research Centre, Moorepark, Fermoy, Co. Cork, Ireland
  - APC Microbiome Institute, University College Cork, Co. Cork, Ireland
- Laura Finnegan
  - Teagasc Food Research Centre, Moorepark, Fermoy, Co. Cork, Ireland
  - APC Microbiome Institute, University College Cork, Co. Cork, Ireland
- Marcus J Claesson
  - APC Microbiome Institute, University College Cork, Co. Cork, Ireland
  - Microbiology Department, University College Cork, Co. Cork, Ireland
- Paul D Cotter
  - Teagasc Food Research Centre, Moorepark, Fermoy, Co. Cork, Ireland
  - APC Microbiome Institute, University College Cork, Co. Cork, Ireland
|
30
|
Neves ALA, Li F, Ghoshal B, McAllister T, Guan LL. Enhancing the Resolution of Rumen Microbial Classification from Metatranscriptomic Data Using Kraken and Mothur. Front Microbiol 2017; 8:2445. [PMID: 29270165 PMCID: PMC5725470 DOI: 10.3389/fmicb.2017.02445] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Received: 09/28/2017] [Accepted: 11/24/2017] [Indexed: 12/23/2022]
Abstract
The advent of next generation sequencing and bioinformatics tools has greatly advanced our knowledge about the phylogenetic diversity and ecological role of microbes inhabiting the mammalian gut. However, there is a lack of information on the evaluation of these computational tools in the context of the rumen microbiome as these programs have mostly been benchmarked on real or simulated datasets generated from human studies. In this study, we compared the outcomes of two methods, Kraken (mRNA based) and a pipeline developed in-house based on Mothur (16S rRNA based), to assess the taxonomic profiles (bacteria and archaea) of rumen microbial communities using total RNA sequencing of rumen fluid collected from 12 cattle with differing feed conversion ratios (FCR). Both approaches revealed a similar phyla distribution of the most abundant taxa, with Bacteroidetes, Firmicutes, and Proteobacteria accounting for approximately 80% of total bacterial abundance. For bacterial taxa, although 69 genera were commonly detected by both methods, an additional 159 genera were exclusively identified by Kraken. Kraken detected 423 species, while Mothur was not able to assign bacterial sequences to the species level. For archaea, both methods generated similar results only for the abundance of Methanomassiliicoccaceae (previously referred to as RCC), which comprised more than 65% of the total archaeal families. Taxon R4-41B was exclusively identified by Mothur in the rumen of feed efficient bulls, whereas Kraken uniquely identified Methanococcaceae in inefficient bulls. Although Kraken enhanced the microbial classification at the species level, identification of bacteria or archaea in the rumen is limited due to a lack of reference genomes for the rumen microbiome. The findings from this study suggest that the development of the combined pipelines using Mothur and Kraken is needed for a more inclusive and representative classification of microbiomes.
Affiliation(s)
- Andre L A Neves
  - Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, AB, Canada
- Fuyong Li
  - Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, AB, Canada
- Bibaswan Ghoshal
  - Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, AB, Canada
- Tim McAllister
  - Lethbridge Research Centre, Agriculture and Agri-Food Canada, Lethbridge, AB, Canada
- Le L Guan
  - Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, AB, Canada
|