1
|
Mallawaarachchi V, Wickramarachchi A, Xue H, Papudeshi B, Grigson SR, Bouras G, Prahl RE, Kaphle A, Verich A, Talamantes-Becerra B, Dinsdale EA, Edwards RA. Solving genomic puzzles: computational methods for metagenomic binning. Brief Bioinform 2024; 25:bbae372. [PMID: 39082646 PMCID: PMC11289683 DOI: 10.1093/bib/bbae372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Revised: 06/05/2024] [Accepted: 07/15/2024] [Indexed: 08/03/2024]
Abstract
Metagenomics involves the study of genetic material obtained directly from communities of microorganisms living in natural environments. The field of metagenomics has provided valuable insights into the structure, diversity and ecology of microbial communities. Once an environmental sample is sequenced and processed, metagenomic binning clusters the sequences into bins representing different taxonomic groups such as species, genera, or higher levels. Several computational tools have been developed to automate the process of metagenomic binning. These tools have enabled the recovery of novel draft genomes of microorganisms allowing us to study their behaviors and functions within microbial communities. This review classifies and analyzes different approaches of metagenomic binning and different refinement, visualization, and evaluation techniques used by these methods. Furthermore, the review highlights the current challenges and areas of improvement present within the field of research.
Affiliation(s)
- Vijini Mallawaarachchi
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Anuradha Wickramarachchi
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Hansheng Xue
  - School of Computing, National University of Singapore, Singapore 119077, Singapore
- Bhavya Papudeshi
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Susanna R Grigson
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- George Bouras
  - Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
  - The Department of Surgery—Otolaryngology Head and Neck Surgery, University of Adelaide and the Basil Hetzel Institute for Translational Health Research, Central Adelaide Local Health Network, Adelaide, SA 5011, Australia
- Rosa E Prahl
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Anubhav Kaphle
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Andrey Verich
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
  - The Kirby Institute, The University of New South Wales, Randwick, Sydney, NSW 2052, Australia
- Berenice Talamantes-Becerra
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Elizabeth A Dinsdale
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Robert A Edwards
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
|
2
|
Lema NK, Gemeda MT, Woldesemayat AA. Recent Advances in Metagenomic Approaches, Applications, and Challenge. Curr Microbiol 2023; 80:347. [PMID: 37733134 DOI: 10.1007/s00284-023-03451-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/30/2022] [Accepted: 08/20/2023] [Indexed: 09/22/2023]
Abstract
Advances in metagenomic analysis, driven by the advent of next-generation sequencing, have extended our knowledge of microbial communities beyond what conventional techniques allowed, providing an advanced approach to identifying novel and uncultivable microorganisms based on genetic information derived from a particular environment. Shotgun metagenomics investigates the DNA of the entire community without requiring PCR amplification and gives access to all genes present in the sample. Amplicon sequencing, on the other hand, targets taxonomically important marker genes, so its analysis is restricted to previously known DNA sequences. Sequence-based metagenomics analyzes DNA sequences directly from the environment without requiring library construction, but its ability to identify novel genes and products is limited and can be complemented by functional genomics. Function-based metagenomics, by contrast, requires fragmentation and cloning of the extracted metagenomic DNA in a suitable host, followed by functional screening and sequencing of clones to detect novel genes. Despite these advances in metagenomics, various challenges remain. This review provides insight into advances in metagenomic approaches combined with next-generation sequencing, their recent applications (highlighting emerging ones such as astrobiology, forensic sciences, and SARS-CoV-2 infection diagnosis), and the associated challenges. It further discusses the different types of metagenomics and outlines advancements in bioinformatics tools and their significance in the analysis of metagenomic datasets.
Affiliation(s)
- Niguse K Lema
  - Department of Biotechnology, College of Biological and Chemical Engineering, Addis Ababa Science and Technology University, Addis Ababa, Ethiopia
  - Biotechnology and Bioprocess Center of Excellence, Addis Ababa Science and Technology University, Addis Ababa, Ethiopia
  - Department of Biotechnology, Arba Minch University, Arba Minch, Ethiopia
- Mesfin T Gemeda
  - Department of Biotechnology, College of Biological and Chemical Engineering, Addis Ababa Science and Technology University, Addis Ababa, Ethiopia
  - Biotechnology and Bioprocess Center of Excellence, Addis Ababa Science and Technology University, Addis Ababa, Ethiopia
- Adugna A Woldesemayat
  - Department of Biotechnology, College of Biological and Chemical Engineering, Addis Ababa Science and Technology University, Addis Ababa, Ethiopia
  - Biotechnology and Bioprocess Center of Excellence, Addis Ababa Science and Technology University, Addis Ababa, Ethiopia
|
3
|
Jiang Z, Li X, Guo L. MetaCRS: unsupervised clustering of contigs with the recursive strategy of reducing metagenomic dataset's complexity. BMC Bioinformatics 2022; 22:315. [PMID: 35045830 PMCID: PMC8772042 DOI: 10.1186/s12859-021-04227-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/23/2021] [Accepted: 06/01/2021] [Indexed: 01/02/2023]
Abstract
Background: Metagenomics technology can directly extract microbial genetic material from environmental samples to obtain sequencing reads, which can be further assembled into contigs through assembly tools. Contig clustering methods are subsequently applied to recover complete genomes from environmental samples. The main problem with current clustering methods is that they cannot recover more high-quality genes from complex environments. Firstly, multiple strains exist under the same species, resulting in the assembly of chimeras. Secondly, different strains of the same species are difficult to classify. Thirdly, it is difficult to determine the number of strains during the clustering process. Results: In view of the shortcomings of current clustering methods, we propose an unsupervised clustering method that improves the ability to recover genes from complex environments, together with a new method for selecting the number of strains in a sample during clustering. Sequence composition characteristics (tetranucleotide frequency) and co-abundance are combined to train the probability model for clustering. A new recursive method that continuously reduces the complexity of the samples is proposed to improve the ability to recover genes from complex environments. The new clustering method was tested on both simulated and real metagenomic datasets and compared with five state-of-the-art methods: CONCOCT, Maxbin2.0, MetaBAT, MyCC and COCACOLA. In terms of the number and quality of genes recovered from metagenomic datasets, the results show that our proposed method is more effective. Conclusions: A new contig clustering method is proposed that can recover more high-quality genes from complex environmental samples.
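The abstract above names tetranucleotide frequency as its sequence composition feature. As a minimal sketch of what that feature looks like (not the MetaCRS implementation; the function name `tetranucleotide_frequencies` is hypothetical, and skipping windows with ambiguous bases is an assumption made here), the following computes a normalized 256-dimensional tetramer profile for one contig:

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(contig: str) -> dict[str, float]:
    """Return normalized frequencies of all 256 tetramers in a contig."""
    seq = contig.upper()
    # Count every overlapping 4-base window.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Assumption: windows containing anything other than A, C, G, T
    # (for example N) are ignored rather than imputed.
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}
```

Contigs from the same genome are expected to yield similar profiles, which is why such vectors are typically combined with coverage (co-abundance) before clustering.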
Affiliation(s)
- Zhongjun Jiang
  - College of Information Science and Technology, Ningbo University, Ningbo, 315211, China
- Xiaobo Li
  - College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua, 321004, China
  - College of Engineering, Lishui University, Lishui, 323000, China
- Lijun Guo
  - College of Information Science and Technology, Ningbo University, Ningbo, 315211, China
|
4
|
Bharti R, Grimm DG. Current challenges and best-practice protocols for microbiome analysis. Brief Bioinform 2021; 22:178-193. [PMID: 31848574 PMCID: PMC7820839 DOI: 10.1093/bib/bbz155] [Citation(s) in RCA: 232] [Impact Index Per Article: 77.3] [Received: 09/24/2019] [Revised: 10/23/2019] [Accepted: 11/06/2019] [Indexed: 12/15/2022]
Abstract
Analyzing the microbiome of diverse species and environments using next-generation sequencing techniques has significantly enhanced our understanding of the metabolic, physiological and ecological roles of environmental microorganisms. However, the analysis of the microbiome is affected by experimental conditions (e.g. sequencing errors and genomic repeats) and by computationally intensive and cumbersome downstream analysis (e.g. quality control, assembly, binning and statistical analyses). Moreover, the introduction of new sequencing technologies and protocols has led to a flood of new methodologies, which also have an immediate effect on the results of the analyses. The aim of this work is to review the most important workflows for 16S rRNA sequencing and for shotgun and long-read metagenomics, as well as to provide best-practice protocols on experimental design, sample processing, sequencing, assembly, binning, annotation and visualization. To simplify and standardize the computational analysis, we provide a set of best-practice workflows for 16S rRNA and metagenomic sequencing data (available at https://github.com/grimmlab/MicrobiomeBestPracticeReview).
Affiliation(s)
- Richa Bharti
  - Weihenstephan-Triesdorf University of Applied Sciences and Technical University of Munich, TUM Campus Straubing for Biotechnology and Sustainability, Straubing, Germany
- Dominik G Grimm
  - Weihenstephan-Triesdorf University of Applied Sciences and Technical University of Munich, TUM Campus Straubing for Biotechnology and Sustainability, Straubing, Germany
|
5
|
Liu Y, Hou T, Miao Y, Liu M, Liu F. IM-c-means: a new clustering algorithm for clusters with skewed distributions. Pattern Anal Appl 2020. [DOI: 10.1007/s10044-020-00932-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
6
|
Mallawaarachchi V, Wickramarachchi A, Lin Y. GraphBin: refined binning of metagenomic contigs using assembly graphs. Bioinformatics 2020; 36:3307-3313. [PMID: 32167528 DOI: 10.1093/bioinformatics/btaa180] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 04/30/2019] [Revised: 02/18/2020] [Accepted: 03/10/2020] [Indexed: 12/17/2022]
Abstract
MOTIVATION The field of metagenomics has provided valuable insights into the structure, diversity and ecology within microbial communities. One key step in metagenomics analysis is to assemble reads into longer contigs which are then binned into groups of contigs that belong to different species present in the metagenomic sample. Binning of contigs plays an important role in metagenomics and most available binning algorithms bin contigs using genomic features such as oligonucleotide/k-mer composition and contig coverage. As metagenomic contigs are derived from the assembly process, they are output from the underlying assembly graph which contains valuable connectivity information between contigs that can be used for binning. RESULTS We propose GraphBin, a new binning method that makes use of the assembly graph and applies a label propagation algorithm to refine the binning result of existing tools. We show that GraphBin can make use of the assembly graphs constructed from both the de Bruijn graph and the overlap-layout-consensus approach. Moreover, we demonstrate improved experimental results from GraphBin in terms of identifying mis-binned contigs and binning of contigs discarded by existing binning tools. To the best of our knowledge, this is the first time that the information from the assembly graph has been used in a tool for the binning of metagenomic contigs. AVAILABILITY AND IMPLEMENTATION The source code of GraphBin is available at https://github.com/Vini2/GraphBin. CONTACT vijini.mallawaarachchi@anu.edu.au or yu.lin@anu.edu.au. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
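The idea of refining bins by propagating labels along an assembly graph can be illustrated with a small sketch. This is a generic majority-vote label propagation written for illustration only; it is not GraphBin's actual algorithm, and the function name `propagate_labels`, the edge-list input, and the tie-skipping rule are all assumptions.

```python
from collections import Counter


def propagate_labels(edges, labels, max_rounds=100):
    """Assign bins to unlabelled contigs from their labelled graph neighbours.

    edges  -- iterable of (contig_a, contig_b) links from an assembly graph
    labels -- dict mapping some contigs to an initial bin
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    result = dict(labels)
    for _ in range(max_rounds):
        updates = {}
        for node, adjacent in neighbours.items():
            if node in result:
                continue  # already binned; initial labels are never overwritten
            votes = Counter(result[n] for n in adjacent if n in result)
            ranked = votes.most_common(2)
            # Assumption: a contig is labelled only on a clear majority; ties wait.
            if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
                updates[node] = ranked[0][0]
        if not updates:
            break  # nothing changed, so propagation has converged
        result.update(updates)
    return result
```

For example, with edges `c1-c2`, `c2-c3` and `c4-c5`, and initial labels for `c1` and `c4` only, the unlabelled contigs inherit the bin of the labelled contig in their connected component.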
Affiliation(s)
- Vijini Mallawaarachchi
  - Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra ACT 0200, Australia
- Anuradha Wickramarachchi
  - Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra ACT 0200, Australia
- Yu Lin
  - Research School of Computer Science, College of Engineering and Computer Science, Australian National University, Canberra ACT 0200, Australia
|
7
|
Pérez-Cobas AE, Gomez-Valero L, Buchrieser C. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses. Microb Genom 2020; 6:mgen000409. [PMID: 32706331 PMCID: PMC7641418 DOI: 10.1099/mgen.0.000409] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 12/18/2019] [Accepted: 06/30/2020] [Indexed: 12/23/2022]
Abstract
Metagenomics and marker gene approaches, coupled with high-throughput sequencing technologies, have revolutionized the field of microbial ecology. Metagenomics is a culture-independent method that allows the identification and characterization of organisms from all kinds of samples. Whole-genome shotgun sequencing analyses the total DNA of a chosen sample to determine the presence of micro-organisms from all domains of life and their genomic content. Importantly, the whole-genome shotgun sequencing approach reveals the genomic diversity present, but can also give insights into the functional potential of the micro-organisms identified. The marker gene approach is based on the sequencing of a specific gene region. It allows one to describe the microbial composition based on the taxonomic groups present in the sample. It is frequently used to analyse the biodiversity of microbial ecosystems. Despite its importance, the analysis of metagenomic sequencing and marker gene data is quite a challenge. Here we review the primary workflows and software used for both approaches and discuss the current challenges in the field.
Affiliation(s)
- Ana Elena Pérez-Cobas
  - Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
- Laura Gomez-Valero
  - Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
- Carmen Buchrieser
  - Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
|
8
|
Breitwieser FP, Lu J, Salzberg SL. A review of methods and databases for metagenomic classification and assembly. Brief Bioinform 2019; 20:1125-1136. [PMID: 29028872 PMCID: PMC6781581 DOI: 10.1093/bib/bbx120] [Citation(s) in RCA: 261] [Impact Index Per Article: 52.2] [Received: 06/07/2017] [Revised: 08/22/2017] [Indexed: 12/13/2022]
Abstract
Microbiome research has grown rapidly over the past decade, with a proliferation of new methods that seek to make sense of large, complex data sets. Here, we survey two of the primary types of methods for analyzing microbiome data: read classification and metagenomic assembly, and we review some of the challenges facing these methods. All of the methods rely on public genome databases, and we also discuss the content of these databases and how their quality has a direct impact on our ability to interpret a microbiome sample.
Affiliation(s)
- Steven L Salzberg
  - Corresponding author: Steven L. Salzberg, Center for Computational Biology, Johns Hopkins University, 1900 E. Monument St., Baltimore, MD, 21205, USA
|
9
|
Composition Analysis and Feature Selection of the Oral Microbiota Associated with Periodontal Disease. Biomed Res Int 2018; 2018:3130607. [PMID: 30581850 PMCID: PMC6276491 DOI: 10.1155/2018/3130607] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 05/30/2018] [Revised: 10/10/2018] [Accepted: 11/04/2018] [Indexed: 12/15/2022]
Abstract
Periodontitis is an inflammatory disease involving complex interactions between oral microorganisms and the host immune response. Understanding the structure of the microbiota community associated with periodontitis is essential for improving classifications and diagnoses of various types of periodontal diseases and will facilitate clinical decision-making. In this study, we used a 16S rRNA metagenomics approach to investigate and compare the compositions of the microbiota communities from 76 subgingival plaque samples, including 26 from healthy individuals and 50 from patients with periodontitis. Furthermore, we propose a novel feature selection algorithm for selecting the more informative features from many variables; a combination of these features and machine learning methods was used to construct models for predicting the health status of patients with periodontal disease. We identified a total of 12 phyla, 124 genera, and 355 species and observed differences between health- and periodontitis-associated bacterial communities at all phylogenetic levels. We discovered that the genera Porphyromonas, Treponema, Tannerella, Filifactor, and Aggregatibacter were more abundant in patients with periodontal disease, whereas Streptococcus, Haemophilus, Capnocytophaga, Gemella, Campylobacter, and Granulicatella were found at higher levels in healthy controls. Using our feature selection algorithm, random forests performed better in terms of predictive power than other methods and consumed the least amount of computational time.
|
10
|
Mbareche H, Brisebois E, Veillette M, Duchaine C. Bioaerosol sampling and detection methods based on molecular approaches: No pain no gain. Sci Total Environ 2017; 599-600:2095-2104. [PMID: 28558432 DOI: 10.1016/j.scitotenv.2017.05.076] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 12/01/2016] [Revised: 05/05/2017] [Accepted: 05/08/2017] [Indexed: 05/23/2023]
Abstract
Bioaerosols are among the less studied particles in the environment. The lack of standardization in sampling procedures, difficulties related to the effect of sampling processes on the integrity of microorganisms, and challenges associated with the application of environmental microbiology analyses and molecular and culture methods frighten many young scientists. Every microorganism has its own particularities and acts differently when aerosolized in various conditions. Because the air is an extremely biologically diluted environment, it is necessary to concentrate its content before any analysis is performed. Challenges faced when applying molecular methods to air samples reveal the need for a better standardization of approaches for cell and nucleic acid recovery, the choice of genetic markers, and interpretation of data. This paper presents a few of the limits and difficulties tackled when molecular methods are applied to bioaerosols, suggests some improvements by specifying the critical stages that should be considered when studying the microbial ecology of bioaerosols, and provides thoughtful insights on how to overcome the challenges encountered.
Affiliation(s)
- Hamza Mbareche
  - Département de biochimie, microbiologie et bio-informatique, Faculté des sciences et de génie, Université Laval, Canada; Centre de recherche de l'Institut universitaire de cardiologie et de pneumologie, Université Laval, Canada
- Evelyne Brisebois
  - Département de biochimie, microbiologie et bio-informatique, Faculté des sciences et de génie, Université Laval, Canada; Centre de recherche de l'Institut universitaire de cardiologie et de pneumologie, Université Laval, Canada
- Marc Veillette
  - Centre de recherche de l'Institut universitaire de cardiologie et de pneumologie, Université Laval, Canada
- Caroline Duchaine
  - Département de biochimie, microbiologie et bio-informatique, Faculté des sciences et de génie, Université Laval, Canada; Centre de recherche de l'Institut universitaire de cardiologie et de pneumologie, Université Laval, Canada
|
11
|
Liu Y, Hou T, Kang B, Liu F. Unsupervised Binning of Metagenomic Assembled Contigs Using Improved Fuzzy C-Means Method. IEEE/ACM Trans Comput Biol Bioinform 2017; 14:1459-1467. [PMID: 27295684 DOI: 10.1109/tcbb.2016.2576452] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/06/2023]
Abstract
Metagenomic contig binning is a necessary step of metagenome analysis. After assembly, the number of contigs belonging to different genomes is usually unequal. A metagenomic contig dataset is therefore imbalanced, and the traditional fuzzy c-means method (FCM) handles it poorly. In this paper, we introduce an improved version of the fuzzy c-means method (IFCM) into metagenomic contig binning. First, tetranucleotide frequencies are calculated for every contig. Second, the number of bins is roughly estimated from the distribution of genome lengths of a complete set of non-draft sequenced microbial genomes from NCBI. Then, IFCM is used to cluster DNA contigs with the estimated result. Finally, a clustering validity function is utilized to determine the binning result. We tested this method on one synthetic and two real datasets, and experimental results have shown the effectiveness of this method compared with other tools.
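For orientation, the baseline that the abstract improves on can be sketched in a few lines. The code below is the textbook fuzzy c-means update (soft memberships, then weighted centroids), not the IFCM variant proposed in the paper; the function name `fuzzy_c_means`, the caller-supplied initial centers, and the fixed iteration count are assumptions made for this illustration.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Textbook fuzzy c-means with fuzzifier m (> 1) and given initial centers."""
    centers = [list(c) for c in centers]
    memberships = []
    for _ in range(iterations):
        # Membership of point i in cluster j: 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        memberships = []
        for x in points:
            # Clamp distances to avoid division by zero when a point sits on a center.
            dists = [max(math.dist(x, c), 1e-12) for c in centers]
            memberships.append([
                1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                for dj in dists
            ])
        # Each center becomes the mean of all points weighted by membership^m.
        for j in range(len(centers)):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            centers[j] = [
                sum(w * x[d] for w, x in zip(weights, points)) / total
                for d in range(len(points[0]))
            ]
    return centers, memberships
```

In a binning setting each point would be a contig's tetranucleotide frequency vector. Because every cluster pulls on every point, a few large genomes can drag the centers of small ones, which is the imbalance problem the abstract refers to.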
|
12
|
Wang Y, Wang K, Lu YY, Sun F. Improving contig binning of metagenomic data using $d_2^S$ oligonucleotide frequency dissimilarity. BMC Bioinformatics 2017; 18:425. [PMID: 28931373 PMCID: PMC5607646 DOI: 10.1186/s12859-017-1835-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 05/03/2017] [Accepted: 09/11/2017] [Indexed: 04/27/2023]
Abstract
BACKGROUND Metagenomics sequencing provides deep insights into microbial communities. To investigate their taxonomic structure, binning assembled contigs into discrete clusters is critical. Many binning algorithms have been developed, but their performance is not always satisfactory, especially for complex microbial communities, calling for further development. RESULTS According to previous studies, relative sequence compositions are similar across different regions of the same genome, but they differ between distinct genomes. Generally, current tools have used the normalized frequency of k-tuples directly, but this represents an absolute, not relative, sequence composition. Therefore, we attempted to model contigs using relative k-tuple composition, followed by measuring dissimilarity between contigs using $d_2^S$. The $d_2^S$ was designed to measure the dissimilarity between two long sequences or Next-Generation Sequencing data with the Markov models of the background genomes. This method was effective in revealing group and gradient relationships between genomes, metagenomes and metatranscriptomes. With many binning tools available, we do not try to bin contigs from scratch. Instead, we developed $d_2^S$Bin to adjust contigs among bins based on the output of existing binning tools for a single metagenomic sample. The tool is taxonomy-free and depends only on k-tuples. To evaluate the performance of $d_2^S$Bin, five widely used binning tools with different strategies of sequence composition or the hybrid of sequence composition and abundance were selected to bin six synthetic and real datasets, after which $d_2^S$Bin was applied to adjust the binning results. Our experiments showed that $d_2^S$Bin consistently achieves the best performance with tuple length k = 6 under the independent identically distributed (i.i.d.) background model. Using the metrics of recall, precision and ARI (Adjusted Rand Index), $d_2^S$Bin improves the binning performance in 28 out of 30 testing experiments (6 datasets with 5 binning tools). The $d_2^S$Bin is available at https://github.com/kunWangkun/d2SBin. CONCLUSIONS Experiments showed that $d_2^S$ accurately measures the dissimilarity between contigs of metagenomic reads and that relative sequence composition is more reasonable to bin the contigs. The $d_2^S$Bin can be applied to any existing contig-binning tools for single metagenomic samples to obtain better binning results.
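To make "relative k-tuple composition" concrete, here is a minimal sketch of a background-adjusted k-tuple dissimilarity under an i.i.d. background model, following the form in which this family of measures is usually written in the alignment-free literature: counts are centred by their expected value, then compared with a self-normalized inner product scaled to [0, 1]. It is an illustration under stated assumptions, not the code from the repository above: the helper names `centred_counts` and `d2s`, estimating base probabilities from each sequence itself, and the small default `k` are all choices made here (the abstract reports best results with k = 6). Input sequences are assumed to contain only A, C, G and T.

```python
import math
from collections import Counter
from itertools import product


def centred_counts(seq, k):
    """k-tuple counts minus their expectation under an i.i.d. base model."""
    seq = seq.upper()
    n = len(seq) - k + 1  # number of overlapping k-tuples
    base_counts = Counter(seq)
    total = sum(base_counts[b] for b in "ACGT")
    p = {b: base_counts[b] / total for b in "ACGT"}
    counts = Counter(seq[i:i + k] for i in range(n))
    centred = {}
    for word in map("".join, product("ACGT", repeat=k)):
        expected = n * math.prod(p[b] for b in word)
        centred[word] = counts[word] - expected
    return centred


def d2s(seq_a, seq_b, k=2):
    """Dissimilarity in [0, 1]; 0 means identical relative composition."""
    x = centred_counts(seq_a, k)
    y = centred_counts(seq_b, k)
    num = norm_x = norm_y = 0.0
    for w in x:
        scale = math.sqrt(x[w] ** 2 + y[w] ** 2)
        if scale == 0.0:
            continue  # word carries no signal in either sequence
        num += x[w] * y[w] / scale
        norm_x += x[w] ** 2 / scale
        norm_y += y[w] ** 2 / scale
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("dissimilarity is undefined for a sequence with no signal")
    return 0.5 * (1.0 - num / (math.sqrt(norm_x) * math.sqrt(norm_y)))
```

A refinement step in the spirit of the abstract could then move a contig to the bin whose members it is least dissimilar to, but how bins are adjusted in practice is specified by the paper, not by this sketch.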
Affiliation(s)
- Ying Wang
  - Department of Automation, Xiamen University, Xiamen, Fujian 361005 China
- Kun Wang
  - Department of Automation, Xiamen University, Xiamen, Fujian 361005 China
- Yang Young Lu
  - Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, CA 90089 USA
- Fengzhu Sun
  - Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, CA 90089 USA
  - Center for Computational Systems Biology, Fudan University, Shanghai, 200433 China
|
13
|
Abstract
Background A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed. Results We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length ℓ each, on an alphabet of total size σ, our algorithms take O(n(t+logσ)) time and just 2n+o(n)+O(max{ℓσlogn,K logm}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure. Conclusions Our experimental results show that our algorithms are practical, they can exploit multiple cores by a parallel traversal of the suffix-link tree, and they are competitive both in space and in time with the state of the art.
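The role of the union-find structure in read clustering can be shown with a deliberately simplified sketch: reads that share at least one k-mer are merged into the same set. This is not the space-efficient framework described above, which works on a bidirectional Burrows-Wheeler index; a plain dictionary of k-mers stands in for the index here, and the names `UnionFind` and `cluster_reads_by_shared_kmers` are hypothetical.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by rank."""

    def __init__(self, size):
        self.parent = list(range(size))
        self.rank = [0] * size

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        if self.rank[root_a] < self.rank[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_a] += 1


def cluster_reads_by_shared_kmers(reads, k):
    """Group read indices so that reads sharing any k-mer end up together."""
    sets = UnionFind(len(reads))
    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                sets.union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx
    clusters = {}
    for idx in range(len(reads)):
        clusters.setdefault(sets.find(idx), []).append(idx)
    return sorted(clusters.values())
```

The dictionary makes this sketch memory-hungry on real samples, which is exactly the cost the indexed approach in the abstract is designed to avoid.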
|
14
|
Sedlar K, Kupkova K, Provaznik I. Bioinformatics strategies for taxonomy independent binning and visualization of sequences in shotgun metagenomics. Comput Struct Biotechnol J 2016; 15:48-55. [PMID: 27980708 PMCID: PMC5148923 DOI: 10.1016/j.csbj.2016.11.005] [Citation(s) in RCA: 70] [Impact Index Per Article: 8.8] [Received: 08/30/2016] [Revised: 11/24/2016] [Accepted: 11/26/2016] [Indexed: 12/11/2022]
Abstract
One of the main steps in a study of microbial communities is resolving their composition, diversity and function. In the past, these issues were mostly addressed by amplicon sequencing of a target gene because of its reasonable price and the easier computational postprocessing of the data. With the advancement of sequencing techniques, the main focus has shifted to whole metagenome shotgun sequencing, which allows much more detailed analysis of the metagenomic data, including reconstruction of novel microbial genomes and insight into the genetic potential and metabolic capacities of whole environments. On the other hand, the output of whole metagenome shotgun sequencing is a mixture of short DNA fragments belonging to various genomes; this approach therefore requires more sophisticated computational algorithms for clustering related sequences, commonly referred to as sequence binning. There are currently two types of binning methods: taxonomy dependent and taxonomy independent. The first type classifies the DNA fragments by performing a standard homology inference against a reference database, while the latter performs reference-free binning by applying clustering techniques to features extracted from the sequences. In this review, we describe the strategies within the second approach. Although these strategies do not require prior knowledge, they place higher demands on the length of sequences. Besides their basic principle, an overview of particular methods and tools is provided. Furthermore, the review covers the use of the methods in relation to sequence length and discusses the need for metagenomic data preprocessing in the form of initial assembly prior to binning.
Affiliation(s)
- Karel Sedlar
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, Brno, Czech Republic
|
15
|
Brittnacher MJ, Heltshe SL, Hayden HS, Radey MC, Weiss EJ, Damman CJ, Zisman TL, Suskind DL, Miller SI. GUTSS: An Alignment-Free Sequence Comparison Method for Use in Human Intestinal Microbiome and Fecal Microbiota Transplantation Analysis. PLoS One 2016; 11:e0158897. [PMID: 27391011 PMCID: PMC4938407 DOI: 10.1371/journal.pone.0158897] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 03/10/2016] [Accepted: 06/23/2016] [Indexed: 12/14/2022]
Abstract
BACKGROUND Comparative analysis of gut microbiomes in clinical studies of human diseases typically relies on identification and quantification of species or genes. In addition to exploring specific functional characteristics of the microbiome and potential significance of species diversity or expansion, microbiome similarity is also calculated to study change in response to therapies directed at altering the microbiome. Established ecological measures of similarity can be constructed from species abundances; however, methods for calculating these commonly used ecological measures of similarity directly from whole genome shotgun (WGS) metagenomic sequence are lacking. RESULTS We present an alignment-free method for calculating similarity of WGS metagenomic sequences that is analogous to the Bray-Curtis index for species, implemented by the General Utility for Testing Sequence Similarity (GUTSS) software application. This method was applied to intestinal microbiomes of healthy young children to measure developmental changes toward an adult microbiome during the first 3 years of life. We also calculate similarity of donor and recipient microbiomes to measure establishment, or engraftment, of donor microbiota in fecal microbiota transplantation (FMT) studies focused on mild to moderate Crohn's disease. We show how a relative index of similarity to donor can be calculated as a measure of change in a patient's microbiome toward that of the donor in response to FMT. CONCLUSION Because clinical efficacy of the transplant procedure cannot be fully evaluated without analysis methods to quantify actual FMT engraftment, we developed a method for detecting change in the gut microbiome that is independent of species identification and database bias, sensitive to changes in relative abundance of the microbial constituents, and can be formulated as an index for correlating engraftment success with clinical measures of disease. More generally, this method may be applied to clinical evaluation of human microbiomes and provide potential diagnostic determination of individuals who may be candidates for specific therapies directed at alteration of the microbiome.
Affiliation(s)
- Mitchell J. Brittnacher
  - Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- Sonya L. Heltshe
  - Department of Pediatrics, University of Washington, Seattle, Washington, United States of America
  - Seattle Children's Research Institute, Seattle, Washington, United States of America
- Hillary S. Hayden
  - Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- Matthew C. Radey
  - Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- Eli J. Weiss
  - Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- Christopher J. Damman
  - Division of Gastroenterology, University of Washington, Seattle, Washington, United States of America
- Timothy L. Zisman
  - Division of Gastroenterology, University of Washington, Seattle, Washington, United States of America
- David L. Suskind
  - Department of Pediatrics, University of Washington, Seattle, Washington, United States of America
  - Seattle Children’s Hospital, Seattle, Washington, United States of America
- Samuel I. Miller
  - Department of Microbiology, University of Washington, Seattle, Washington, United States of America
  - Department of Medicine, University of Washington, Seattle, Washington, United States of America
  - Department of Immunology, University of Washington, Seattle, Washington, United States of America
  - Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
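To illustrate the idea in the GUTSS abstract above, the following sketch computes a Bray-Curtis-style similarity between two samples from k-mer counts rather than species abundances. It is a minimal sketch under stated assumptions, not the GUTSS implementation: the function names `kmer_counts` and `bray_curtis_similarity`, the choice of k = 4, and the toy reads are all hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k=4):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def bray_curtis_similarity(counts_a, counts_b):
    """Bray-Curtis similarity: 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b))."""
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 0.0
    # Counter returns 0 for k-mers missing from counts_b.
    shared = sum(min(counts_a[kmer], counts_b[kmer]) for kmer in counts_a)
    return 2.0 * shared / total


# Hypothetical toy reads; real input would be WGS metagenomic reads.
donor = kmer_counts(["ACGTACGTAC", "GGGTTTACGT"])
recipient = kmer_counts(["ACGTACGTAC", "CCCCAAAATT"])
similarity = bray_curtis_similarity(donor, recipient)
```

A value of 1 means identical k-mer profiles and 0 means no shared k-mers, so a recipient's similarity to the donor could be tracked over time as a rough proxy for the engraftment index described in the abstract.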
|
16
|
Saw JH, Spang A, Zaremba-Niedzwiedzka K, Juzokaite L, Dodsworth JA, Murugapiran SK, Colman DR, Takacs-Vesbach C, Hedlund BP, Guy L, Ettema TJG. Exploring microbial dark matter to resolve the deep archaeal ancestry of eukaryotes. Philos Trans R Soc Lond B Biol Sci 2016; 370:20140328. [PMID: 26323759 PMCID: PMC4571567 DOI: 10.1098/rstb.2014.0328] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Indexed: 12/26/2022] Open
Abstract
The origin of eukaryotes represents an enigmatic puzzle, which is still lacking a number of essential pieces. Whereas it is currently accepted that the process of eukaryogenesis involved an interplay between a host cell and an alphaproteobacterial endosymbiont, we currently lack detailed information regarding the identity and nature of these players. A number of studies have provided increasing support for the emergence of the eukaryotic host cell from within the archaeal domain of life, displaying a specific affiliation with the archaeal TACK superphylum. Recent studies have shown that genomic exploration of yet-uncultivated archaea, the so-called archaeal ‘dark matter’, is able to provide unprecedented insights into the process of eukaryogenesis. Here, we provide an overview of state-of-the-art cultivation-independent approaches, and demonstrate how these methods were used to obtain draft genome sequences of several novel members of the TACK superphylum, including Lokiarchaeum, two representatives of the Miscellaneous Crenarchaeotal Group (Bathyarchaeota), and a Korarchaeum-related lineage. The maturation of cultivation-independent genomics approaches, as well as future developments in next-generation sequencing technologies, will revolutionize our current view of microbial evolution and diversity, and provide profound new insights into the early evolution of life, including the enigmatic origin of the eukaryotic cell.
Affiliation(s)
- Jimmy H Saw
  - Department of Cell and Molecular Biology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Anja Spang
  - Department of Cell and Molecular Biology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Lina Juzokaite
  - Department of Cell and Molecular Biology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Jeremy A Dodsworth
  - School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA
- Dan R Colman
  - Department of Biology, University of New Mexico, Albuquerque, NM, USA
- Brian P Hedlund
  - School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA
- Lionel Guy
  - Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
- Thijs J G Ettema
  - Department of Cell and Molecular Biology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
|
17
|
Weitschek E, Cunial F, Felici G. LAF: Logic Alignment Free and its application to bacterial genomes classification. BioData Min 2015; 8:39. [PMID: 26664519 PMCID: PMC4673791 DOI: 10.1186/s13040-015-0073-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 03/30/2015] [Accepted: 11/30/2015] [Indexed: 12/24/2022] Open
Abstract
Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (k-mers) that occur in the analyzed sequences. In this paper, we present Logic Alignment Free (LAF), a method that combines alignment-free techniques and rule-based classification algorithms in order to assign biological samples to their taxa. This method searches for a minimal subset of k-mers whose relative frequencies are used to build classification models as disjunctive-normal-form logic formulas (if-then rules). We apply LAF successfully to the classification of bacterial genomes to their corresponding taxonomy. In particular, we succeed in obtaining reliable classification at different taxonomic levels by extracting a handful of rules, each one based on the frequency of just a few k-mers. State-of-the-art methods to adjust the frequency of k-mers to the character distribution of the underlying genomes have negligible impact on classification performance, suggesting that the signal of each class is strong and that LAF is effective in identifying it.
Affiliation(s)
- Emanuel Weitschek
  - Department of Engineering, Uninettuno International University, Corso Vittorio Emanuele II, 39, Rome, 00186 Italy
  - Institute of Systems Analysis and Computer Science "A. Ruberti", National Research Council, Via dei Taurini 19, Rome, 00185 Italy
- Fabio Cunial
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Helsinki, FI-00014 Finland
- Giovanni Felici
  - Institute of Systems Analysis and Computer Science "A. Ruberti", National Research Council, Via dei Taurini 19, Rome, 00185 Italy
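The LAF abstract above describes classifying genomes with if-then rules over the relative frequencies of a few k-mers. The sketch below shows what evaluating such rules could look like. It is an illustration only: the functions `kmer_frequencies` and `classify`, the rule encoding, the thresholds, and the taxon labels are hypothetical, and the rule-learning step that LAF performs is not reproduced here.

```python
from collections import Counter


def kmer_frequencies(sequence, k=3):
    """Return relative frequencies of the k-mers occurring in a sequence."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def classify(freqs, rules, default="unassigned"):
    """Apply if-then rules in order; each rule is (label, [(kmer, op, threshold), ...]).

    A rule fires only when all of its conditions hold (a conjunction); trying
    the rules one after another acts as the disjunction between them.
    """
    for label, conditions in rules:
        if all(
            (freqs.get(kmer, 0.0) >= thr) if op == ">=" else (freqs.get(kmer, 0.0) < thr)
            for kmer, op, thr in conditions
        ):
            return label
    return default


# Hypothetical hand-written rules; a real system would learn them from training genomes.
RULES = [
    ("taxon_A", [("GCG", ">=", 0.20)]),
    ("taxon_B", [("ATA", ">=", 0.20), ("GCG", "<", 0.05)]),
]
```

For example, `classify(kmer_frequencies("GCGCGCGCGC"), RULES)` matches the first rule because half of the 3-mers in that sequence are `GCG`.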
|
18
|
Ju F, Zhang T. Experimental Design and Bioinformatics Analysis for the Application of Metagenomics in Environmental Sciences and Biotechnology. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2015; 49:12628-40. [PMID: 26451629 DOI: 10.1021/acs.est.5b03719] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 05/24/2023]
Abstract
Recent advances in DNA sequencing technologies have prompted the widespread application of metagenomics for the investigation of novel bioresources (e.g., industrial enzymes and bioactive molecules) and unknown biohazards (e.g., pathogens and antibiotic resistance genes) in natural and engineered microbial systems across multiple disciplines. This review discusses the rigorous experimental design and sample preparation in the context of applying metagenomics in environmental sciences and biotechnology. Moreover, this review summarizes the principles, methodologies, and state-of-the-art bioinformatics procedures, tools and database resources for metagenomics applications and discusses two popular strategies (analysis of unassembled reads versus assembled contigs/draft genomes) for quantitative or qualitative insights of microbial community structure and functions. Overall, this review aims to facilitate more extensive application of metagenomics in the investigation of uncultured microorganisms, novel enzymes, microbe-environment interactions, and biohazards in biotechnological applications where microbial communities are engineered for bioenergy production, wastewater treatment, and bioremediation.
Affiliation(s)
- Feng Ju
  - Environmental Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China
- Tong Zhang
  - Environmental Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China
|
19
|
Wang M, Doak TG, Ye Y. Subtractive assembly for comparative metagenomics, and its application to type 2 diabetes metagenomes. Genome Biol 2015; 16:243. [PMID: 26527161 PMCID: PMC4630832 DOI: 10.1186/s13059-015-0804-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 08/22/2014] [Accepted: 10/09/2015] [Indexed: 12/18/2022] Open
Abstract
Comparative metagenomics remains challenging due to the size and complexity of metagenomic datasets. Here we introduce subtractive assembly, a de novo assembly approach for comparative metagenomics that directly assembles only the differential reads that distinguish between two groups of metagenomes. Using simulated datasets, we show it improves both the efficiency of the assembly and the assembly quality of the differential genomes and genes. Further, its application to type 2 diabetes (T2D) metagenomic datasets reveals clear signatures of the T2D gut microbiome, revealing new phylogenetic and functional features of the gut microbial communities associated with T2D.
Affiliation(s)
- Mingjie Wang
  - School of Informatics and Computing, Indiana University, Bloomington, IN, 47405, USA
- Thomas G Doak
  - Department of Biology, Indiana University, Bloomington, IN, 47405, USA
  - National Center for Genome Analysis Support, Indiana University, Bloomington, IN, 47401, USA
- Yuzhen Ye
  - School of Informatics and Computing, Indiana University, Bloomington, IN, 47405, USA
|
20
|
Wang Z, Liu L, Guo F, Zhang T. Deciphering Cyanide-Degrading Potential of Bacterial Community Associated with the Coking Wastewater Treatment Plant with a Novel Draft Genome. MICROBIAL ECOLOGY 2015; 70:701-709. [PMID: 25910603 DOI: 10.1007/s00248-015-0611-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/14/2014] [Accepted: 03/29/2015] [Indexed: 06/04/2023]
Abstract
Biotreatment processes fed with coking wastewater often encounter insufficient removal of pollutants, such as ammonia, phenols, and polycyclic aromatic hydrocarbons (PAHs), especially for cyanides. However, only a limited number of bacterial species in pure cultures have been confirmed to metabolize cyanides, which hinders the improvement of these processes. In this study, a microbial community of activated sludge enriched in a coking wastewater treatment plant was analyzed using 454 pyrosequencing and Illumina sequencing to characterize the potential cyanide-degrading bacteria. According to the classification of these pyro-tags, targeting V3/V4 regions of 16S rRNA gene, half of them were assigned to the family Xanthomonadaceae, implying that Xanthomonadaceae bacteria are well-adapted to coking wastewater. A nearly complete draft genome of the dominant bacterium was reconstructed from metagenome of this community to explore cyanide metabolism based on analysis of the genome. The assembled 16S rRNA gene from this draft genome showed that this bacterium was a novel species of Thermomonas within Xanthomonadaceae, which was further verified by comparative genomics. The annotation using KEGG and Pfam identified genes related to cyanide metabolism, including genes responsible for the iron-harvesting system, cyanide-insensitive terminal oxidase, cyanide hydrolase/nitrilase, and thiosulfate:cyanide transferase. Phylogenetic analysis showed that these genes had homologs in previously identified genomes of bacteria within Xanthomonadaceae and even presented similar gene cassettes, thus implying an inherent cyanide-decomposing potential. The findings of this study expand our knowledge about the bacterial degradation of cyanide compounds and will be helpful in the remediation of cyanides contamination.
Affiliation(s)
- Zhiping Wang
  - Environmental Biotechnology Laboratory, The University of Hong Kong, Pok Fu Lam, Hong Kong, Hong Kong
  - School of Environmental Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Lili Liu
  - Environmental Biotechnology Laboratory, The University of Hong Kong, Pok Fu Lam, Hong Kong, Hong Kong
  - State Environmental Protection Key Laboratory of Environmental Risk Assessment and Control on Chemical Process, East China University of Science and Technology, Shanghai, China
- Feng Guo
  - Environmental Biotechnology Laboratory, The University of Hong Kong, Pok Fu Lam, Hong Kong, Hong Kong
- Tong Zhang
  - Environmental Biotechnology Laboratory, The University of Hong Kong, Pok Fu Lam, Hong Kong, Hong Kong
|
21
|
MetaObtainer: A Tool for Obtaining Specified Species from Metagenomic Reads of Next-generation Sequencing. Interdiscip Sci 2015; 7:405-13. [PMID: 26293485 DOI: 10.1007/s12539-015-0281-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/02/2014] [Revised: 07/26/2014] [Accepted: 08/07/2014] [Indexed: 10/23/2022]
Abstract
Read classification is an important fundamental problem in metagenomics. With the development of next-generation sequencing (NGS), metagenome samples can be generated with much less money and time. However, the short reads generated by next-generation sequencing make read classification much more difficult than before. None of the existing tools can assign NGS short reads to each genome accurately, which limits their use in real applications. Fortunately, in many applications there is little point in separating all the species in the metagenome sample from each other, because we usually focus only on some specified species categories in the sample and do not care about the others. No existing tool is designed specifically for obtaining specified species from short metagenome reads generated by next-generation sequencing. In this paper, we propose a tool named MetaObtainer to obtain the specified species from next-generation sequencing short reads. The tool combines some of the newest technologies for processing short reads, so it can perform better than other tools. It can (1) deal with next-generation sequencing reads shorter than 100 bp with very high accuracy (both precision and recall are above 90%); (2) find unknown species using the reference genomes of species that are similar to them; (3) perform well when reads of the specified species are very few in the dataset; (4) handle genomes of similar abundance levels as well as different abundance levels (1:10); and (5) obtain multiple species categories from a metagenome sample.
|
22
|
Diversity and functions of bacterial community in drinking water biofilms revealed by high-throughput sequencing. Sci Rep 2015; 5:10044. [PMID: 26067561 PMCID: PMC4464384 DOI: 10.1038/srep10044] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 05/27/2014] [Accepted: 03/17/2015] [Indexed: 12/11/2022] Open
Abstract
The development of biofilms in drinking water (DW) systems may cause various problems to water quality. To investigate the community structure of biofilms on different pipe materials and the global/specific metabolic functions of DW biofilms, PCR-based 454 pyrosequencing data for 16S rRNA genes and Illumina metagenomic data were generated and analysed. Considerable differences in bacterial diversity and taxonomic structure were identified between biofilms formed on stainless steel and biofilms formed on plastics, indicating that the metallic materials facilitate the formation of higher diversity biofilms. Moreover, variations in several dominant genera were observed during biofilm formation. Based on PCA analysis, the global functions in the DW biofilms were similar to other DW metagenomes. Beyond the global functions, the occurrences and abundances of specific protective genes involved in the glutathione metabolism, the SoxRS system, the OxyR system, RpoS regulated genes, and the production/degradation of extracellular polymeric substances were also evaluated. A near-complete and low-contamination draft genome was constructed from the metagenome of the DW biofilm, based on the coverage and tetranucleotide frequencies, and identified as a Bradyrhizobiaceae-like bacterium according to a phylogenetic analysis. Our findings provide new insight into DW biofilms, especially in terms of their metabolic functions.
|
23
|
Zhang R, Cheng Z, Guan J, Zhou S. Exploiting topic modeling to boost metagenomic reads binning. BMC Bioinformatics 2015; 16 Suppl 5:S2. [PMID: 25859745 PMCID: PMC4402587 DOI: 10.1186/1471-2105-16-s5-s2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND With the rapid development of high-throughput technologies, researchers can sequence the whole metagenome of a microbial community sampled directly from the environment. The assignment of these metagenomic reads into different species or taxonomical classes is a vital step for metagenomic analysis, which is referred to as binning of metagenomic data. RESULTS In this paper, we propose a new method TM-MCluster for binning metagenomic reads. First, we represent each metagenomic read as a set of "k-mers" with their frequencies occurring in the read. Then, we employ a probabilistic topic model -- the Latent Dirichlet Allocation (LDA) model to the reads, which generates a number of hidden "topics" such that each read can be represented by a distribution vector of the generated topics. Finally, as in the MCluster method, we apply SKWIC -- a variant of the classical K-means algorithm with automatic feature weighting mechanism to cluster these reads represented by topic distributions. CONCLUSIONS Experiments show that the new method TM-MCluster outperforms major existing methods, including AbundanceBin, MetaCluster 3.0/5.0 and MCluster. This result indicates that the exploitation of topic modeling can effectively improve the binning performance of metagenomic reads.
|
24
|
Ghosh TS, Mehra V, Mande SS. Grid-Assembly: An oligonucleotide composition-based partitioning strategy to aid metagenomic sequence assembly. J Bioinform Comput Biol 2015; 13:1541004. [PMID: 25790784 DOI: 10.1142/s0219720015410048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/23/2023]
Abstract
The metagenomics approach involves extraction, sequencing and characterization of the genomic content of the entire community of microbes present in a given environment. In contrast to genomic data, accurate assembly of metagenomic sequences is a challenging task. Given the huge volume and the diverse taxonomic origin of metagenomic sequences, direct application of single-genome assembly methods to metagenomes is likely not only to lead to an immense increase in requirements of computational infrastructure, but also to result in the formation of chimeric contigs. A strategy to address the above challenge would be to partition metagenomic sequence datasets into clusters and assemble separately the sequences in individual clusters using any single-genome assembly method. The current study presents such an approach that uses tetranucleotide usage patterns to first represent sequences as points in a three-dimensional (3D) space. The 3D space is subsequently partitioned into "Grids". Sequences within overlapping grids are then progressively assembled using any available assembler. We demonstrate the applicability of the current Grid-Assembly method using various categories of assemblers as well as different simulated metagenomic datasets. Validation results indicate that the Grid-Assembly approach helps in improving the overall quality of assembly, in terms of the purity and volume of the assembled contigs.
Affiliation(s)
- Tarini Shankar Ghosh
  - Biosciences R&D Division, TCS Innovation Labs, 54-B Hadapsar Industrial Estate, Pune, Maharashtra 411013, India
|
25
|
Vinh LV, Lang TV, Binh LT, Hoai TV. A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads. Algorithms Mol Biol 2015; 10:2. [PMID: 25648210 PMCID: PMC4304631 DOI: 10.1186/s13015-014-0030-4] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 07/10/2014] [Accepted: 10/20/2014] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Metagenomics is the study of genetic materials derived directly from complex microbial samples, instead of from culture. One of the crucial steps in metagenomic analysis, referred to as "binning", is to separate reads into clusters that represent genomes from closely related organisms. Among the existing binning methods, unsupervised methods base the classification on features extracted from reads, and are especially advantageous when reference database availability is limited. However, their performance, under various aspects, is still being investigated by recent theoretical and empirical studies. The one addressed in this paper is among those efforts to enhance the accuracy of the classification. RESULTS This paper presents an unsupervised algorithm, called BiMeta, for binning of reads from different species in a metagenomic dataset. The algorithm consists of two phases. In the first phase of the algorithm, reads are placed into groups based on overlap information between the reads. The second phase merges the groups by using an observation on the l-mer frequency distribution of sets of non-overlapping reads. The experimental results on simulated and real datasets showed that BiMeta outperforms three state-of-the-art binning algorithms for both short and long reads (≥700 bp) datasets. CONCLUSIONS This paper developed a novel and efficient algorithm for binning of metagenomic reads, which does not require any reference database. The software implementing the algorithm and all test datasets mentioned in this paper can be downloaded at http://it.hcmute.edu.vn/bioinfo/bimeta/index.htm.
Affiliation(s)
- Le Van Vinh
  - Faculty of Computer Science and Engineering, HCMC University of Technology, 268 Ly Thuong Kiet, Q10, Ho Chi Minh City, Vietnam
- Tran Van Lang
  - Institute of Applied Mechanics and Informatics, Vietnam Academy of Science and Technology (VAST), 01 Mac Dinh Chi, Q1, Ho Chi Minh City, Vietnam
  - Faculty of Information Technology, Lac Hong University, 10 Huynh Van Nghe, Bien Hoa, Dong Nai, Vietnam
- Le Thanh Binh
  - Institute of Biotechnology, Vietnam Academy of Science and Technology (VAST), 18 Hoang Quoc Viet, Cau Giay, Ha Noi, Vietnam
- Tran Van Hoai
  - Faculty of Computer Science and Engineering, HCMC University of Technology, 268 Ly Thuong Kiet, Q10, Ho Chi Minh City, Vietnam
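The first phase described in the BiMeta abstract above, grouping reads by overlap information, can be sketched with a union-find structure. This is a simplified illustration, not the published algorithm: it assumes that two reads "overlap" when they share at least one exact l-mer, the function name `group_reads_by_shared_lmers` and the parameter `lmer_len` are hypothetical, and the second phase (merging groups by l-mer frequency distribution) is not shown.

```python
def group_reads_by_shared_lmers(reads, lmer_len=8):
    """Group reads transitively: two reads join a group if they share an l-mer."""
    parent = list(range(len(reads)))

    def find(i):
        # Path halving keeps the union-find trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # l-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        seq = read.upper()
        for pos in range(len(seq) - lmer_len + 1):
            lmer = seq[pos:pos + lmer_len]
            if lmer in first_seen:
                root_a, root_b = find(first_seen[lmer]), find(idx)
                if root_a != root_b:
                    parent[root_b] = root_a  # merge the two groups
            else:
                first_seen[lmer] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Hypothetical toy reads: reads 0, 1 and 3 are chained by shared 8-mers.
toy_reads = ["AAAACCCCGGGG", "CCCCGGGGTTTT", "ACGTACGTACGT", "GGGGTTTTACAC"]
toy_groups = group_reads_by_shared_lmers(toy_reads)
```

Because grouping is transitive, reads 0 and 3 end up together even though they share no l-mer directly; each resulting group would then be treated as a single unit when l-mer frequency profiles are compared.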
|
26
|
Abram F. Systems-based approaches to unravel multi-species microbial community functioning. Comput Struct Biotechnol J 2014; 13:24-32. [PMID: 25750697 PMCID: PMC4348430 DOI: 10.1016/j.csbj.2014.11.009] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Received: 09/30/2014] [Revised: 11/25/2014] [Accepted: 11/26/2014] [Indexed: 01/24/2023] Open
Abstract
Some of the most transformative discoveries promising to enable the resolution of this century's grand societal challenges will most likely arise from environmental science and particularly environmental microbiology and biotechnology. Understanding how microbes interact in situ, and how microbial communities respond to environmental changes remains an enormous challenge for science. Systems biology offers a powerful experimental strategy to tackle the exciting task of deciphering microbial interactions. In this framework, entire microbial communities are considered as metaorganisms and each level of biological information (DNA, RNA, proteins and metabolites) is investigated along with in situ environmental characteristics. In this way, systems biology can help unravel the interactions between the different parts of an ecosystem ultimately responsible for its emergent properties. Indeed each level of biological information provides a different level of characterisation of the microbial communities. Metagenomics, metatranscriptomics, metaproteomics, metabolomics and SIP-omics can be employed to investigate collectively microbial community structure, potential, function, activity and interactions. Omics approaches are enabled by high-throughput 21st century technologies and this review will discuss how their implementation has revolutionised our understanding of microbial communities.
Affiliation(s)
- Florence Abram
  - Functional Environmental Microbiology, School of Natural Sciences, National University of Ireland Galway, University Road, Galway, Ireland
|
27
|
Wang Z, Guo F, Mao Y, Xia Y, Zhang T. Metabolic characteristics of a glycogen-accumulating organism in Defluviicoccus cluster II revealed by comparative genomics. MICROBIAL ECOLOGY 2014; 68:716-728. [PMID: 24889288 DOI: 10.1007/s00248-014-0440-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 03/30/2014] [Accepted: 05/20/2014] [Indexed: 06/03/2023]
Abstract
Glycogen-accumulating organisms (GAOs) may compete with phosphate-accumulating organisms (PAOs) for short-chain fatty acids (VFAs) in anaerobic polyhydroxyalkanoates (PHA) synthesis, but without subsequent aerobic polyphosphate accumulation in the enhanced biological phosphorus removal (EBPR) process, thus deteriorating the EBPR process. They are detected frequently in deteriorated EBPR processes, but their metabolism is still far from understood because pure cultures are seldom available. In this study, a nearly complete draft genome of a GAO in Defluviicoccus cluster II, GAO-HK, is recruited from the metagenome of activated sludge in a full-scale industrial anoxic/aerobic wastewater plant. Comparative genomics reveals similar metabolisms of PHA and glycogen in the GAOs GAO-HK, Defluviicoccus tetraformis TFO71 (TFO71) and Competibacter phosphatis clade IIA (CPIIA), and the PAOs Accumulibacter clade IIA UW-1 (UW-1) and Tetrasphaera elongata Lp2 (Lp2). Although there are similar gene cassettes related to polyphosphate metabolism in these GAOs and PAOs, especially for Defluviicoccus-related bacteria and UW-1, ppk1 in GAOs diverges from those in the identified PAOs, implying a difference in polyphosphate metabolism between GAOs and PAOs. Additionally, genes related to dissimilatory denitrification are absent in TFO71 and GAO-HK, implying that additional nitrate or nitrite may favor PAOs over Defluviicoccus-related GAOs. Therefore, PAOs suffering from competition with Defluviicoccus-related GAOs might be rescued with additional nitrate/nitrite, which is important for improving the stability of EBPR processes.
Affiliation(s)
- Zhiping Wang
  - Environmental Biotechnology Laboratory, The University of Hong Kong, Hong Kong, SAR, China
|
28
|
Wang Z, Guo F, Liu L, Zhang T. Evidence of carbon fixation pathway in a bacterium from candidate phylum SBR1093 revealed with genomic analysis. PLoS One 2014; 9:e109571. [PMID: 25310003 PMCID: PMC4195664 DOI: 10.1371/journal.pone.0109571] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 05/18/2014] [Accepted: 08/11/2014] [Indexed: 11/20/2022] Open
Abstract
Autotrophic CO2 fixation is the most important biotransformation process in the biosphere. Research focusing on the diversity and distribution of relevant autotrophs is significant to our comprehension of the biosphere. In this study, a draft genome of a bacterium from candidate phylum SBR1093 was reconstructed with the metagenome of an industrial activated sludge. Based on comparative genomics, this autotrophy may occur via a newly discovered carbon fixation path, the hydroxypropionate-hydroxybutyrate (HPHB) cycle, which was demonstrated in a previous work to be uniquely possessed by some genera from Archaea. This bacterium possesses all of the thirteen enzymes required for the HPHB cycle; these enzymes share 30∼50% identity with those in the autotrophic species of Archaea that undergo the HPHB cycle and 30∼80% identity with the corresponding enzymes of the mixotrophic species within Bradyrhizobiaceae. Thus, this bacterium might have an autotrophic growth mode in certain conditions. A phylogenetic analysis based on the 16S rRNA gene reveals that the phylotypes within candidate phylum SBR1093 are primarily clustered into 5 clades with a shallow branching pattern. This bacterium is clustered with phylotypes from organically contaminated environments, implying a demand for organics in heterotrophic metabolism. Considering the types of regulators, such as FnR, Fur, and ArsR, this bacterium might be a facultative aerobic mixotroph with potential multi-antibiotic and heavy metal resistances. This is the first report on Bacteria that may perform potential carbon fixation via the HPHB cycle, thus may expand our knowledge of the distribution and importance of the HPHB cycle in the biosphere.
Affiliation(s)
- Zhiping Wang
  - Environmental Biotechnology Laboratory, The University of Hong Kong, Hong Kong
  - School of Environmental Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Feng Guo
  - Environmental Biotechnology Laboratory, The University of Hong Kong, Hong Kong
- Lili Liu
  - Environmental Biotechnology Laboratory, The University of Hong Kong, Hong Kong
  - State Environmental Protection Key Laboratory of Environmental Risk Assessment and Control on Chemical Process, East China University of Science and Technology, Shanghai, China
- Tong Zhang
  - Environmental Biotechnology Laboratory, The University of Hong Kong, Hong Kong
|
29
|
Budowle B, Connell ND, Bielecka-Oder A, Colwell RR, Corbett CR, Fletcher J, Forsman M, Kadavy DR, Markotic A, Morse SA, Murch RS, Sajantila A, Schmedes SE, Ternus KL, Turner SD, Minot S. Validation of high throughput sequencing and microbial forensics applications. Investigative Genetics 2014; 5:9. [PMID: 25101166 PMCID: PMC4123828 DOI: 10.1186/2041-2223-5-9] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Received: 05/08/2014] [Accepted: 07/09/2014] [Indexed: 01/29/2023]
Abstract
High throughput sequencing (HTS) generates large amounts of high quality sequence data for microbial genomics. The value of HTS for microbial forensics is the speed at which evidence can be collected and the power to characterize microbial-related evidence to solve biocrimes and bioterrorist events. As HTS technologies continue to improve, they provide increasingly powerful sets of tools to support the entire field of microbial forensics. Accurate, credible results allow analysis and interpretation, significantly influencing the course and/or focus of an investigation, and can impact the response of the government to an attack having individual, political, economic or military consequences. Interpretation of the results of microbial forensic analyses relies on understanding the performance and limitations of HTS methods, including analytical processes, assays and data interpretation. The utility of HTS must be defined carefully within established operating conditions and tolerances. Validation is essential in the development and implementation of microbial forensics methods used for formulating investigative leads attribution. HTS strategies vary, requiring guiding principles for HTS system validation. Three initial aspects of HTS, irrespective of chemistry, instrumentation or software are: 1) sample preparation, 2) sequencing, and 3) data analysis. Criteria that should be considered for HTS validation for microbial forensics are presented here. Validation should be defined in terms of specific application and the criteria described here comprise a foundation for investigators to establish, validate and implement HTS as a tool in microbial forensics, enhancing public safety and national security.
Affiliation(s)
- Bruce Budowle
  - Department of Molecular and Medical Genetics, Institute of Applied Genetics, University of North Texas Health Science Center, Fort Worth, Texas, USA
  - Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
- Nancy D Connell
  - Rutgers New Jersey Medical School, Center for Biodefense, Rutgers University, Newark, New Jersey, USA
- Anna Bielecka-Oder
  - Department of Epidemiology, The General K. Kaczkowski Military Institute of Hygiene and Epidemiology, Warsaw, Poland
- Rita R Colwell
  - CosmosID®, 387 Technology Dr, College Park, MD, USA
  - Maryland Pathogen Research Institute, University of Maryland, College Park, MD, USA
  - University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA
  - Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
- Cindi R Corbett
  - Bioforensics Assay Development and Diagnostics Section, Science Technology and Core Services Division, National Microbiology Laboratory, Winnipeg, MB, Canada
  - Department of Medical Microbiology, University of Manitoba, Winnipeg, Canada
- Jacqueline Fletcher
  - National Institute for Microbial Forensics & Food and Agricultural Biosecurity, Oklahoma State University, Stillwater, OK, USA
- Mats Forsman
  - Division of CBRN Defence and Security, Swedish Defence Research Agency, Umeå, Sweden
- Alemka Markotic
  - University Hospital for Infectious Diseases “Fran Mihaljevic” and Medical School University of Rijeka, Zagreb, Croatia
- Stephen A Morse
  - Division of Foodborne, Waterborne, and Environmental Diseases, National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, Georgia
- Antti Sajantila
  - Department of Molecular and Medical Genetics, Institute of Applied Genetics, University of North Texas Health Science Center, Fort Worth, Texas, USA
  - Department of Forensic Medicine, Hjelt Institute, University of Helsinki, Helsinki, Finland
- Sarah E Schmedes
  - Department of Molecular and Medical Genetics, Institute of Applied Genetics, University of North Texas Health Science Center, Fort Worth, Texas, USA
- Stephen D Turner
  - Public Health Sciences, Bioinformatics Core Director, University of Virginia School of Medicine, Charlottesville, VA, USA
|
30
|
Wang Y, Leung HCM, Yiu SM, Chin FYL. MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning. BMC Genomics 2014; 15 Suppl 1:S12. [PMID: 24564377 PMCID: PMC4046714 DOI: 10.1186/1471-2164-15-s1-s12] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Indexed: 12/18/2022]
Abstract
BACKGROUND: Taxonomic annotation of reads is an important problem in metagenomic analysis. Existing annotation tools, which rely on the approach of aligning each read to the taxonomic structure, are unable to annotate many reads efficiently and accurately as reads (~100 bp) are short and most of them come from unknown genomes. Previous work has suggested assembling the reads to make longer contigs before annotation. More reads/contigs can be annotated as a longer contig (in Kbp) can be aligned to a taxon even if it is from an unknown species as long as it contains a conserved region of that taxon. Unfortunately existing metagenomic assembly tools are not mature enough to produce long enough contigs. Binning tries to group reads/contigs of similar species together. Intuitively, reads in the same group (cluster) should be annotated to the same taxon and these reads altogether should cover a significant portion of the genome alleviating the problem of short contigs if the quality of binning is high. However, no existing work has tried to use binning results to help solve the annotation problem. This work explores this direction. RESULTS: In this paper, we describe MetaCluster-TA, an assembly-assisted binning-based annotation tool which relies on an innovative idea of annotating binned reads instead of aligning each read or contig to the taxonomic structure separately. We propose the novel concept of the 'virtual contig' (which can be up to 10 Kb in length) to represent a set of reads and then represent each cluster as a set of 'virtual contigs' (which together can total up to 1 Mb in length) for annotation. MetaCluster-TA can outperform widely-used MEGAN4 and can annotate (1) more reads since the virtual contigs are much longer; (2) more accurately since each cluster of long virtual contigs contains global information of the sampled genome which tends to be more accurate than short reads or assembled contigs which contain only local information of the genome; and (3) more efficiently since there are much fewer long virtual contigs to align than short reads. MetaCluster-TA outperforms MetaCluster 5.0 as a binning tool since binning itself can be more sensitive and precise given long virtual contigs and the binning results can be improved using the reference taxonomic database. CONCLUSIONS: MetaCluster-TA can outperform widely-used MEGAN4 and can annotate more reads with higher accuracy and higher efficiency. It also outperforms MetaCluster 5.0 as a binning tool.
Affiliation(s)
- Yi Wang
  - Department of Computer Science, The University of Hong Kong, Hong Kong
- Henry Chi Ming Leung
  - Department of Computer Science, The University of Hong Kong, Hong Kong
- Siu Ming Yiu
  - Department of Computer Science, The University of Hong Kong, Hong Kong
- Francis Yuk Lun Chin
  - Department of Computer Science, The University of Hong Kong, Hong Kong
|
31
|
Liao R, Zhang R, Guan J, Zhou S. A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:42-54. [PMID: 26355506 DOI: 10.1109/tcbb.2013.137] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 06/05/2023]
Abstract
The rapid development of high-throughput technologies enables researchers to sequence the whole metagenome of a microbial community sampled directly from the environment. The assignment of these sequence reads into different species or taxonomical classes is a crucial step for metagenomic analysis, which is referred to as binning of metagenomic data. Most traditional binning methods rely on known reference genomes for accurate assignment of the sequence reads, and therefore cannot classify reads from unknown species without the help of close references. To overcome this drawback, unsupervised learning based approaches have been proposed, which do not need any known species' reference genome. In this paper, we introduce a novel unsupervised method called MCluster for binning metagenomic sequences. This method uses N-grams to extract sequence features and utilizes automatic feature weighting to improve the performance of the basic K-means clustering algorithm. We evaluate MCluster on a variety of simulated data sets and a real data set, and compare it with three recent binning methods: AbundanceBin, MetaCluster 3.0, and MetaCluster 5.0. Experimental results show that MCluster achieves clearly better overall performance (F-measure) than AbundanceBin and MetaCluster 3.0 on long metagenomic reads (≥800 bp); while compared with MetaCluster 5.0, MCluster obtains a larger sensitivity, and a comparable yet more stable F-measure on short metagenomic reads (<300 bp). This suggests that MCluster can serve as a promising tool for effectively binning metagenomic sequences.
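The combination of N-gram features and feature-weighted K-means described in this abstract can be sketched roughly as follows. This is a minimal illustration, not MCluster itself: the abstract does not state the weighting rule, so the ratio of total to within-cluster dispersion used here is an assumption, as are the function names `ngram_profile` and `weighted_kmeans` and the deterministic initialization.

```python
from itertools import product


def ngram_profile(seq, n=2):
    """Normalized counts of every DNA n-gram in seq, in a fixed key order."""
    keys = ["".join(p) for p in product("ACGT", repeat=n)]
    counts = dict.fromkeys(keys, 0)
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        if gram in counts:  # n-grams with N or other symbols are skipped
            counts[gram] += 1
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in keys]


def weighted_dist(x, c, w):
    """Squared Euclidean distance with one weight per feature."""
    return sum(wj * (xj - cj) ** 2 for xj, cj, wj in zip(x, c, w))


def weighted_kmeans(points, k, iters=10, eps=1e-6):
    """K-means whose feature weights are re-estimated every iteration.

    Assumption (not from the abstract): a feature's weight is proportional
    to its total dispersion divided by its within-cluster dispersion.
    """
    dim = len(points[0])
    centers = [list(p) for p in points[:k]]  # hypothetical init: first k points
    weights = [1.0 / dim] * dim
    labels = [0] * len(points)
    mean = [sum(col) / len(points) for col in zip(*points)]
    total = [sum((p[d] - mean[d]) ** 2 for p in points) for d in range(dim)]
    for _ in range(iters):
        # Assign each point to the nearest center under the current weights.
        labels = [min(range(k), key=lambda j: weighted_dist(p, centers[j], weights))
                  for p in points]
        # Recompute each non-empty cluster's center.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Re-estimate the feature weights and normalize them to sum to 1.
        within = [sum((p[d] - centers[lab][d]) ** 2 for p, lab in zip(points, labels))
                  for d in range(dim)]
        raw = [t / (w + eps) for t, w in zip(total, within)]
        norm = sum(raw)
        weights = [r / norm for r in raw] if norm > 0 else [1.0 / dim] * dim
    return labels, weights


reads = ["ATATTATAATAT", "GCGGCCGCGGCC", "TATAATATTATA", "CCGCGGCGCCGG"]
labels, weights = weighted_kmeans([ngram_profile(r) for r in reads], k=2)
```

In this toy input the AT-rich and GC-rich reads alternate, so the first two points seed one cluster each; real data would need a more careful initialization.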
|
32
|
Wu YW, Tang YH, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2014; 2:26. [PMID: 25136443 PMCID: PMC4129434 DOI: 10.1186/2049-2618-2-26] [Citation(s) in RCA: 392] [Impact Index Per Article: 39.2] [Received: 03/04/2014] [Accepted: 06/04/2014] [Indexed: 05/11/2023]
Abstract
BACKGROUND: Recovering individual genomes from metagenomic datasets allows access to uncultivated microbial populations that may have important roles in natural and engineered ecosystems. Understanding the roles of these uncultivated populations has broad application in ecology, evolution, biotechnology and medicine. Accurate binning of assembled metagenomic sequences is an essential step in recovering the genomes and understanding microbial functions. RESULTS: We have developed a binning algorithm, MaxBin, which automates the binning of assembled metagenomic scaffolds using an expectation-maximization algorithm after the assembly of metagenomic sequencing reads. Binning of simulated metagenomic datasets demonstrated that MaxBin had high levels of accuracy in binning microbial genomes. MaxBin was used to recover genomes from metagenomic data obtained through the Human Microbiome Project, which demonstrated its ability to recover genomes from real metagenomic datasets with variable sequencing coverages. Application of MaxBin to metagenomes obtained from microbial consortia adapted to grow on cellulose allowed genomic analysis of new, uncultivated, cellulolytic bacterial populations, including an abundant myxobacterial population distantly related to Sorangium cellulosum that possessed a much smaller genome (5 Mb versus 13 to 14 Mb) but has a more extensive set of genes for biomass deconstruction. For the cellulolytic consortia, the MaxBin results were compared to binning using emergent self-organizing maps (ESOMs) and differential coverage binning, demonstrating that it performed comparably to these methods but had distinct advantages in automation, resolution of related genomes and sensitivity. CONCLUSIONS: The automatic binning software that we developed successfully classifies assembled sequences in metagenomic datasets into recovered individual genomes. The isolation of dozens of species in cellulolytic microbial consortia, including a novel species of myxobacteria that has the smallest genome among all sequenced aerobic myxobacteria, was easily achieved using the binning software. This work demonstrates that the processes required for recovering genomes from assembled metagenomic datasets can be readily automated, an important advance in understanding the metabolic potential of microbes in natural environments. MaxBin is available at https://sourceforge.net/projects/maxbin/.
Affiliation(s)
- Yu-Wei Wu
  - Joint BioEnergy Institute, Emeryville, CA 94608, USA
  - Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Yung-Hsu Tang
  - Joint BioEnergy Institute, Emeryville, CA 94608, USA
  - City College of San Francisco, San Francisco, CA 94112, USA
- Susannah G Tringe
  - Joint Genome Institute, Walnut Creek, CA 94598, USA
  - Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Blake A Simmons
  - Joint BioEnergy Institute, Emeryville, CA 94608, USA
  - Biological and Materials Sciences Center, Sandia National Laboratories, Livermore, CA 94551, USA
- Steven W Singer
  - Joint BioEnergy Institute, Emeryville, CA 94608, USA
  - Earth Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
|
33
|
Carr R, Shen-Orr SS, Borenstein E. Reconstructing the genomic content of microbiome taxa through shotgun metagenomic deconvolution. PLoS Comput Biol 2013; 9:e1003292. [PMID: 24146609 PMCID: PMC3798274 DOI: 10.1371/journal.pcbi.1003292] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 04/02/2013] [Accepted: 09/06/2013] [Indexed: 01/21/2023]
Abstract
Metagenomics has transformed our understanding of the microbial world, allowing researchers to bypass the need to isolate and culture individual taxa and to directly characterize both the taxonomic and gene compositions of environmental samples. However, associating the genes found in a metagenomic sample with the specific taxa of origin remains a critical challenge. Existing binning methods, based on nucleotide composition or alignment to reference genomes allow only a coarse-grained classification and rely heavily on the availability of sequenced genomes from closely related taxa. Here, we introduce a novel computational framework, integrating variation in gene abundances across multiple samples with taxonomic abundance data to deconvolve metagenomic samples into taxa-specific gene profiles and to reconstruct the genomic content of community members. This assembly-free method is not bounded by various factors limiting previously described methods of metagenomic binning or metagenomic assembly and represents a fundamentally different approach to metagenomic-based genome reconstruction. An implementation of this framework is available at http://elbo.gs.washington.edu/software.html. We first describe the mathematical foundations of our framework and discuss considerations for implementing its various components. We demonstrate the ability of this framework to accurately deconvolve a set of metagenomic samples and to recover the gene content of individual taxa using synthetic metagenomic samples. We specifically characterize determinants of prediction accuracy and examine the impact of annotation errors on the reconstructed genomes. We finally apply metagenomic deconvolution to samples from the Human Microbiome Project, successfully reconstructing genus-level genomic content of various microbial genera, based solely on variation in gene count. These reconstructed genera are shown to correctly capture genus-specific properties. With the accumulation of metagenomic data, this deconvolution framework provides an essential tool for characterizing microbial taxa never before seen, laying the foundation for addressing fundamental questions concerning the taxa comprising diverse microbial communities.
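One way to read the deconvolution idea in this abstract is as a linear model: the abundance of a gene in each sample is approximately the sum, over taxa, of the taxon's abundance times that taxon's copy number of the gene. The sketch below solves that model by ordinary least squares for a single gene. It is an illustration under that assumed model, not the authors' implementation; `solve` and `deconvolve_gene` are hypothetical helper names and the input numbers are invented toy values.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("taxon abundance profiles are not linearly independent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def deconvolve_gene(taxa_abund, gene_abund):
    """Least-squares estimate of one gene's copy number in each taxon.

    taxa_abund: one row per sample, one column per taxon.
    gene_abund: the gene's measured abundance in each sample.
    Solves the normal equations (A^T A) g = A^T e.
    """
    k = len(taxa_abund[0])
    ata = [[sum(row[i] * row[j] for row in taxa_abund) for j in range(k)]
           for i in range(k)]
    ate = [sum(row[i] * e for row, e in zip(taxa_abund, gene_abund))
           for i in range(k)]
    return solve(ata, ate)


# Toy data: three samples, two taxa; the gene is present only in taxon 0.
taxa_abund = [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]
gene_abund = [0.7, 0.4, 0.1]
copies = deconvolve_gene(taxa_abund, gene_abund)
```

The estimate is only identifiable when taxon abundances vary across samples, which is why the abstract stresses variation across multiple samples; a constrained solver (for example non-negative least squares) would be a natural refinement.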
Affiliation(s)
- Rogan Carr
  - Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Shai S. Shen-Orr
  - Department of Immunology, Rappaport Institute of Medical Research, Faculty of Medicine and Faculty of Biology, Technion, Haifa, Israel
- Elhanan Borenstein
  - Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
  - Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America
  - Santa Fe Institute, Santa Fe, New Mexico, United States of America
|
34
|
Wang J, McLenachan PA, Biggs PJ, Winder LH, Schoenfeld BIK, Narayan VV, Phiri BJ, Lockhart PJ. Environmental bio-monitoring with high-throughput sequencing. Brief Bioinform 2013; 14:575-88. [DOI: 10.1093/bib/bbt032] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 12/22/2022]
|
35
|
Wang Y, Leung HCM, Yiu SM, Chin FYL. MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics 2013; 28:i356-i362. [PMID: 22962452 PMCID: PMC3436824 DOI: 10.1093/bioinformatics/bts397] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.8] [Indexed: 11/29/2022]
Abstract
Motivation: Metagenomic binning remains an important topic in metagenomic analysis. Existing unsupervised binning methods for next-generation sequencing (NGS) reads do not perform well on (i) samples with low-abundance species or (ii) samples (even with high abundance) when there are many extremely low-abundance species. These two problems are common for real metagenomic datasets. Binning methods that can solve these problems are desirable. Results: We proposed a two-round binning method (MetaCluster 5.0) that aims at identifying both low-abundance and high-abundance species in the presence of a large amount of noise due to many extremely low-abundance species. In summary, MetaCluster 5.0 uses a filtering strategy to remove noise from the extremely low-abundance species. It separates reads of high-abundance species from those of low-abundance species in two different rounds. To overcome the issue of low coverage for low-abundance species, multiple w values are used to group reads with overlapping w-mers, whereas reads from high-abundance species are grouped with high confidence based on a large w and then binning expands to low-abundance species using a relaxed (shorter) w. Compared to the recent tools, TOSS and MetaCluster 4.0, MetaCluster 5.0 can find more species (especially those with low abundance of say 6× to 10×) and can achieve better sensitivity and specificity using less memory and running time. Availability: http://i.cs.hku.hk/~alse/MetaCluster/ Contact: chin@cs.hku.hk
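The idea of grouping reads that share w-mers, first with a strict (large) w and then with a relaxed (shorter) w, can be illustrated with a small union-find sketch. This is only a toy reading of the abstract, not MetaCluster 5.0: the function name `group_by_shared_wmers` is hypothetical, a single shared w-mer is assumed to be enough to link two reads, and reverse complements, the noise filtering step and abundance estimation are all ignored.

```python
from collections import defaultdict


def group_by_shared_wmers(reads, w):
    """Group read indices so that reads sharing at least one w-mer are together."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # w-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for start in range(len(read) - w + 1):
            wmer = read[start:start + w]
            if wmer in first_seen:
                ra, rb = find(idx), find(first_seen[wmer])
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)  # merge the two groups
            else:
                first_seen[wmer] = idx

    groups = defaultdict(list)
    for idx in range(len(reads)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


reads = ["AAACCCGGGTTT", "CCGGGTTTACGT", "ACGTTGCA"]
strict = group_by_shared_wmers(reads, w=8)   # only long overlaps link reads
relaxed = group_by_shared_wmers(reads, w=4)  # shorter w links more reads
```

With `w=8` only the first two reads are linked, while `w=4` also pulls in the third read through the shared 4-mer `ACGT`, which mirrors the trade-off the abstract describes between confidence and sensitivity.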
Affiliation(s)
- Yi Wang
  - Department of Computer Science, The University of Hong Kong, Hong Kong
|
36
|
Segata N, Boernigen D, Tickle TL, Morgan XC, Garrett WS, Huttenhower C. Computational meta'omics for microbial community studies. Mol Syst Biol 2013; 9:666. [PMID: 23670539 PMCID: PMC4039370 DOI: 10.1038/msb.2013.22] [Citation(s) in RCA: 185] [Impact Index Per Article: 16.8] [Received: 01/02/2013] [Accepted: 04/03/2013] [Indexed: 12/16/2022]
Abstract
Complex microbial communities are an integral part of the Earth's ecosystem and of our bodies in health and disease. In the last two decades, culture-independent approaches have provided new insights into their structure and function, with the exponentially decreasing cost of high-throughput sequencing resulting in broadly available tools for microbial surveys. However, the field remains far from reaching a technological plateau, as both computational techniques and nucleotide sequencing platforms for microbial genomic and transcriptional content continue to improve. Current microbiome analyses are thus starting to adopt multiple and complementary meta'omic approaches, leading to unprecedented opportunities to comprehensively and accurately characterize microbial communities and their interactions with their environments and hosts. This diversity of available assays, analysis methods, and public data is in turn beginning to enable microbiome-based predictive and modeling tools. We thus review here the technological and computational meta'omics approaches that are already available, those that are under active development, their success in biological discovery, and several outstanding challenges.
Affiliation(s)
- Nicola Segata
  - Biostatistics Department, Harvard School of Public Health, Boston, MA, USA
  - Present address: Centre for Integrative Biology, University of Trento, Trento, Italy
- Daniela Boernigen
  - Biostatistics Department, Harvard School of Public Health, Boston, MA, USA
  - The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Timothy L Tickle
  - Biostatistics Department, Harvard School of Public Health, Boston, MA, USA
  - The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Xochitl C Morgan
  - Biostatistics Department, Harvard School of Public Health, Boston, MA, USA
  - The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Wendy S Garrett
  - The Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Immunology and Infectious Diseases, Harvard School of Public Health, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
- Curtis Huttenhower
  - Biostatistics Department, Harvard School of Public Health, Boston, MA, USA
  - The Broad Institute of MIT and Harvard, Cambridge, MA, USA
|
37
|
Blainey PC. The future is now: single-cell genomics of bacteria and archaea. FEMS Microbiol Rev 2013; 37:407-27. [PMID: 23298390 PMCID: PMC3878092 DOI: 10.1111/1574-6976.12015] [Citation(s) in RCA: 196] [Impact Index Per Article: 17.8] [Received: 06/07/2012] [Revised: 11/28/2012] [Accepted: 12/20/2012] [Indexed: 01/08/2023]
Abstract
Interest in the expanding catalog of uncultivated microorganisms, increasing recognition of heterogeneity among seemingly similar cells, and technological advances in whole-genome amplification and single-cell manipulation are driving considerable progress in single-cell genomics. Here, the spectrum of applications for single-cell genomics, key advances in the development of the field, and emerging methodology for single-cell genome sequencing are reviewed by example with attention to the diversity of approaches and their unique characteristics. Experimental strategies transcending specific methodologies are identified and organized as a road map for future studies in single-cell genomics of environmental microorganisms. Over the next decade, increasingly powerful tools for single-cell genome sequencing and analysis will play key roles in accessing the genomes of uncultivated organisms, determining the basis of microbial community functions, and fundamental aspects of microbial population biology.
|
38
|
Leis B, Angelov A, Liebl W. Screening and expression of genes from metagenomes. Advances in Applied Microbiology 2013; 83:1-68. [PMID: 23651593 DOI: 10.1016/b978-0-12-407678-5.00001-5] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Indexed: 01/25/2023]
Abstract
Microorganisms are the most abundant and widely spread organisms on earth. They colonize a huge variety of natural and anthropogenic environments, including very specialized ecological niches and even extreme habitats, which are made possible by the immense metabolic diversity and genetic adaptability of microbes. As most of the organisms from environmental samples defy cultivation, cultivation-independent metagenomics approaches have been applied for more than a decade to access and characterize the phylogenetic diversity in microbial communities as well as their metabolic potential and ecological functions. Thereby, metagenomics has fully emerged as a scientific field in its own right for mining new biocatalysts for many industrially relevant processes in biotechnology and pharmaceutics. This review summarizes common metagenomic approaches ranging from sampling, isolation of nucleic acids, construction of metagenomic libraries and their evaluation. Sequence-based screenings implement next-generation sequencing platforms, microarrays or PCR-based methods, while function-based analysis covers heterologous expression of metagenomic libraries in diverse screening setups. Major constraints and advantages of each strategy are described. The importance of alternative host-vector systems is discussed, and in order to underline the role of phylogenetic and physiological distance from the gene donor and the expression host employed, a case study is presented that describes the screening of a genomic library from an extreme thermophilic bacterium in both Escherichia coli and Thermus thermophilus. Metatranscriptomics, metaproteomics and single-cell-based methods are expected to complement metagenomic screening efforts to identify novel biocatalysts from environmental samples.
Affiliation(s)
- Benedikt Leis
  - Lehrstuhl für Mikrobiologie, Technische Universität München, Freising, Bavaria, Germany
|
39
|
Fast Comparison of Microbial Genomes Using the Chaos Games Representation for Metagenomic Applications. Procedia Computer Science 2013. [DOI: 10.1016/j.procs.2013.05.304] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/20/2022]
|
40
|
Maillet N, Lemaitre C, Chikhi R, Lavenier D, Peterlongo P. Compareads: comparing huge metagenomic experiments. BMC Bioinformatics 2012; 13 Suppl 19:S10. [PMID: 23282463 PMCID: PMC3526429 DOI: 10.1186/1471-2105-13-s19-s10] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Indexed: 01/08/2023]
Abstract
Background: Nowadays, metagenomic sample analyses are mainly achieved by comparing them with a priori knowledge stored in data banks. While powerful, such approaches cannot exploit unknown and/or "unculturable" species, for instance estimated at 99% for Bacteria. Methods: This work introduces Compareads, a de novo comparative metagenomic approach that returns the reads that are similar between two possibly metagenomic datasets generated by High Throughput Sequencers. One original feature of this work is its ability to deal with huge datasets. The second main contribution presented in this paper is the design of a probabilistic data structure based on Bloom filters that can index millions of reads with a limited memory footprint and a controlled error rate. Results: We show that Compareads can retrieve biological information while being able to scale to huge datasets. Its time and memory features make Compareads usable on read sets each composed of more than 100 million Illumina reads in a few hours and consuming 4 GB of memory, and thus usable on today's personal computers. Conclusion: Using a new data structure, Compareads is a practical solution for de novo comparison of huge metagenomic samples. Compareads is released under the CeCILL license and can be freely downloaded from http://alcovna.genouest.org/compareads/.
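The Bloom-filter indexing mentioned in this abstract can be sketched as follows: k-mers of one read set are inserted into a bit array through several hash functions, and reads of a second set are reported when enough of their k-mers appear to be present. This is a generic Bloom filter for illustration, not the Compareads data structure. The class and function names, the SHA-256 based hashing, the filter size and the `min_shared` threshold are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash functions; may give false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def similar_reads(query_reads, indexed_reads, k=8, min_shared=2):
    """Return query reads with at least min_shared k-mers found in the index."""
    bloom = BloomFilter()
    for read in indexed_reads:
        for km in kmers(read, k):
            bloom.add(km)
    return [r for r in query_reads
            if sum(km in bloom for km in kmers(r, k)) >= min_shared]


hits = similar_reads(["CGTACGTTAG", "GGGGGGGGGGGG"], ["ACGTACGTTAGC"])
```

A Bloom filter never misses an inserted k-mer but can report a k-mer that was never inserted; the bit-array size and number of hash functions control how often that happens, which is the "controlled error rate" trade-off the abstract refers to.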
Affiliation(s)
- Nicolas Maillet
  - INRIA Rennes - Bretagne Atlantique/IRISA, EPI GenScale, Rennes, France.
|
41
|
Hu Z, Speth DR, Francoijs KJ, Quan ZX, Jetten MSM. Metagenome Analysis of a Complex Community Reveals the Metabolic Blueprint of Anammox Bacterium "Candidatus Jettenia asiatica". Front Microbiol 2012; 3:366. [PMID: 23112795 PMCID: PMC3482989 DOI: 10.3389/fmicb.2012.00366] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Received: 06/15/2012] [Accepted: 09/27/2012] [Indexed: 11/30/2022]
Abstract
Anaerobic ammonium-oxidizing (anammox) bacteria are key players in the global nitrogen cycle and responsible for significant global nitrogen loss. Moreover, the anammox process is widely implemented for nitrogen removal from wastewaters as a cost-effective and environment-friendly alternative to conventional nitrification-denitrification systems. Currently, five genera of anammox bacteria have been identified, together forming a deep-branching order in the Planctomycetes-Verrucomicrobium-Chlamydiae superphylum. Members of all genera have been detected in wastewater treatment plants and have been enriched in lab-scale bioreactors, but genome information is not yet available for all genera. Here we report the metagenomic analysis of a granular sludge anammox reactor dominated (∼50%) by “Candidatus Jettenia asiatica.” The metagenome was sequenced using both Illumina and 454 pyrosequencing. After de novo assembly 37,432 contigs with an average length of 571 nt were obtained. The contigs were then analyzed by BLASTx searches against the protein sequences of “Candidatus Kuenenia stuttgartiensis” and a set of 25 genes essential in anammox metabolism were detected. Additionally all reads were mapped to the genome of an anammox strain KSU-1 and de novo assembly was performed again using the reads that could be mapped on KSU-1. Using this approach, a gene encoding copper-containing nitrite reductase NirK was identified in the genome, instead of cytochrome cd1-type nitrite reductase (NirS, present in “Ca. Kuenenia stuttgartiensis” and “Ca. Scalindua profunda”). Finally, the community composition was investigated through MetaCluster analysis, 16S rRNA gene analysis and read mapping, which showed the presence of other important community members such as aerobic ammonia-oxidizing bacteria, methanogens, and the denitrifying methanotroph “Ca. Methylomirabilis oxyfera”, indicating a possible active methane and nitrogen cycle in the bioreactor under the prevailing operational conditions.
Affiliation(s)
- Ziye Hu
- Department of Microbiology, Institute for Water and Wetland Research, Radboud University Nijmegen, Nijmegen, Netherlands
|
42
|
Fancello L, Raoult D, Desnues C. Computational tools for viral metagenomics and their application in clinical research. Virology 2012; 434:162-74. [PMID: 23062738 PMCID: PMC7111993 DOI: 10.1016/j.virol.2012.09.025] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.5] [Received: 08/13/2012] [Revised: 09/15/2012] [Accepted: 09/23/2012] [Indexed: 02/06/2023]
Abstract
There are 100 times more virions than eukaryotic cells in a healthy human body. The characterization of human-associated viral communities in a non-pathological state and the detection of viral pathogens in cases of infection are essential for medical care and epidemic surveillance. Viral metagenomics, the sequence-based analysis of the complete collection of viral genomes directly isolated from an organism or an ecosystem, bypasses the “single-organism-level” point of view of clinical diagnostics and thus the need to isolate and culture the targeted organism. The first part of this review is dedicated to a presentation of past research in viral metagenomics with an emphasis on human-associated viral communities (eukaryotic viruses and bacteriophages). In the second part, we review more precisely the computational challenges posed by the analysis of viral metagenomes, and we illustrate the problem of sequences that do not have homologs in public databases and the possible approaches to characterize them.
Affiliation(s)
- L Fancello
- Aix Marseille University, URMITE, UM63, CNRS 7278, IRD 198, Inserm 1095, 13005 Marseille, France
|