1. Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
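As a rough illustration of the two ideas this abstract names (k-mer counting and absent sequences), the following minimal Python sketch counts k-mers in a toy DNA string and lists the k-mers that never occur in it. The function names `count_kmers` and `absent_kmers` and the toy sequence are assumptions made for this example and are not taken from the review. It also ignores the reverse-complement strand, which a real nullomer search would likely need to consider.

```python
from collections import Counter
from itertools import product

DNA = "ACGT"

def count_kmers(seq, k):
    """Count every overlapping k-mer made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(DNA):  # skip windows containing N or other symbols
            counts[kmer] += 1
    return counts

def absent_kmers(seq, k):
    """Return all 4**k possible k-mers that do not occur in seq (toy 'nullomers')."""
    observed = count_kmers(seq, k)
    every = ("".join(p) for p in product(DNA, repeat=k))
    return sorted(kmer for kmer in every if kmer not in observed)

# Hypothetical toy input, far shorter than any real genome.
toy = "ACGTACGTGA"
counts = count_kmers(toy, 2)
missing = absent_kmers(toy, 2)
```

Enumerating all 4^k candidates is only practical for small k; for longer k-mers a dedicated counter or a hashed set would probably be the better choice.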
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA

2. Asnicar F, Thomas AM, Passerini A, Waldron L, Segata N. Machine learning for microbiologists. Nat Rev Microbiol 2024; 22:191-205. [PMID: 37968359 DOI: 10.1038/s41579-023-00984-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Accepted: 10/03/2023] [Indexed: 11/17/2023]
Abstract
Machine learning is increasingly important in microbiology where it is used for tasks such as predicting antibiotic resistance and associating human microbiome features with complex host diseases. The applications in microbiology are quickly expanding and the machine learning tools frequently used in basic and clinical research range from classification and regression to clustering and dimensionality reduction. In this Review, we examine the main machine learning concepts, tasks and applications that are relevant for experimental and clinical microbiologists. We provide the minimal toolbox for a microbiologist to be able to understand, interpret and use machine learning in their experimental and translational activities.
Affiliation(s)
- Francesco Asnicar
  - Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Andrew Maltez Thomas
  - Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Andrea Passerini
  - Department of Information Engineering and Computer Science, University of Trento, Trento, Italy
- Levi Waldron
  - Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
  - Department of Epidemiology and Biostatistics, City University of New York, New York, NY, USA
- Nicola Segata
  - Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
  - Department of Experimental Oncology, European Institute of Oncology IRCCS, Milan, Italy

3. de Medeiros Azevedo T, Aburjaile FF, Ferreira-Neto JRC, Pandolfi V, Benko-Iseppon AM. The endophytome (plant-associated microbiome): methodological approaches, biological aspects, and biotech applications. World J Microbiol Biotechnol 2021; 37:206. [PMID: 34708327 DOI: 10.1007/s11274-021-03168-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/22/2021] [Accepted: 10/05/2021] [Indexed: 11/25/2022]
Abstract
Similar to other organisms, plants establish interactions with a variety of microorganisms in their natural environment. The plant microbiome occupies the host plant's tissues, either internally or on its surfaces, showing interactions that can assist in its growth, development, and adaptation to face environmental stresses. The advance of metagenomics and metatranscriptomics approaches has strongly driven the study and recognition of plant microbiome impacts. Research in this regard provides comprehensive information about the taxonomic and functional aspects of microbial plant communities, contributing to a better understanding of their dynamics. Evidence of the plant microbiome's functional potential has boosted its exploitation to develop more ecological and sustainable agricultural practices that impact human health. Although microbial inoculants' development and use are promising to revolutionize crop production, interdisciplinary studies are needed to identify new candidates and promote effective practical applications. On the other hand, there are challenges in understanding and analyzing complex data generated within a plant microbiome project's scope. This review presents aspects about the complex structuring and assembly of the microbiome in the host plant's tissues, metagenomics, and metatranscriptomics approaches for its understanding, covering descriptions of recent studies concerning metagenomics to characterize the microbiome of non-model plants under different aspects. Studies involving bio-inoculants, isolated from plant microbial communities, capable of assisting in crops' productivity, are also reviewed.
Affiliation(s)
- Thamara de Medeiros Azevedo
  - Departamento de Genética, Centro de Biociências, Universidade Federal de Pernambuco (UFPE), Av. Prof. Moraes Rego, 1235 - Cidade Universitária, Recife, PE, CEP: 50670-901, Brazil
- Flávia Figueira Aburjaile
  - Departamento de Genética, Centro de Biociências, Universidade Federal de Pernambuco (UFPE), Av. Prof. Moraes Rego, 1235 - Cidade Universitária, Recife, PE, CEP: 50670-901, Brazil
- José Ribamar Costa Ferreira-Neto
  - Departamento de Genética, Centro de Biociências, Universidade Federal de Pernambuco (UFPE), Av. Prof. Moraes Rego, 1235 - Cidade Universitária, Recife, PE, CEP: 50670-901, Brazil
- Valesca Pandolfi
  - Departamento de Genética, Centro de Biociências, Universidade Federal de Pernambuco (UFPE), Av. Prof. Moraes Rego, 1235 - Cidade Universitária, Recife, PE, CEP: 50670-901, Brazil
- Ana Maria Benko-Iseppon
  - Departamento de Genética, Centro de Biociências, Universidade Federal de Pernambuco (UFPE), Av. Prof. Moraes Rego, 1235 - Cidade Universitária, Recife, PE, CEP: 50670-901, Brazil

4. Johns H, Bernhardt J, Churilov L. Distance-based Classification and Regression Trees for the analysis of complex predictors in health and medical research. Stat Methods Med Res 2021; 30:2085-2104. [PMID: 34319834 DOI: 10.1177/09622802211032712] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022]
Abstract
Predicting patient outcomes based on patient characteristics and care processes is a common task in medical research. Such predictive features are often multifaceted and complex, and are usually simplified into one or more scalar variables to facilitate statistical analysis. This process, while necessary, results in a loss of important clinical detail. While this loss may be prevented by using distance-based predictive methods which better represent complex healthcare features, the statistical literature on such methods is limited, and the range of tools facilitating distance-based analysis is substantially smaller than those of other methods. Consequently, medical researchers must choose to either reduce complex predictive features to scalar variables to facilitate analysis, or instead use a limited number of distance-based predictive methods which may not fulfil the needs of the analysis problem at hand. We address this limitation by developing a Distance-Based extension of Classification and Regression Trees (DB-CART) capable of making distance-based predictions of categorical, ordinal and numeric patient outcomes. We also demonstrate how this extension is compatible with other extensions to CART, including a recently published method for predicting care trajectories in chronic disease. We demonstrate DB-CART by using it to expand upon previously published dose-response analysis of stroke rehabilitation data. Our method identified additional detail not captured by the previously published analysis, reinforcing previous conclusions. We also demonstrate how by combining DB-CART with other extensions to CART, the method is capable of making predictions about complex, multifaceted outcome data based on complex, multifaceted predictive features.
Affiliation(s)
- Hannah Johns
  - Center for Research Excellence in Stroke Rehabilitation and Brain Recovery, Heidelberg, VIC, Australia
  - Florey Institute of Neuroscience and Mental Health, Heidelberg, VIC, Australia
  - Melbourne Medical School, University of Melbourne, Parkville, VIC, Australia
- Julie Bernhardt
  - Center for Research Excellence in Stroke Rehabilitation and Brain Recovery, Heidelberg, VIC, Australia
  - Florey Institute of Neuroscience and Mental Health, Heidelberg, VIC, Australia
- Leonid Churilov
  - Florey Institute of Neuroscience and Mental Health, Heidelberg, VIC, Australia
  - Melbourne Medical School, University of Melbourne, Parkville, VIC, Australia

5. Karagöz MA, Nalbantoglu OU. Taxonomic classification of metagenomic sequences from Relative Abundance Index profiles using deep learning. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2021.102539] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/18/2022]

6. Hopson LM, Singleton SS, David JA, Basuchoudhary A, Prast-Nielsen S, Klein P, Sen S, Mazumder R. Bioinformatics and machine learning in gastrointestinal microbiome research and clinical application. Progress in Molecular Biology and Translational Science 2020; 176:141-178. [PMID: 33814114 DOI: 10.1016/bs.pmbts.2020.08.011] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/13/2022]
Abstract
The scientific community currently defines the human microbiome as all the bacteria, viruses, fungi, archaea, and eukaryotes that occupy the human body. When considering the variable locations, composition, diversity, and abundance of our microbial symbionts, the sheer volume of microorganisms reaches hundreds of trillions. With the onset of next generation sequencing (NGS), also known as high-throughput sequencing (HTS) technologies, the barriers to studying the human microbiome lowered significantly, making in-depth microbiome research accessible. Certain locations on the human body, such as the gastrointestinal, oral, nasal, and skin microbiomes, have been heavily studied through community-focused projects like the Human Microbiome Project (HMP). In particular, the gastrointestinal microbiome (GM) has received significant attention due to links to neurological, immunological, and metabolic diseases, as well as cancer. Though HTS technologies allow deeper exploration of the GM, data informing the functional characteristics of microbiota and resulting effects on human function or disease are still sparse. This void is compounded by microbiome variability observed among humans through factors like genetics, environment, diet, metabolic activity, and even exercise, making GM research inherently difficult. This chapter describes an interdisciplinary approach to GM research with the goal of mitigating the hindrances of translating findings into a clinical setting. By applying tools and knowledge from microbiology, metagenomics, bioinformatics, machine learning, predictive modeling, and clinical study data from children with treatment-resistant epilepsy, we describe a proof-of-concept approach to clinical translation and precision application of GM research.
Affiliation(s)
- Lindsay M Hopson
  - Department of Biochemistry and Molecular Medicine, The George Washington University, Washington, DC, United States
  - The McCormick Genomic and Proteomic Center, The George Washington University, Washington, DC, United States
- Stephanie S Singleton
  - Department of Biochemistry and Molecular Medicine, The George Washington University, Washington, DC, United States
- John A David
  - Department of Applied Mathematics, Virginia Military Institute, Lexington, VA, United States
- Atin Basuchoudhary
  - Department of Economics and Business, Virginia Military Institute, Lexington, VA, United States
- Stefanie Prast-Nielsen
  - Center for Translational Microbiome Research (CTMR), Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm, Sweden
- Pavel Klein
  - Mid-Atlantic Epilepsy and Sleep Center, Bethesda, MD, United States
- Sabyasachi Sen
  - Department of Biochemistry and Molecular Medicine, The George Washington University, Washington, DC, United States
  - Department of Medicine, The George Washington University, Washington, DC, United States
- Raja Mazumder
  - Department of Biochemistry and Molecular Medicine, The George Washington University, Washington, DC, United States
  - The McCormick Genomic and Proteomic Center, The George Washington University, Washington, DC, United States

7. Pérez-Cobas AE, Gomez-Valero L, Buchrieser C. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses. Microb Genom 2020; 6:mgen000409. [PMID: 32706331 PMCID: PMC7641418 DOI: 10.1099/mgen.0.000409] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 12/18/2019] [Accepted: 06/30/2020] [Indexed: 12/23/2022] Open
Abstract
Metagenomics and marker gene approaches, coupled with high-throughput sequencing technologies, have revolutionized the field of microbial ecology. Metagenomics is a culture-independent method that allows the identification and characterization of organisms from all kinds of samples. Whole-genome shotgun sequencing analyses the total DNA of a chosen sample to determine the presence of micro-organisms from all domains of life and their genomic content. Importantly, the whole-genome shotgun sequencing approach reveals the genomic diversity present, but can also give insights into the functional potential of the micro-organisms identified. The marker gene approach is based on the sequencing of a specific gene region. It allows one to describe the microbial composition based on the taxonomic groups present in the sample. It is frequently used to analyse the biodiversity of microbial ecosystems. Despite its importance, the analysis of metagenomic sequencing and marker gene data is quite a challenge. Here we review the primary workflows and software used for both approaches and discuss the current challenges in the field.
Affiliation(s)
- Ana Elena Pérez-Cobas
  - Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
- Laura Gomez-Valero
  - Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
- Carmen Buchrieser
  - Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France

8. Shah RM, McKenzie EJ, Rosin MT, Jadhav SR, Gondalia SV, Rosendale D, Beale DJ. An Integrated Multi-Disciplinary Perspective for Addressing Challenges of the Human Gut Microbiome. Metabolites 2020; 10:E94. [PMID: 32155792 PMCID: PMC7143645 DOI: 10.3390/metabo10030094] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/23/2020] [Revised: 02/18/2020] [Accepted: 02/27/2020] [Indexed: 02/06/2023] Open
Abstract
Our understanding of the human gut microbiome has grown exponentially. Advances in genome sequencing technologies and metagenomics analysis have enabled researchers to study microbial communities and their potential function within the context of a range of human gut related diseases and disorders. However, up until recently, much of this research has focused on characterizing the gut microbiological community structure and understanding its potential through system wide (meta) genomic and transcriptomic-based studies. Thus far, the functional output of these microbiomes, in terms of protein and metabolite expression, and within the broader context of host-gut microbiome interactions, has been limited. Furthermore, these studies highlight our need to address the issues of individual variation, and of samples as proxies. Here we provide a perspective review of the recent literature that focuses on the challenges of exploring the human gut microbiome, with a strong focus on an integrated perspective applied to these themes. In doing so, we contextualize the experimental and technical challenges of undertaking such studies and provide a framework for capitalizing on the breadth of insight such approaches afford. An integrated perspective of the human gut microbiome and the linkages to human health will pave the way forward for delivering against the objectives of precision medicine, which is targeted to specific individuals and addresses the issues and mechanisms in situ.
Affiliation(s)
- Rohan M. Shah
  - Department of Chemistry and Biotechnology, Faculty of Science, Engineering and Technology, Swinburne University of Technology, Hawthorn, VIC 3122, Australia
  - Land and Water, Commonwealth Scientific and Industrial Research Organization (CSIRO), Dutton Park, QLD 4102, Australia
- Elizabeth J. McKenzie
  - Liggins Institute, The University of Auckland, Grafton, Auckland 1142, New Zealand
- Magda T. Rosin
  - Liggins Institute, The University of Auckland, Grafton, Auckland 1142, New Zealand
- Snehal R. Jadhav
  - Centre for Advanced Sensory Science, School of Exercise and Nutrition Sciences, Deakin University, Burwood, VIC 3125, Australia
- Shakuntla V. Gondalia
  - Centre for Human Psychopharmacology, Swinburne University of Technology, Hawthorn, VIC 3122, Australia
- David J. Beale
  - Land and Water, Commonwealth Scientific and Industrial Research Organization (CSIRO), Dutton Park, QLD 4102, Australia

9. Novoa EM, Jungreis I, Jaillon O, Kellis M. Elucidation of Codon Usage Signatures across the Domains of Life. Mol Biol Evol 2020; 36:2328-2339. [PMID: 31220870 PMCID: PMC6759073 DOI: 10.1093/molbev/msz124] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Indexed: 12/27/2022] Open
Abstract
Because of the degeneracy of the genetic code, multiple codons are translated into the same amino acid. Despite being "synonymous," these codons are not equally used. Selective pressures are thought to drive the choice among synonymous codons within a genome, while GC content, which is typically attributed to mutational drift, is the major determinant of variation across species. Here, we find that in addition to GC content, interspecies codon usage signatures can also be detected. More specifically, we show that a single amino acid, arginine, is the major contributor to codon usage bias differences across domains of life. We then exploit this finding and show that domain-specific codon bias signatures can be used to classify a given sequence into its corresponding domain of life with high accuracy. We then wondered whether the inclusion of codon autocorrelation patterns, which reflect the nonrandom distribution of codon occurrences throughout a transcript, might improve the classification performance of our algorithm. However, we find that autocorrelation patterns are not domain-specific, and surprisingly, are unrelated to tRNA reusage, in contrast to previous reports. Instead, our results suggest that codon autocorrelation patterns are a by-product of codon optimality throughout a sequence, where highly expressed genes display autocorrelated "optimal" codons, whereas lowly expressed genes display autocorrelated "nonoptimal" codons.
Affiliation(s)
- Eva Maria Novoa
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
  - University of New South Wales Sydney, NSW, Australia
- Irwin Jungreis
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Olivier Jaillon
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, Evry, France
- Manolis Kellis
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA

10. K-mer-Based Motif Analysis in Insect Species across Anopheles, Drosophila, and Glossina Genera and Its Application to Species Classification. Computational and Mathematical Methods in Medicine 2019; 2019:4259479. [PMID: 31827584 PMCID: PMC6881769 DOI: 10.1155/2019/4259479] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/21/2019] [Revised: 09/18/2019] [Accepted: 09/28/2019] [Indexed: 11/17/2022]
Abstract
Short k-mer sequences from DNA are both conserved and diverged across species owing to their functional significance in speciation, which enables their use in many species classification algorithms. In the present study, we developed a methodology to analyze the DNA k-mers of whole genome, 5' UTR, intron, and 3' UTR regions from 58 insect species belonging to three genera of Diptera that include Anopheles, Drosophila, and Glossina. We developed an improved algorithm to predict and score k-mers based on a scheme that normalizes k-mer scores in different genomic subregions. This algorithm takes advantage of the information content of the whole genome as opposed to other algorithms or studies that analyze only a small group of genes. Our algorithm uses k-mers of lengths 7-9 bp for the whole genome, 5' and 3' UTR regions as well as the intronic regions. Taxonomical relationships based on the whole-genome k-mer signatures showed that species of the three genera clustered together quite visibly. We also improved the scoring and filtering of these k-mers for accurate species identification. The whole-genome k-mer content correlation algorithm showed that species within a single genus correlated tightly with each other as compared to other genera. The genomes of two Aedes and one Culex species were also analyzed to demonstrate how newly sequenced species can be classified using the algorithm. Furthermore, working with several dozen species has enabled us to assign a whole-genome k-mer signature for each of the 58 Dipteran species by making all-to-all pairwise comparison of the k-mer content. These signatures were used to compare the similarity between species and to identify clusters of species displaying similar signatures.
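The idea of comparing species by the correlation of their k-mer content can be sketched roughly as follows. This is not the authors' scoring or normalization scheme, which the abstract does not specify. It is a generic Pearson correlation between normalized k-mer frequency vectors, with hypothetical function names (`kmer_profile`, `pearson`), toy sequences, and a small k chosen only to keep the example readable (the study reports k of 7-9 bp).

```python
from itertools import product
from math import sqrt

def kmer_profile(seq, k):
    """Relative frequency of every possible DNA k-mer, in a fixed lexicographic order."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = [0.0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # None for windows with non-ACGT symbols
        if j is not None:
            vec[j] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical toy "genomes": a and b differ by one base, c is unrelated.
a = kmer_profile("ACGTACGTACGTACGT", 2)
b = kmer_profile("ACGTACGAACGTACGT", 2)
c = kmer_profile("TTTTTGGGGGCCCCCAAAAA", 2)
similar = pearson(a, b)
dissimilar = pearson(a, c)
```

An all-to-all matrix of such correlations could then be fed to any standard clustering routine to group species with similar signatures.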

11. Yu G, Jiang Y, Wang J, Zhang H, Luo H. BMC3C: binning metagenomic contigs using codon usage, sequence composition and read coverage. Bioinformatics 2019; 34:4172-4179. [PMID: 29947757 DOI: 10.1093/bioinformatics/bty519] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 12/10/2017] [Accepted: 06/26/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation: Metagenomics investigates the DNA sequences directly recovered from environmental samples. It often starts with reads assembly, which leads to contigs rather than more complete genomes. Therefore, contig binning methods are subsequently used to bin contigs into genome bins. While some clustering-based binning methods have been developed, they generally suffer from problems related to stability and robustness.
Results: We introduce BMC3C, an ensemble clustering-based method, to accurately and robustly bin contigs by making use of DNA sequence Composition, Coverage across multiple samples and Codon usage. BMC3C begins by searching the proper number of clusters and repeatedly applying the k-means clustering with different initializations to cluster contigs. Next, a weight graph with each node representing a contig is derived from these clusters. If two contigs are frequently grouped into the same cluster, the weight between them is high, and otherwise low. BMC3C finally employs a graph partitioning technique to partition the weight graph into subgraphs, each corresponding to a genome bin. We conduct experiments on both simulated and real-world datasets to evaluate BMC3C, and compare it with the state-of-the-art binning tools. We show that BMC3C has an improved performance compared to these tools. To our knowledge, this is the first time that the codon usage features and ensemble clustering are used in metagenomic contig binning.
Availability and implementation: The codes of BMC3C are available at http://mlda.swu.edu.cn/codes.php?name=BMC3C.
Supplementary information: Supplementary data are available at Bioinformatics online.
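The ensemble step described in this abstract (repeated k-means runs, a weight graph recording how often two contigs are co-clustered, then a partition of that graph) can be sketched in a few lines. This is a simplified illustration, not the BMC3C implementation: the feature vectors are hypothetical 2-D points rather than composition, coverage and codon-usage features, the k-means is a bare-bones version written for the example, and the graph partitioning step is replaced by a much simpler stand-in (connected components over edges whose weight reaches an assumed threshold of 0.5).

```python
import random

def kmeans(points, k, rng, iters=20):
    """Bare-bones k-means; initial centers are k randomly sampled points."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def coassociation(points, k, runs=25, seed=0):
    """weights[i][j] = fraction of runs in which items i and j share a cluster."""
    rng = random.Random(seed)
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    weights[i][j] += 1.0 / runs
    return weights

def threshold_components(weights, threshold=0.5):
    """Stand-in for graph partitioning: connected components over strong edges."""
    n = len(weights)
    bins = [-1] * n
    current = 0
    for start in range(n):
        if bins[start] != -1:
            continue
        bins[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if bins[j] == -1 and weights[i][j] >= threshold:
                    bins[j] = current
                    stack.append(j)
        current += 1
    return bins

# Hypothetical contig features: two well-separated groups of three.
contigs = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 5.0), (5.2, 5.1)]
weights = coassociation(contigs, k=2)
bins = threshold_components(weights)
```

Averaging over many initializations is what gives the approach its robustness: a single unlucky k-means run changes each edge weight by only 1/runs.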
Affiliation(s)
- Guoxian Yu
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Yuan Jiang
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Jun Wang
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Hao Zhang
  - School of Life Sciences and Partner State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
- Haiwei Luo
  - School of Life Sciences and Partner State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China

12. Vilne B, Meistere I, Grantiņa-Ieviņa L, Ķibilds J. Machine Learning Approaches for Epidemiological Investigations of Food-Borne Disease Outbreaks. Front Microbiol 2019; 10:1722. [PMID: 31447800 PMCID: PMC6691741 DOI: 10.3389/fmicb.2019.01722] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 03/07/2019] [Accepted: 07/12/2019] [Indexed: 12/14/2022] Open
Abstract
Foodborne diseases (FBDs) are infections of the gastrointestinal tract caused by foodborne pathogens (FBPs) such as bacteria [Salmonella, Listeria monocytogenes and Shiga toxin-producing E. coli (STEC)] and several viruses, but also parasites and some fungi. Artificial intelligence (AI) and its sub-discipline machine learning (ML) are re-emerging and gaining an ever increasing popularity in the scientific community and industry, and could lead to actionable knowledge in diverse ranges of sectors including epidemiological investigations of FBD outbreaks and antimicrobial resistance (AMR). As genotyping using whole-genome sequencing (WGS) is becoming more accessible and affordable, it is increasingly used as a routine tool for the detection of pathogens, and has the potential to differentiate between outbreak strains that are closely related, identify virulence/resistance genes and provide improved understanding of transmission events within hours to days. In most cases, the computational pipeline of WGS data analysis can be divided into four (though, not necessarily consecutive) major steps: de novo genome assembly, genome characterization, comparative genomics, and inference of phylogeny or phylogenomics. In each step, ML could be used to increase the speed and potentially the accuracy (provided increasing amounts of high-quality input data) of identification of the source of ongoing outbreaks, leading to more efficient treatment and prevention of additional cases. In this review, we explore whether ML or any other form of AI algorithms have already been proposed for the respective tasks and compare those with mechanistic model-based approaches.
Affiliation(s)
- Baiba Vilne
  - Institute of Food Safety, Animal Health and Environment "BIOR", Riga, Latvia
  - SIA net-OMICS, Riga, Latvia
- Irēna Meistere
  - Institute of Food Safety, Animal Health and Environment "BIOR", Riga, Latvia
- Juris Ķibilds
  - Institute of Food Safety, Animal Health and Environment "BIOR", Riga, Latvia

13. Taxonomy based performance metrics for evaluating taxonomic assignment methods. BMC Bioinformatics 2019; 20:310. [PMID: 31185897 PMCID: PMC6561758 DOI: 10.1186/s12859-019-2896-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/08/2018] [Accepted: 05/13/2019] [Indexed: 02/01/2023] Open
Abstract
Background: Metagenomics experiments often make inferences about microbial communities by sequencing 16S and 18S rRNA, and taxonomic assignment is a fundamental step in such studies. This paper addresses the weaknesses in two types of metrics commonly used by previous studies for measuring the performance of existing taxonomic assignment methods: Sequence count based metrics and Binary error measurement. These metrics made performance evaluation results biased, less informative and mutually incomparable.
Results: We investigated weaknesses in two types of metrics and proposed new performance metrics including Average Taxonomy Distance (ATD) and ATD_by_Taxa, together with the visualized ATD plot.
Conclusions: By comparing the evaluation results from four popular taxonomic assignment methods across three test data sets, we found the new metrics more robust, informative and comparable.
14
Mitra S. Multiple Data Analyses and Statistical Approaches for Analyzing Data from Metagenomic Studies and Clinical Trials. Methods Mol Biol 2019; 1910:605-634. [PMID: 31278679 DOI: 10.1007/978-1-4939-9074-0_20] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 06/09/2023]
Abstract
Metagenomics, also known as environmental genomics, is the study of the genomic content of a sample of organisms (microbes) obtained from a common habitat. Metagenomics and other "omics" disciplines have captured the attention of researchers for several decades. The effect of microbes in our body is a relevant concern for health studies. There are plenty of studies using metagenomics which examine microorganisms that inhabit niches in the human body, sometimes causing disease, and are often correlated with multiple treatment conditions. No matter from which environment it comes, the analyses are often aimed at determining either the presence or absence of specific species of interest in a given metagenome or comparing the biological diversity and the functional activity of a wider range of microorganisms within their communities. The importance increases for comparison within different environments such as multiple patients with different conditions, multiple drugs, and multiple time points of same treatment or same patient. Thus, no matter how many hypotheses we have, we need a good understanding of genomics, bioinformatics, and statistics to work together to analyze and interpret these datasets in a meaningful way. This chapter provides an overview of different data analyses and statistical approaches (with example scenarios) to analyze metagenomics samples from different medical projects or clinical trials.
Affiliation(s)
- Suparna Mitra
- Leeds Institute of Medical Research, University of Leeds, Microbiology, Old Medical School, Leeds General Infirmary, Leeds LS1 3EX, West Yorkshire, UK.
15
LVQ-KNN: Composition-based DNA/RNA binning of short nucleotide sequences utilizing a prototype-based k-nearest neighbor approach. Virus Res 2018; 258:55-63. [PMID: 30291874 DOI: 10.1016/j.virusres.2018.10.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/29/2018] [Revised: 09/25/2018] [Accepted: 10/02/2018] [Indexed: 11/22/2022]
Abstract
Unbiased sequencing is an upcoming method to gain information on the microbiome in a sample and for the detection of unrecognized pathogens. Many software tools are available for the taxonomic classification of such metagenomics datasets. Many of them have a satisfactory sensitivity and specificity for known organisms, but they fail if the sample contains unknown organisms, which cannot be detected by similarity-based classification employing available databases. However, recognition of unknowns is especially important for the detection of newly emerging pathogens, which are often RNA viruses. Here we present the composition-based analysis tool LVQ-KNN for binning unclassified nucleotide sequence reads into their provenance classes DNA or RNA. With a 5-fold cross-validation, LVQ-KNN reached correct classification rates (CCR) of up to 99.9% for the classification into DNA/RNA. Real datasets gained CCRs of up to 94.5%. Comparing the method to another composition-based analysis tool, similar or better classification results were reached. LVQ-KNN is a new tool for DNA/RNA classification of sequence reads from unbiased sequencing approaches that could be applicable for the detection of yet unknown RNA viruses in metagenomic samples. The source code, training and test data for LVQ-KNN are available on GitHub (https://github.com/ab1989/LVQ-KNN).
16
Lambert C, Braxton C, Charlebois RL, Deyati A, Duncan P, La Neve F, Malicki HD, Ribrioux S, Rozelle DK, Michaels B, Sun W, Yang Z, Khan AS. Considerations for Optimization of High-Throughput Sequencing Bioinformatics Pipelines for Virus Detection. Viruses 2018; 10:E528. [PMID: 30262776 PMCID: PMC6213042 DOI: 10.3390/v10100528] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 08/23/2018] [Revised: 09/19/2018] [Accepted: 09/25/2018] [Indexed: 02/07/2023] Open
Abstract
High-throughput sequencing (HTS) has demonstrated capabilities for broad virus detection based upon discovery of known and novel viruses in a variety of samples, including clinical, environmental, and biological. An important goal for HTS applications in biologics is to establish parameter settings that can afford adequate sensitivity at an acceptable computational cost (computation time, computer memory, storage, expense or/and efficiency), at critical steps in the bioinformatics pipeline, including initial data quality assessment, trimming/cleaning, and assembly (to reduce data volume and increase likelihood of appropriate sequence identification). Additionally, the quality and reliability of the results depend on the availability of a complete and curated viral database for obtaining accurate results; selection of sequence alignment programs and their configuration, that retains specificity for broad virus detection with reduced false-positive signals; removal of host sequences without loss of endogenous viral sequences of interest; and use of a meaningful reporting format, which can retain critical information of the analysis for presentation of readily interpretable data and actionable results. Furthermore, after alignment, both automated and manual evaluation may be needed to verify the results and help assign a potential risk level to residual, unmapped reads. We hope that the collective considerations discussed in this paper aid toward optimization of data analysis pipelines for virus detection by HTS.
Affiliation(s)
- Robert L Charlebois
- Analytical Research and Development, Sanofi Pasteur, Toronto, ON M2R 3T4, Canada.
- Paul Duncan
- Merck & Co. Inc., West Point, PA 19486, USA.
- Brandye Michaels
- Analytical Research and Development: Microbiology, Pfizer Inc., Andover, MA 01810, USA.
- Zhihui Yang
- Office of Applied Research and Safety Assessment, Center for Food Safety and Applied Nutrition, U.S. Food and Drug Administration, Laurel, MD 20708, USA.
- Arifa S Khan
- Office of Vaccines Research and Review, Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA.
17
Ai D, Pan H, Huang R, Xia LC. CoreProbe: A Novel Algorithm for Estimating Relative Abundance Based on Metagenomic Reads. Genes (Basel) 2018; 9:E313. [PMID: 29925824 PMCID: PMC6027520 DOI: 10.3390/genes9060313] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/11/2018] [Revised: 06/12/2018] [Accepted: 06/13/2018] [Indexed: 11/16/2022] Open
Abstract
With the rapid development of high-throughput sequencing technology, the analysis of metagenomic sequencing data and the accurate and efficient estimation of relative microbial abundance have become important ways to explore the microbial composition and function of microbes. In addition, the accuracy and efficiency of the relative microbial abundance estimation are closely related to the algorithm and the selection of the reference sequence for sequence alignment. We introduced the microbial core genome as the reference sequence for potential microbes in a metagenomic sample, and we constructed a finite mixture and latent Dirichlet models and used the Gibbs sampling algorithm to estimate the relative abundance of microorganisms. The simulation results showed that our approach can improve the efficiency while maintaining high accuracy and is more suitable for high-throughput metagenomic data. The new approach was implemented in our CoreProbe package which provides a pipeline for an accurate and efficient estimation of the relative abundance of microbes in a community. This tool is available free of charge from the CoreProbe's website: Access the Docker image with the following instruction: sudo docker pull panhongfei/coreprobe:1.0.
Affiliation(s)
- Dongmei Ai
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, China.
- Hongfei Pan
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, China.
- Li C Xia
- Department of Medicine, Stanford University School of Medicine, 269 Campus Dr., Stanford, CA 94305, USA.
18
Almutairy M, Torng E. Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches. PLoS One 2018; 13:e0189960. [PMID: 29389989 PMCID: PMC5794061 DOI: 10.1371/journal.pone.0189960] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 05/08/2017] [Accepted: 12/05/2017] [Indexed: 01/20/2023] Open
Abstract
Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fixed sampling and minimizer sampling. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. We systematically compare fixed and minimizer sampling using the human genome as our database. We use the resulting k-mer indexes for fixed sampling and minimizer sampling to find all maximal exact matches between our database, the human genome, and three separate query sets, the mouse genome, the chimp genome, and an NGS data set. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries only slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. The reason is that although minimizer sampling is able to sample query k-mers, the number of shared k-mer occurrences that must be processed is much larger for minimizer sampling than fixed sampling. In conclusion, we argue that for any application where each shared k-mer occurrence must be processed, fixed sampling is the right sampling method.
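The two sampling schemes compared in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the window parameter `w`, and the lexicographic ordering used to pick minimizers are all assumptions made for illustration, and real indexes usually use hashed orderings and compact data structures.

```python
def kmers(seq, k):
    """Return all (position, k-mer) pairs of seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def fixed_sampling(seq, k, w):
    """Keep every w-th k-mer (positions 0, w, 2w, ...)."""
    return [(i, km) for i, km in kmers(seq, k) if i % w == 0]


def minimizer_sampling(seq, k, w):
    """Keep the smallest k-mer in every window of w consecutive k-mers.

    Assumption: 'smallest' means lexicographically smallest, with ties
    broken by the leftmost position.
    """
    all_kmers = kmers(seq, k)
    sampled = set()
    for start in range(len(all_kmers) - w + 1):
        window = all_kmers[start:start + w]
        pos, km = min(window, key=lambda p: (p[1], p[0]))
        sampled.add((pos, km))
    return sorted(sampled)


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    print(fixed_sampling(seq, k=3, w=2))
    print(minimizer_sampling(seq, k=3, w=2))
```

On this toy sequence, fixed sampling keeps four k-mers and minimizer sampling keeps six, which matches the direction of the space difference described in the abstract. The exact ratio depends on the sequence, `k`, `w`, and the ordering.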
Affiliation(s)
- Meznah Almutairy
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
- Department of Computer Science, College of Computer and Information Sciences, Imam Muhammad ibn Saud Islamic University, Riyadh, Saudi Arabia
- Eric Torng
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
19
Herath D, Tang SL, Tandon K, Ackland D, Halgamuge SK. CoMet: a workflow using contig coverage and composition for binning a metagenomic sample with high precision. BMC Bioinformatics 2017; 18:571. [PMID: 29297295 PMCID: PMC5751405 DOI: 10.1186/s12859-017-1967-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022] Open
Abstract
Background: In metagenomics, the separation of nucleotide sequences belonging to an individual or closely matched populations is termed binning. Binning helps the evaluation of underlying microbial population structure as well as the recovery of individual genomes from a sample of uncultivable microbial organisms. Both supervised and unsupervised learning methods have been employed in binning; however, characterizing a metagenomic sample containing multiple strains remains a significant challenge. In this study, we designed and implemented a new workflow, Coverage and composition based binning of Metagenomes (CoMet), for binning contigs in a single metagenomic sample. CoMet utilizes coverage values and the compositional features of metagenomic contigs. The binning strategy in CoMet includes the initial grouping of contigs in guanine-cytosine (GC) content-coverage space and refinement of bins in tetranucleotide frequencies space in a purely unsupervised manner. With CoMet, the clustering algorithm DBSCAN is employed for binning contigs. The performances of CoMet were compared against four existing approaches for binning a single metagenomic sample, including MaxBin, Metawatt, MyCC (default) and MyCC (coverage) using multiple datasets including a sample comprised of multiple strains.
Results: Binning methods based on both compositional features and coverages of contigs had higher performances than the method which is based only on compositional features of contigs. CoMet yielded higher or comparable precision in comparison to the existing binning methods on benchmark datasets of varying complexities. MyCC (coverage) had the highest ranking score in F1-score. However, the performances of CoMet were higher than MyCC (coverage) on the dataset containing multiple strains. Furthermore, CoMet recovered contigs of more species and was 18-39% higher in precision than the compared existing methods in discriminating species from the sample of multiple strains. CoMet resulted in higher precision than MyCC (default) and MyCC (coverage) on a real metagenome.
Conclusions: The approach proposed with CoMet for binning contigs improves the precision of binning while characterizing more species in a single metagenomic sample and in a sample containing multiple strains. The F1-scores obtained from different binning strategies vary with different datasets; however, CoMet yields the highest F1-score with a sample comprised of multiple strains.
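The compositional features named in this abstract, GC content and tetranucleotide frequencies, can be computed per contig as in the following Python sketch. This is an illustration under stated assumptions, not CoMet's code: the function names are hypothetical, tetramers containing ambiguous bases such as N are simply ignored, and reverse-complement tetramers are not merged, which some binning tools do.

```python
from collections import Counter
from itertools import product

# All 256 tetramers over the DNA alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(contig):
    """Fraction of G and C bases in the contig (0.0 for an empty contig)."""
    contig = contig.upper()
    if not contig:
        return 0.0
    return (contig.count("G") + contig.count("C")) / len(contig)


def tetranucleotide_frequencies(contig):
    """Return a 256-element vector of relative tetramer frequencies.

    Tetramers with non-ACGT characters are skipped (an assumption).
    """
    contig = contig.upper()
    counts = Counter(contig[i:i + 4] for i in range(len(contig) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


if __name__ == "__main__":
    contig = "ACGTACGTACGGCCTA"
    print(gc_content(contig))
    print(sum(tetranucleotide_frequencies(contig)))
```

Feature vectors like these, combined with per-contig coverage, are the kind of input a density-based clustering algorithm such as DBSCAN could then group into bins.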
Affiliation(s)
- Damayanthi Herath
- Department of Mechanical Engineering, The University of Melbourne, Parkville, Melbourne, 3010, Australia
- Department of Computer Engineering, University of Peradeniya, Prof. E. O. E. Pereira Mawatha, Peradeniya, 20400, Sri Lanka
- Sen-Lin Tang
- Biodiversity Research Center, Academia Sinica, Nan-Kang, Taipei, 11529, Taiwan
- Kshitij Tandon
- Biodiversity Research Center, Academia Sinica, Nan-Kang, Taipei, 11529, Taiwan
- Institute of Bioinformatics and Structural Biology, National Tsing Hua University, Hsinchu, 300, Taiwan
- Bioinformatics Program, Institute of Information Science, Taiwan International Graduate Program, Academia Sinica, Taipei, 115, Taiwan
- David Ackland
- Department of Biomedical Engineering, The University of Melbourne, Victoria, 3010, Australia
- Saman Kumara Halgamuge
- Research School of Engineering, College of Engineering and Computer Science, The Australian National University, Canberra ACT, 2601, Australia
20
Deiner K, Bik HM, Mächler E, Seymour M, Lacoursière-Roussel A, Altermatt F, Creer S, Bista I, Lodge DM, de Vere N, Pfrender ME, Bernatchez L. Environmental DNA metabarcoding: Transforming how we survey animal and plant communities. Mol Ecol 2017; 26:5872-5895. [PMID: 28921802 DOI: 10.1111/mec.14350] [Citation(s) in RCA: 627] [Impact Index Per Article: 89.6] [Received: 07/22/2016] [Revised: 08/31/2017] [Accepted: 09/05/2017] [Indexed: 12/14/2022]
Abstract
The genomic revolution has fundamentally changed how we survey biodiversity on earth. High-throughput sequencing ("HTS") platforms now enable the rapid sequencing of DNA from diverse kinds of environmental samples (termed "environmental DNA" or "eDNA"). Coupling HTS with our ability to associate sequences from eDNA with a taxonomic name is called "eDNA metabarcoding" and offers a powerful molecular tool capable of noninvasively surveying species richness from many ecosystems. Here, we review the use of eDNA metabarcoding for surveying animal and plant richness, and the challenges in using eDNA approaches to estimate relative abundance. We highlight eDNA applications in freshwater, marine and terrestrial environments, and in this broad context, we distill what is known about the ability of different eDNA sample types to approximate richness in space and across time. We provide guiding questions for study design and discuss the eDNA metabarcoding workflow with a focus on primers and library preparation methods. We additionally discuss important criteria for consideration of bioinformatic filtering of data sets, with recommendations for increasing transparency. Finally, looking to the future, we discuss emerging applications of eDNA metabarcoding in ecology, conservation, invasion biology, biomonitoring, and how eDNA metabarcoding can empower citizen science and biodiversity education.
Affiliation(s)
- Kristy Deiner
- Atkinson Center for a Sustainable Future, Department of Ecology and Evolutionary Biology, Cornell University, Ithaca, NY, USA
- Holly M Bik
- Department of Nematology, University of California, Riverside, CA, USA
- Elvira Mächler
- Eawag, Swiss Federal Institute of Aquatic Science and Technology, Department of Aquatic Ecology, Dübendorf, Switzerland
- Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zürich, Switzerland
- Mathew Seymour
- Molecular Ecology and Fisheries Genetics Laboratory, School of Biological Sciences, Environment Centre Wales Building, Bangor University, Bangor, Gwynedd, UK
- Florian Altermatt
- Eawag, Swiss Federal Institute of Aquatic Science and Technology, Department of Aquatic Ecology, Dübendorf, Switzerland
- Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zürich, Switzerland
- Simon Creer
- Molecular Ecology and Fisheries Genetics Laboratory, School of Biological Sciences, Environment Centre Wales Building, Bangor University, Bangor, Gwynedd, UK
- Iliana Bista
- Molecular Ecology and Fisheries Genetics Laboratory, School of Biological Sciences, Environment Centre Wales Building, Bangor University, Bangor, Gwynedd, UK
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK
- David M Lodge
- Atkinson Center for a Sustainable Future, Department of Ecology and Evolutionary Biology, Cornell University, Ithaca, NY, USA
- Natasha de Vere
- Conservation and Research Department, National Botanic Garden of Wales, Llanarthne, Carmarthenshire, UK
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Michael E Pfrender
- Department of Biological Sciences and Environmental Change Initiative, University of Notre Dame, Notre Dame, IN, USA
- Louis Bernatchez
- IBIS (Institut de Biologie Intégrative et des Systèmes), Université Laval, Québec, QC, Canada
21
Abstract
Microbiome analysis involves determining the composition and function of a community of microorganisms in a particular location. For the gastroenterologist, this technology opens up a rapidly evolving set of challenges and opportunities for generating novel insights into the health of patients on the basis of microbiota characterizations from intestinal, hepatic or extraintestinal samples. Alterations in gut microbiota composition correlate with intestinal and extraintestinal disease and, although only a few mechanisms are known, the microbiota are still an attractive target for developing biomarkers for disease detection and management as well as potential therapeutic applications. In this Review, we summarize the major decision points confronting new entrants to the field or for those designing new projects in microbiome research. We provide recommendations based on current technology options and our experience of sequencing platform choices. We also offer perspectives on future applications of microbiome research, which we hope convey the promise of this technology for clinical applications.
22
Abstract
A new world of possibilities for “virus discovery” was opened up with high-throughput sequencing becoming available in the last decade. Although metagenomic analysis was scientifically established before the era of high-throughput sequencing, the availability of the first second-generation sequencers was the kick-off for diagnosticians to use sequencing for the detection of novel pathogens. Today, diagnostic metagenomics is becoming the standard procedure for the detection and genetic characterization of new viruses or novel virus variants. Here, we provide an overview of technical considerations of high-throughput sequencing-based diagnostic metagenomics together with selected examples of “virus discovery” for animal diseases or zoonoses and metagenomics for food safety or basic veterinary research.
Affiliation(s)
- Dirk Höper
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut, Greifswald-Insel Riems, Germany.
- Claudia Wylezich
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut, Greifswald-Insel Riems, Germany
- Martin Beer
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut, Greifswald-Insel Riems, Germany
23
EnSVMB: Metagenomics Fragments Classification using Ensemble SVM and BLAST. Sci Rep 2017; 7:9440. [PMID: 28842700 PMCID: PMC5573435 DOI: 10.1038/s41598-017-09947-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 04/07/2017] [Accepted: 08/01/2017] [Indexed: 01/12/2023] Open
Abstract
Metagenomics brings in new discoveries and insights into the uncultured microbial world. One fundamental task in metagenomics analysis is to determine the taxonomy of raw sequence fragments. Modern sequencing technologies produce relatively short fragments and greatly increase the number of fragments, and thus make the taxonomic classification considerably more difficult than before. Therefore, fast and accurate techniques are needed to classify fragments at large scale. We propose EnSVM (Ensemble Support Vector Machine) and its advanced method called EnSVMB (EnSVM with BLAST) to accurately classify fragments. EnSVM divides fragments into a large confident (or small diffident) set, based on whether the fragments get consistent (or inconsistent) predictions from linear SVMs trained with different k-mers. An empirical study shows that the sensitivity and specificity of EnSVM on the confident set are higher than 90% and 97%, but on the diffident set are lower than 60% and 75%. To further improve the performance on the diffident set, EnSVMB takes advantage of the best hits of BLAST to reclassify fragments in that set. Experimental results show that EnSVM can efficiently and effectively divide fragments into confident and diffident sets, and that EnSVMB achieves higher accuracy, sensitivity and more true positives than related state-of-the-art methods and holds comparable specificity with the best of them.
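The confident/diffident split described in this abstract can be sketched in a few lines of Python. The abstract does not state the exact consistency rule, so this sketch assumes that "consistent" means unanimous agreement across all classifiers. The function name and the input layout (one label per classifier for each fragment) are hypothetical, and the classifiers themselves are not shown.

```python
def split_by_agreement(predictions):
    """Partition fragments by whether all classifiers agree.

    predictions: dict mapping fragment id -> list of predicted labels,
                 one label per classifier (e.g. SVMs trained with
                 different k-mer sizes).
    Returns (confident, diffident): confident maps fragment id to the
    agreed label; diffident lists fragment ids with disagreeing labels.
    """
    confident = {}
    diffident = []
    for frag_id, labels in predictions.items():
        if len(set(labels)) == 1:
            confident[frag_id] = labels[0]
        else:
            diffident.append(frag_id)
    return confident, diffident


if __name__ == "__main__":
    preds = {
        "read1": ["Firmicutes", "Firmicutes", "Firmicutes"],
        "read2": ["Firmicutes", "Proteobacteria", "Firmicutes"],
    }
    confident, diffident = split_by_agreement(preds)
    print(confident)
    print(diffident)
```

In the workflow the abstract describes, fragments in the diffident list would then be passed to a second stage (BLAST best hits in EnSVMB) for reclassification.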
24
Krishnamurthy SR, Wang D. Origins and challenges of viral dark matter. Virus Res 2017; 239:136-142. [DOI: 10.1016/j.virusres.2017.02.002] [Citation(s) in RCA: 141] [Impact Index Per Article: 20.1] [Received: 12/05/2016] [Revised: 01/31/2017] [Accepted: 02/06/2017] [Indexed: 02/07/2023]
25
Ferretti P, Farina S, Cristofolini M, Girolomoni G, Tett A, Segata N. Experimental metagenomics and ribosomal profiling of the human skin microbiome. Exp Dermatol 2017; 26:211-219. [DOI: 10.1111/exd.13210] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Accepted: 09/06/2016] [Indexed: 02/06/2023]
Affiliation(s)
- Pamela Ferretti
- Centre for Integrative Biology; University of Trento; Trento Italy
- Giampiero Girolomoni
- Section of Dermatology; Department of Medicine; University of Verona; Verona Italy
- Adrian Tett
- Centre for Integrative Biology; University of Trento; Trento Italy
- Nicola Segata
- Centre for Integrative Biology; University of Trento; Trento Italy
26
PaPrBaG: A machine learning approach for the detection of novel pathogens from NGS data. Sci Rep 2017; 7:39194. [PMID: 28051068 PMCID: PMC5209729 DOI: 10.1038/srep39194] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 07/08/2016] [Accepted: 11/18/2016] [Indexed: 12/20/2022] Open
Abstract
The reliable detection of novel bacterial pathogens from next-generation sequencing data is a key challenge for microbial diagnostics. Current computational tools usually rely on sequence similarity and often fail to detect novel species when closely related genomes are unavailable or missing from the reference database. Here we present the machine learning based approach PaPrBaG (Pathogenicity Prediction for Bacterial Genomes). PaPrBaG overcomes genetic divergence by training on a wide range of species with known pathogenicity phenotype. To that end we compiled a comprehensive list of pathogenic and non-pathogenic bacteria with human host, using various genome metadata in conjunction with a rule-based protocol. A detailed comparative study reveals that PaPrBaG has several advantages over sequence similarity approaches. Most importantly, it always provides a prediction whereas other approaches discard a large number of sequencing reads with low similarity to currently known reference genomes. Furthermore, PaPrBaG remains reliable even at very low genomic coverages. Combining PaPrBaG with existing approaches further improves prediction results.
27
Alaimo S, Marceca GP, Giugno R, Ferro A, Pulvirenti A. Current Knowledge and Computational Techniques for Grapevine Meta-Omics Analysis. Front Plant Sci 2017; 8:2241. [PMID: 29375610 PMCID: PMC5767322 DOI: 10.3389/fpls.2017.02241] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 07/16/2017] [Accepted: 12/20/2017] [Indexed: 05/03/2023]
Abstract
Growing grapevine (Vitis vinifera) is a key contribution to the economy of many countries. Tools provided by genomics and bioinformatics did help researchers in obtaining biological knowledge about the different cultivars. Several genetic markers for common diseases were identified. Recently, the impact of microbiome has been proved to be of fundamental importance both in humans and in plants for its ability to confer protection or induce diseases. In this review we report current knowledge about grapevine microbiome, together with a description of the available computational methodologies for meta-omics analysis.
Affiliation(s)
- Salvatore Alaimo
- Bioinformatics Unit, Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
- Gioacchino P. Marceca
- Bioinformatics Unit, Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
- Rosalba Giugno
- Department of Computer Science, University of Verona, Verona, Italy
- Alfredo Ferro
- Bioinformatics Unit, Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
- Alfredo Pulvirenti
- Bioinformatics Unit, Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
- *Correspondence: Alfredo Pulvirenti
28
Significant loss of sensitivity and specificity in the taxonomic classification occurs when short 16S rRNA gene sequences are used. Heliyon 2016; 2:e00170. [PMID: 27699286 PMCID: PMC5037269 DOI: 10.1016/j.heliyon.2016.e00170] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 06/28/2016] [Revised: 08/24/2016] [Accepted: 09/21/2016] [Indexed: 01/09/2023] Open
Abstract
The classification performance of Kraken was evaluated in terms of sensitivity and specificity when using short and long 16S rRNA sequences. A total of 440,738 sequences from bacteria with complete taxonomic classifications were downloaded from the high quality ribosomal RNA database SILVA. Amplicons produced (86,371 sequences; 1450 bp) by virtual PCR with primers covering the V1–V9 region of the 16S rRNA gene were used as reference. Virtual PCRs of internal fragments V3–V4, V4–V5 and V3–V5 were performed. A total of 81,523, 82,334 and 82,998 amplicons were obtained for regions V3–V4, V4–V5 and V3–V5 respectively. Differences in depth of taxonomic classification were detected among the internal fragments. For instance, sensitivity and specificity of sequences classified up to subspecies level were higher when the largest internal fraction (V3–V5) was used (54.0 and 74.6% respectively), compared to V3–V4 (45.1 and 66.7%) and V4–V5 (41.8 and 64.6%) fragments. A similar pattern was detected for sequences classified up to more superficial taxonomic categories (i.e. family, order, class…). Results also demonstrate that internal fragments lost specificity and some could be misclassified at the deepest taxonomic levels (i.e. species or subspecies). It is concluded that the larger V3–V5 fragment could be considered for massive high throughput sequencing, reducing the loss of sensitivity and specificity.
29
Lam TTY, Zhu H, Guan Y, Holmes EC. Genomic Analysis of the Emergence, Evolution, and Spread of Human Respiratory RNA Viruses. Annu Rev Genomics Hum Genet 2016; 17:193-218. [PMID: 27216777 DOI: 10.1146/annurev-genom-083115-022628] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Indexed: 02/05/2023]
Abstract
The emergence and reemergence of rapidly evolving RNA viruses, particularly those responsible for respiratory diseases such as influenza viruses and coronaviruses, pose a significant threat to global health, including the potential of major pandemics. Importantly, recent advances in high-throughput genome sequencing enable researchers to reveal the genomic diversity of these viral pathogens at much lower cost and with much greater precision than they could before. In particular, the genome sequence data generated allow inferences to be made on the molecular basis of viral emergence, evolution, and spread in human populations in real time. In this review, we introduce recent computational methods that analyze viral genomic data, particularly in combination with metadata such as sampling time, geographic location, and virulence. We then outline the insights these analyses have provided into the fundamental patterns and processes of evolution and emergence in human respiratory RNA viruses, as well as the major challenges in such genomic analyses.
Affiliation(s)
- Tommy T-Y Lam
- State Key Laboratory of Emerging Infectious Diseases and Centre of Influenza Research, School of Public Health, The University of Hong Kong, Hong Kong, China
- Joint Influenza Research Center and Joint Institute of Virology, Shantou University Medical College, Shantou 515041, China
- State Key Laboratory of Emerging Infectious Diseases (HKU-Shenzhen Branch), Shenzhen Third People's Hospital, Shenzhen 518112, China
| | - Huachen Zhu
- State Key Laboratory of Emerging Infectious Diseases and Centre of Influenza Research, School of Public Health, The University of Hong Kong, Hong Kong, China
- Joint Influenza Research Center and Joint Institute of Virology, Shantou University Medical College, Shantou 515041, China
- State Key Laboratory of Emerging Infectious Diseases (HKU-Shenzhen Branch), Shenzhen Third People's Hospital, Shenzhen 518112, China
| | - Yi Guan
- State Key Laboratory of Emerging Infectious Diseases and Centre of Influenza Research, School of Public Health, The University of Hong Kong, Hong Kong, China
- Joint Influenza Research Center and Joint Institute of Virology, Shantou University Medical College, Shantou 515041, China
- State Key Laboratory of Emerging Infectious Diseases (HKU-Shenzhen Branch), Shenzhen Third People's Hospital, Shenzhen 518112, China
- Department of Microbiology, Guangxi Medical University, Nanning 530021, China
| | - Edward C Holmes
- Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Life and Environmental Sciences and Sydney Medical School, The University of Sydney, Sydney, New South Wales 2006, Australia
| |
|
30
|
Wang Y, Hu H, Li X. MBMC: An Effective Markov Chain Approach for Binning Metagenomic Reads from Environmental Shotgun Sequencing Projects. OMICS: A Journal of Integrative Biology 2016; 20:470-9. [PMID: 27447888 DOI: 10.1089/omi.2016.0081] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022]
Abstract
Metagenomics is a next-generation omics field currently impacting postgenomic life sciences and medicine. Binning metagenomic reads is essential for the understanding of microbial function, compositions, and interactions in given environments. Despite the existence of dozens of computational methods for metagenomic read binning, it is still very challenging to bin reads. This is especially true for reads from unknown species, from species with similar abundance, and/or from low-abundance species in environmental samples. In this study, we developed a novel taxonomy-dependent and alignment-free approach called MBMC (Metagenomic Binning by Markov Chains). Different from all existing methods, MBMC bins reads by measuring the similarity of reads to the trained Markov chains for different taxa instead of directly comparing reads with known genomic sequences. By testing on more than 24 simulated and experimental datasets with species of similar abundance, species of low abundance, and/or unknown species, we report here that MBMC reliably grouped reads from different species into separate bins. Compared with four existing approaches, we demonstrated that the performance of MBMC was comparable with existing approaches when binning reads from sequenced species, and superior to existing approaches when binning reads from unknown species. MBMC is a pivotal tool for binning metagenomic reads in the current era of Big Data and postgenomic integrative biology. The MBMC software can be freely downloaded at http://hulab.ucf.edu/research/projects/metagenomics/MBMC.html.
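The idea of scoring a read against per-taxon Markov chains, rather than aligning it to reference genomes, can be illustrated with a minimal sketch. This is not the MBMC implementation: the chain order, the add-one smoothing, the uniform fallback for unseen contexts, and all function names (`train_markov_chain`, `log_likelihood`, `assign_read`) are assumptions made for illustration only.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_chain(sequences, order=2):
    """Estimate log transition probabilities P(next base | previous `order` bases).

    Add-one smoothing is an assumption of this sketch, not a detail taken from MBMC.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - order):
            context, nxt = seq[i:i + order], seq[i + order]
            # Skip windows containing ambiguous bases such as N.
            if nxt in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][nxt] += 1
    model = {}
    for context, nxt_counts in counts.items():
        total = sum(nxt_counts.values()) + len(ALPHABET)
        model[context] = {
            base: math.log((nxt_counts.get(base, 0) + 1) / total)
            for base in ALPHABET
        }
    return model


def log_likelihood(read, model, order=2):
    """Sum the log transition probabilities along the read.

    Contexts or bases never seen in training fall back to a uniform probability.
    """
    uniform = math.log(1.0 / len(ALPHABET))
    score = 0.0
    for i in range(len(read) - order):
        context, nxt = read[i:i + order], read[i + order]
        score += model.get(context, {}).get(nxt, uniform)
    return score


def assign_read(read, models, order=2):
    """Return the taxon whose trained chain gives the read the highest log-likelihood."""
    return max(models, key=lambda taxon: log_likelihood(read, models[taxon], order))


# Toy usage with two hypothetical "taxa" of very different composition.
models = {
    "taxon_AT": train_markov_chain(["ATATATATATATATAT"]),
    "taxon_GC": train_markov_chain(["GCGCGCGCGCGCGCGC"]),
}
print(assign_read("ATATATAT", models))
```

Because the chains summarize composition rather than exact sequence, a read from a genome that is absent from the reference set can still score well against a related taxon, which is consistent with the abstract's emphasis on unknown species.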
Affiliation(s)
- Ying Wang
- Department of Computer Science, University of Central Florida, Orlando, Florida
| | - Haiyan Hu
- Department of Computer Science, University of Central Florida, Orlando, Florida
| | - Xiaoman Li
- Burnett School of Biomedical Science, University of Central Florida, Orlando, Florida
| |
|
31
|
Gupta A, Kumar S, Prasoodanan VPK, Harish K, Sharma AK, Sharma VK. Reconstruction of Bacterial and Viral Genomes from Multiple Metagenomes. Front Microbiol 2016; 7:469. [PMID: 27148174 PMCID: PMC4828583 DOI: 10.3389/fmicb.2016.00469] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 11/25/2015] [Accepted: 03/21/2016] [Indexed: 11/13/2022] Open
Abstract
Several metagenomic projects have been accomplished or are in progress. However, in most cases, it is not feasible to generate complete genomic assemblies of species from the metagenomic sequencing of a complex environment. Only a few studies have reported the reconstruction of bacterial genomes from complex metagenomes. In this work, a Binning-Assembly approach is proposed and demonstrated for the reconstruction of bacterial and viral genomes from 72 human gut metagenomic datasets. A total of 1156 bacterial genomes belonging to 219 bacterial families and 279 viral genomes belonging to 84 viral families could be identified. Draft genome sequences more than 80% complete could be reconstructed for a total of 126 bacterial and 11 viral genomes. Selected draft assembled genomes could be validated with 99.8% accuracy using their ORFs. The study provides useful information on the assembly expected for a species given its number of reads and abundance. This approach, along with spiking, was also demonstrated to be useful in improving the draft assembly of a bacterial genome. The Binning-Assembly approach can be successfully used to reconstruct bacterial and viral genomes from multiple metagenomic datasets obtained from similar environments.
Affiliation(s)
- Ankit Gupta
- Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, India
| | - Sanjiv Kumar
- Department of Medicine, University of Connecticut Health Center, Farmington, CT, USA
| | - Vishnu P K Prasoodanan
- Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, India
| | - K Harish
- Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, India
| | - Ashok K Sharma
- Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, India
| | - Vineet K Sharma
- Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, India
| |
|
32
|
Langenkämper D, Jakobi T, Feld D, Jelonek L, Goesmann A, Nattkemper TW. Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations. Front Genet 2016; 7:5. [PMID: 26904094 PMCID: PMC4748744 DOI: 10.3389/fgene.2016.00005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/12/2015] [Accepted: 01/17/2016] [Indexed: 12/27/2022] Open
Abstract
In recent years, clock rates of modern processors have stagnated while the demand for computing power has continued to grow. This applies particularly to the life sciences and bioinformatics, where new technologies keep producing rapidly growing piles of raw data. The number of cores per processor has increased in an attempt to compensate for the only slight increments in clock rates. This technological shift demands changes in software development, especially in high-performance computing, where parallelization techniques are gaining importance because of the large datasets generated by, e.g., modern genomics. This paper presents an overview of state-of-the-art manual and automatic acceleration techniques and lists some applications employing these in different areas of sequence informatics. Furthermore, we provide examples of automatic acceleration for two use cases to show typical problems and gains of transforming a serial application into a parallel one. The paper should aid the reader in choosing a technique for the problem at hand. We compare four state-of-the-art automatic acceleration approaches (OpenMP, PluTo-SICA, PPCG, and OpenACC) and discuss their performance as well as their applicability to the selected use cases. While optimizations targeting the CPU worked better in the complex k-mer use case, optimizers for Graphics Processing Units (GPUs) performed better in the matrix multiplication example. However, their performance is superior only at certain problem sizes because of data migration overhead. We show that automatic code parallelization is feasible with current compiler software and yields significant increases in execution speed. Automatic optimizers for the CPU are mature and usually require no additional manual adjustment. In contrast, some automatic parallelizers targeting GPUs still lack maturity and are limited to simple statements and structures.
Affiliation(s)
- Daniel Langenkämper
- Biodata Mining Group, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Tobias Jakobi
- Sektion für Bioinformatik und Systemkardiologie, Universitätsklinikum Heidelberg, Heidelberg, Germany
| | | | - Lukas Jelonek
- Bioinformatik und Systembiologie, Justus Liebig University Gießen, Germany
| | - Alexander Goesmann
- Bioinformatik und Systembiologie, Justus Liebig University Gießen, Germany
| | - Tim W Nattkemper
- Biodata Mining Group, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| |
|
33
|
Le VV, Tran LV, Tran HV. A novel semi-supervised algorithm for the taxonomic assignment of metagenomic reads. BMC Bioinformatics 2016; 17:22. [PMID: 26740458 PMCID: PMC4702387 DOI: 10.1186/s12859-015-0872-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 07/30/2015] [Accepted: 12/22/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Taxonomic assignment is a crucial step in a metagenomic project, which aims to identify the origin of sequences in an environmental sample. Because composition-based algorithms are not sufficient for classifying short reads, recent algorithms use only similarity features or combine similarity with other features. However, those algorithms are computationally expensive because similarity search is very time-consuming. In addition, the limited similarity information between short reads and reference sequences significantly reduces classification quality. RESULTS This paper presents a novel taxonomic assignment algorithm, called SeMeta, which is based on semi-supervised learning and produces a fast and highly accurate classification of short reads with sufficient mutual overlap. The proposed algorithm first separates reads into clusters using their composition features. It then labels the clusters by applying an efficient filtering technique to the results of the similarity search between their reads and reference databases. Furthermore, instead of performing the similarity search for all reads in the clusters, SeMeta performs it only for reads in subgroups selected using sequence overlap information. The experimental results demonstrate that SeMeta outperforms two other similarity-based algorithms in several respects. CONCLUSIONS By using a semi-supervised method and taking advantage of various features, the proposed algorithm not only achieves high classification quality but also substantially reduces computational cost. The source code of the algorithm can be downloaded at http://it.hcmute.edu.vn/bioinfo/metapro/SeMeta.html.
Affiliation(s)
- Vinh Van Le
- Faculty of Computer Science and Engineering, HCMC University of Technology, 268 Ly Thuong Kiet, Q10, HCM City, Vietnam.
- Faculty of Information Technology, HCMC University of Technology and Education, 1 Vo Van Ngan, Thu Duc, HCM City, Vietnam.
| | - Lang Van Tran
- Institute of Applied Mechanics and Informatics, Vietnam Academy of Science and Technology, 01 Mac Dinh Chi, Q1, HCM City, Vietnam.
- Faculty of Information Technology, Lac Hong University, 10 Huynh Van Nghe, Bien Hoa, Dong Nai, Vietnam.
| | - Hoai Van Tran
- Faculty of Computer Science and Engineering, HCMC University of Technology, 268 Ly Thuong Kiet, Q10, HCM City, Vietnam.
| |
|
34
|
Peabody MA, Van Rossum T, Lo R, Brinkman FSL. Evaluation of shotgun metagenomics sequence classification methods using in silico and in vitro simulated communities. BMC Bioinformatics 2015; 16:363. [PMID: 26537885 PMCID: PMC4634789 DOI: 10.1186/s12859-015-0788-5] [Citation(s) in RCA: 90] [Impact Index Per Article: 10.0] [Received: 06/26/2015] [Accepted: 10/20/2015] [Indexed: 01/14/2023] Open
Abstract
Background The field of metagenomics (study of genetic material recovered directly from an environment) has grown rapidly, with many bioinformatics analysis methods being developed. To ensure appropriate use of such methods, robust comparative evaluation of their accuracy and features is needed. For taxonomic classification of sequence reads, such evaluation should include use of clade exclusion, which better evaluates a method’s accuracy when identical sequences are not present in any reference database, as is common in metagenomic analysis. To date, relatively small evaluations have been performed, with evaluation approaches like clade exclusion limited to assessment of new methods by the authors of the given method. What is needed is a rigorous, independent comparison between multiple major methods, using the same in silico and in vitro test datasets, with and without approaches like clade exclusion, to better characterize accuracy under different conditions. Results An overview of the features of 38 bioinformatics methods is provided, evaluating accuracy with a focus on 11 programs that have reference databases that can be modified and therefore most robustly evaluated with clade exclusion. Taxonomic classification of sequence reads was evaluated using both in silico and in vitro mock bacterial communities. Clade exclusion was used at taxonomic levels from species to class—identifying how well methods perform in progressively more difficult scenarios. A wide range of variability was found in the sensitivity, precision, overall accuracy, and computational demand for the programs evaluated. In experiments where distilled water was spiked with only 11 bacterial species, frequently dozens to hundreds of species were falsely predicted by the most popular programs. The different features of each method (forces predictions or not, etc.) are summarized, and additional analysis considerations discussed. 
Conclusions The accuracy of shotgun metagenomics classification methods varies widely. No one program clearly outperformed others in all evaluation scenarios; rather, the results illustrate the strengths of different methods for different purposes. Researchers must appreciate method differences, choosing the program best suited for their particular analysis to avoid very misleading results. Use of standardized datasets for method comparisons is encouraged, as is use of mock microbial community controls suitable for a particular metagenomic analysis.
Affiliation(s)
- Michael A Peabody
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada.
| | - Thea Van Rossum
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada.
| | - Raymond Lo
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada.
| | - Fiona S L Brinkman
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada.
| |
|
35
|
Ju F, Zhang T. Experimental Design and Bioinformatics Analysis for the Application of Metagenomics in Environmental Sciences and Biotechnology. Environmental Science & Technology 2015; 49:12628-40. [PMID: 26451629 DOI: 10.1021/acs.est.5b03719] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 05/24/2023]
Abstract
Recent advances in DNA sequencing technologies have prompted the widespread application of metagenomics for the investigation of novel bioresources (e.g., industrial enzymes and bioactive molecules) and unknown biohazards (e.g., pathogens and antibiotic resistance genes) in natural and engineered microbial systems across multiple disciplines. This review discusses the rigorous experimental design and sample preparation in the context of applying metagenomics in environmental sciences and biotechnology. Moreover, this review summarizes the principles, methodologies, and state-of-the-art bioinformatics procedures, tools and database resources for metagenomics applications and discusses two popular strategies (analysis of unassembled reads versus assembled contigs/draft genomes) for quantitative or qualitative insights of microbial community structure and functions. Overall, this review aims to facilitate more extensive application of metagenomics in the investigation of uncultured microorganisms, novel enzymes, microbe-environment interactions, and biohazards in biotechnological applications where microbial communities are engineered for bioenergy production, wastewater treatment, and bioremediation.
Affiliation(s)
- Feng Ju
- Environmental Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China
| | - Tong Zhang
- Environmental Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China
| |
|
36
|
Aflitos SA, Severing E, Sanchez-Perez G, Peters S, de Jong H, de Ridder D. Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data. BMC Bioinformatics 2015; 16:352. [PMID: 26525298 PMCID: PMC4630969 DOI: 10.1186/s12859-015-0806-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 07/11/2015] [Accepted: 10/29/2015] [Indexed: 12/05/2022] Open
Abstract
Background Identification of biological specimens is a requirement for a range of applications. Reference-free methods analyse unprocessed sequencing data without relying on prior knowledge, but generally do not scale to arbitrarily large genomes and arbitrarily large phylogenetic distances. Results We present Cnidaria, a practical tool for clustering genomic and transcriptomic data with no limitation on genome size or phylogenetic distances. We simultaneously clustered 169 genomic and transcriptomic datasets from 4 kingdoms, achieving 100% identification accuracy at supra-species level and 78% accuracy at the species level. Conclusion Cnidaria allows for fast, resource-efficient comparison and identification of both raw and assembled genome and transcriptome data. This can help answer both fundamental questions (e.g. in phylogeny, ecological diversity analysis) and practical questions (e.g. sequencing quality control, primer design).
Affiliation(s)
- Saulo Alves Aflitos
- Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands.
- Bioinformatics Group, Department of Plant Sciences, Wageningen University, Wageningen, The Netherlands.
| | - Edouard Severing
- Laboratory of Genetics, Wageningen University, Wageningen, The Netherlands.
| | - Gabino Sanchez-Perez
- Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands.
- Bioinformatics Group, Department of Plant Sciences, Wageningen University, Wageningen, The Netherlands.
| | - Sander Peters
- Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands.
| | - Hans de Jong
- Laboratory of Genetics, Wageningen University, Wageningen, The Netherlands.
| | - Dick de Ridder
- Bioinformatics Group, Department of Plant Sciences, Wageningen University, Wageningen, The Netherlands.
| |
|
37
|
MetaObtainer: A Tool for Obtaining Specified Species from Metagenomic Reads of Next-generation Sequencing. Interdiscip Sci 2015; 7:405-13. [PMID: 26293485 DOI: 10.1007/s12539-015-0281-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/02/2014] [Revised: 07/26/2014] [Accepted: 08/07/2014] [Indexed: 10/23/2022]
Abstract
Read classification is a fundamental problem in metagenomics. With the development of next-generation sequencing (NGS), metagenome samples can be generated with much less money and time. However, the short reads generated by NGS make read classification much more difficult than before. None of the existing tools can accurately assign NGS short reads to each genome, which limits their use in real applications. Fortunately, in many applications it is unnecessary to separate all the species in a metagenome sample from each other, because we usually focus only on some specified species categories in the sample and do not care about the others. No existing tool is designed specifically for obtaining specified species from short metagenome reads generated by NGS. In this paper, we propose a tool named MetaObtainer to obtain the specified species from NGS short reads. The tool combines some of the newest technologies for processing short reads, so it can perform better than other tools. It can (1) handle NGS reads shorter than 100 bp with very high accuracy (both precision and recall exceed 90%); (2) find unknown species using the reference genomes of similar species; (3) perform well when reads of the specified species are very few in the dataset; (4) handle genomes of similar abundance levels as well as different abundance levels (1:10); and (5) obtain multiple species categories from a metagenome sample.
|
38
|
Junier T, Hervé V, Wunderlin T, Junier P. MLgsc: A Maximum-Likelihood General Sequence Classifier. PLoS One 2015; 10:e0129384. [PMID: 26148002 PMCID: PMC4492669 DOI: 10.1371/journal.pone.0129384] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/30/2014] [Accepted: 05/07/2015] [Indexed: 11/28/2022] Open
Abstract
We present a software package for classifying protein or nucleotide sequences to user-specified sets of reference sequences. The software trains a model using a multiple sequence alignment and a phylogenetic tree, both supplied by the user. The latter is used to guide model construction and as a decision tree to speed up the classification process. The software was evaluated on all the 16S rRNA gene sequences of the reference dataset found in the GreenGenes database. On this dataset, the software was shown to achieve an error rate of around 1% at genus level. Examples of applications based on the nitrogenase subunit NifH gene and a protein-coding gene found in endospore-forming Firmicutes are also presented. The programs in the package have a simple, straightforward command-line interface for the Unix shell, and are free and open-source. The package has minimal dependencies and thus can be easily integrated in command-line based classification pipelines.
Affiliation(s)
- Thomas Junier
- Laboratory of Microbiology, University of Neuchâtel, Neuchâtel, Neuchâtel, Switzerland
- Vital-IT Group, Swiss Institute of Bioinformatics, Lausanne, Vaud, Switzerland
| | - Vincent Hervé
- Laboratory of Microbiology, University of Neuchâtel, Neuchâtel, Neuchâtel, Switzerland
- Laboratory of Biogeosciences, Institute of Earth Sciences, University of Lausanne, Lausanne, Vaud, Switzerland
| | - Tina Wunderlin
- Laboratory of Microbiology, University of Neuchâtel, Neuchâtel, Neuchâtel, Switzerland
| | - Pilar Junier
- Laboratory of Microbiology, University of Neuchâtel, Neuchâtel, Neuchâtel, Switzerland
| |
|
39
|
Lee J, Lee HT, Hong WY, Jang E, Kim J. FCMM: A comparative metagenomic approach for functional characterization of multiple metagenome samples. J Microbiol Methods 2015; 115:121-8. [PMID: 26027543 DOI: 10.1016/j.mimet.2015.05.023] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/25/2015] [Revised: 05/18/2015] [Accepted: 05/26/2015] [Indexed: 02/01/2023]
Abstract
Next-generation sequencing (NGS) technologies make it possible to obtain the entire genomic content of microorganisms in metagenome samples. Thus, many studies have developed methods for the processing and analysis of metagenomic NGS reads, including analyses for predicting functions and their enrichments in environmental metagenome samples. Especially, comparative functional studies by using multi-metagenome samples are essential for identifying and comparing different characteristics of multiple environmental samples. In this paper, we introduce a pipeline for functional characterization of multiple metagenome samples to infer major functions as well as their quantitative scores in a comparative metagenomics manner. The pipeline performs the annotation of functions related to expected proteins in the metagenome samples, calculates their enrichment scores based on the reads per kilobase per million reads (RPKM) measure, and predicts the relative abundance of associated functions by a statistical test. The results from single sample analysis are then used to find common and sample-specific major functions. By applying the pipeline to six different environmental metagenome samples, including two ocean (Antarctica aquatic and Baltic Sea) and four terrestrial (Acid mine drainage, human gut microbiome, Amazon River, and Wasca soil) samples, we were able to predict common functions as well as environment-specific functions. Our pipeline is available at http://bioinfo.konkuk.ac.kr/FCMM/.
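The reads per kilobase per million reads (RPKM) measure mentioned in the abstract is a standard normalization, and a short sketch may make the arithmetic concrete. The function below follows the usual textbook definition; how the FCMM pipeline itself counts reads, defines feature length, or handles edge cases is not stated in the abstract, so the function name and argument names are illustrative assumptions.

```python
def rpkm(mapped_reads, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    RPKM = mapped_reads / (feature_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = mapped_reads * 1e9 / (feature_length_bp * total_mapped_reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total read count must be positive")
    return mapped_reads * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads hit a 2,000 bp feature in a sample of 10 million reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Normalizing by both feature length and sequencing depth is what allows enrichment scores to be compared across samples of different sizes, which the comparative step of such a pipeline depends on.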
Affiliation(s)
- Jongin Lee
- Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea
| | - Hoon Taek Lee
- Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea
| | - Woon-young Hong
- Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea
| | - Eunji Jang
- Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea
| | - Jaebum Kim
- Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Republic of Korea.
| |
|
40
|
Sankar SA, Lagier JC, Pontarotti P, Raoult D, Fournier PE. The human gut microbiome, a taxonomic conundrum. Syst Appl Microbiol 2015; 38:276-86. [DOI: 10.1016/j.syapm.2015.03.004] [Citation(s) in RCA: 69] [Impact Index Per Article: 7.7] [Received: 12/09/2014] [Revised: 03/17/2015] [Accepted: 03/18/2015] [Indexed: 01/16/2023]
|
41
|
Oulas A, Pavloudi C, Polymenakou P, Pavlopoulos GA, Papanikolaou N, Kotoulas G, Arvanitidis C, Iliopoulos I. Metagenomics: tools and insights for analyzing next-generation sequencing data derived from biodiversity studies. Bioinform Biol Insights 2015; 9:75-88. [PMID: 25983555 PMCID: PMC4426941 DOI: 10.4137/bbi.s12462] [Citation(s) in RCA: 177] [Impact Index Per Article: 19.7] [Received: 12/05/2014] [Revised: 03/09/2015] [Accepted: 03/13/2015] [Indexed: 12/14/2022] Open
Abstract
Advances in next-generation sequencing (NGS) have allowed significant breakthroughs in microbial ecology studies. This has led to the rapid expansion of research in the field and the establishment of "metagenomics", often defined as the analysis of DNA from microbial communities in environmental samples without prior need for culturing. Many metagenomics statistical/computational tools and databases have been developed in order to allow the exploitation of the huge influx of data. In this review article, we provide an overview of the sequencing technologies and how they are uniquely suited to various types of metagenomic studies. We focus on the currently available bioinformatics techniques, tools, and methodologies for performing each individual step of a typical metagenomic dataset analysis. We also provide future trends in the field with respect to tools and technologies currently under development. Moreover, we discuss data management, distribution, and integration tools that are capable of performing comparative metagenomic analyses of multiple datasets using well-established databases, as well as commonly used annotation standards.
Affiliation(s)
- Anastasis Oulas
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
| | - Christina Pavloudi
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
- Department of Biology, University of Ghent, Ghent, Belgium
- Department of Microbial Ecophysiology, University of Bremen, Bremen, Germany
| | - Paraskevi Polymenakou
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
| | - Georgios A Pavlopoulos
- Division of Basic Sciences, University of Crete, Medical School, Heraklion, Crete, Greece
| | - Nikolas Papanikolaou
- Division of Basic Sciences, University of Crete, Medical School, Heraklion, Crete, Greece
| | - Georgios Kotoulas
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
| | - Christos Arvanitidis
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
| | - Ioannis Iliopoulos
- Division of Basic Sciences, University of Crete, Medical School, Heraklion, Crete, Greece
| |
|
42
|
Kawulok J, Deorowicz S. CoMeta: classification of metagenomes using k-mers. PLoS One 2015; 10:e0121453. [PMID: 25884504 PMCID: PMC4401624 DOI: 10.1371/journal.pone.0121453] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 06/24/2014] [Accepted: 02/15/2015] [Indexed: 02/07/2023] Open
Abstract
The study of environmental samples is developing rapidly. Characterizing the composition of an environment broadens our knowledge of the relationship between species composition and environmental conditions. An important step in determining sample composition is comparing the extracted DNA fragments with sequences derived from known organisms. In this paper, we introduce an algorithm called CoMeta (Classification of metagenomes), which assigns a query read (a DNA fragment) to one of the groups previously prepared by the user. Typically, a group is one of the taxonomic ranks (e.g., phylum, genus); however, the prepared groups may also contain sequences with various functions. In CoMeta, we used an exact method for read classification based on short subsequences (k-mers) and a fast program for indexing large sets of k-mers. In contrast to the most popular methods based on BLAST, where the query is compared with each reference sequence, we begin the classification from the top of the taxonomy tree to reduce the number of comparisons. The presented experimental study confirms that CoMeta outperforms other programs used in this context. CoMeta is available at https://github.com/jkawulok/cometa under a free GNU GPL 2 license.
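The exact k-mer matching idea described in this abstract can be illustrated with a minimal Python sketch. It is not CoMeta's implementation: the function names (`kmers`, `build_index`, `classify`), the scoring rule (fraction of a read's k-mers found in a group), and the flat group layout are all assumptions made for illustration, and the top-down descent of the taxonomy tree is omitted.

```python
def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(groups, k):
    """Map each group name to the set of k-mers found in its reference sequences.

    `groups` is a hypothetical dict: group name -> list of reference sequences.
    """
    return {
        name: {km for seq in seqs for km in kmers(seq, k)}
        for name, seqs in groups.items()
    }


def classify(read, index, k):
    """Return (best_group, score) for a read.

    The score is the fraction of the read's k-mers that occur exactly in the
    group's k-mer set; this rule is an assumption, not CoMeta's scoring.
    """
    read_kmers = list(kmers(read, k))
    if not read_kmers:
        # Read is shorter than k, so nothing can be matched.
        return None, 0.0
    scores = {
        name: sum(km in group_kmers for km in read_kmers) / len(read_kmers)
        for name, group_kmers in index.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy usage with made-up sequences.
groups = {"group_A": ["ACGTACGTAC"], "group_B": ["TTTTGGGGCC"]}
index = build_index(groups, k=4)
print(classify("ACGTAC", index, k=4))
```

In a hierarchical setting, the same scoring could be applied first to coarse groups (e.g., phyla) and then only to the children of the winning group, which is the kind of comparison reduction the abstract describes.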
Affiliation(s)
- Jolanta Kawulok
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| | - Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| |
|
43
|
Zhang R, Cheng Z, Guan J, Zhou S. Exploiting topic modeling to boost metagenomic reads binning. BMC Bioinformatics 2015; 16 Suppl 5:S2. [PMID: 25859745 PMCID: PMC4402587 DOI: 10.1186/1471-2105-16-s5-s2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND With the rapid development of high-throughput technologies, researchers can sequence the whole metagenome of a microbial community sampled directly from the environment. The assignment of these metagenomic reads to different species or taxonomic classes, referred to as binning of metagenomic data, is a vital step in metagenomic analysis. RESULTS In this paper, we propose a new method, TM-MCluster, for binning metagenomic reads. First, we represent each metagenomic read as a set of "k-mers" together with their frequencies in the read. Then, we apply a probabilistic topic model, Latent Dirichlet Allocation (LDA), to the reads; it generates a number of hidden "topics" such that each read can be represented by a distribution vector over the generated topics. Finally, as in the MCluster method, we apply SKWIC, a variant of the classical K-means algorithm with an automatic feature weighting mechanism, to cluster the reads represented by topic distributions. CONCLUSIONS Experiments show that the new method TM-MCluster outperforms major existing methods, including AbundanceBin, MetaCluster 3.0/5.0 and MCluster. This result indicates that the exploitation of topic modeling can effectively improve the binning performance of metagenomic reads.
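The first step named in this abstract, representing a read by its k-mer frequencies, can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names, the DNA alphabet `ACGT`, the normalization to relative frequencies, and the choice to skip k-mers containing other symbols (such as `N`) are not taken from the paper, and the LDA and SKWIC steps are not shown.

```python
from itertools import product


def kmer_vocabulary(k, alphabet="ACGT"):
    """List all possible k-mers over the alphabet in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_frequency_vector(read, k, vocab_index):
    """Return the relative k-mer frequencies of a read as a list of floats.

    `vocab_index` maps each k-mer to its position in the vector. K-mers that
    are not in the vocabulary (e.g. containing 'N') are skipped; that policy
    is an assumption of this sketch.
    """
    counts = [0] * len(vocab_index)
    total = 0
    for i in range(len(read) - k + 1):
        position = vocab_index.get(read[i:i + k])
        if position is not None:
            counts[position] += 1
            total += 1
    if total == 0:
        return [0.0] * len(vocab_index)
    return [count / total for count in counts]


# Toy usage: a 16-dimensional vector of 2-mer frequencies.
vocab = kmer_vocabulary(2)
vocab_index = {kmer: i for i, kmer in enumerate(vocab)}
vector = kmer_frequency_vector("ACGT", 2, vocab_index)
print(vector[vocab_index["AC"]])
```

Vectors of this kind, one per read, would form the document-term matrix that a topic model takes as input, with reads playing the role of documents and k-mers the role of words.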
|
44
|
Wang Y, Hu H, Li X. MBBC: an efficient approach for metagenomic binning based on clustering. BMC Bioinformatics 2015; 16:36. [PMID: 25652152 PMCID: PMC4339733 DOI: 10.1186/s12859-015-0473-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 07/30/2014] [Accepted: 01/22/2015] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Binning environmental shotgun reads is one of the most fundamental tasks in metagenomic studies, in which mixed reads from different species or operational taxonomical units (OTUs) are separated into different groups. While dozens of binning methods are available, there is still room for improvement. RESULTS We developed a novel taxonomy-independent approach called MBBC (Metagenomic Binning Based on Clustering) to cluster environmental shotgun reads, by considering k-mer frequency in reads and Markov properties of the inferred OTUs. Tested on twelve simulated datasets, MBBC reliably estimated the species number, the genome size, and the relative abundance of each species, independent of whether there are errors in reads. Tested on multiple experimental datasets, MBBC outperformed two state-of-the-art taxonomy-independent methods, in terms of the accuracy of the estimated species number, genome sizes, and percentages of correctly assigned reads, among other metrics. CONCLUSIONS We have developed a novel method for binning metagenomic reads based on clustering. This method is demonstrated to reliably predict species numbers, genome sizes, relative species abundances, and k-mer coverage in simple datasets. Our method also has a high accuracy in read binning. The MBBC software is freely available at http://eecs.ucf.edu/~xiaoman/MBBC/MBBC.html .
Affiliation(s)
- Ying Wang
- Department of Electric Engineering and Computer Science, University of Central Florida, Orlando, FL, 32816, USA.
| | - Haiyan Hu
- Department of Electric Engineering and Computer Science, University of Central Florida, Orlando, FL, 32816, USA.
| | - Xiaoman Li
- Department of Electric Engineering and Computer Science, University of Central Florida, Orlando, FL, 32816, USA.
- Burnett School of Biomedical Science, University of Central Florida, Orlando, FL, 32816, USA.
| |
|
45
|
Hou T, Liu F, Liu Y, Zou QY, Zhang X, Wang K. Classification of metagenomics data at lower taxonomic level using a robust supervised classifier. Evol Bioinform Online 2015; 11:3-10. [PMID: 25673967 PMCID: PMC4309676 DOI: 10.4137/ebo.s20523] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/28/2014] [Revised: 11/25/2014] [Accepted: 12/14/2014] [Indexed: 11/11/2022] Open
Abstract
As more and more completely sequenced genomes become available, the taxonomic classification of metagenomic data will benefit greatly from supervised classifiers that can be updated instantaneously in response to new genomes. Several supervised classifiers have been developed to identify the source organism of metagenomic sequences. We have found that existing supervised classifiers usually cannot discriminate accurately between training data from different classes when the data contain outliers. However, the training genomic data (bacterial and archaeal genomes) usually contain a portion of outliers, which arise from sequencing errors, phage invasions, and some highly expressed genes, among other sources. These outliers, treated as noise, hinder the development of classifiers with better prediction accuracy. To solve this problem, we present a robust supervised classifier, weighted support vector domain description (WSVDD), which can eliminate the interference from outliers in the training genomic data and then generate more accurate data domain descriptions for each taxonomic class. The experimental results demonstrate that WSVDD is more robust than other classifiers for simulated Sanger and 454 reads with different outlier rates. In addition, in experiments performed on simulated metagenomes and real gut metagenomes, WSVDD also achieved better prediction accuracy than other classifiers.
Affiliation(s)
- Tao Hou
- College of Communications Engineering, Jilin University, Changchun, China
| | - Fu Liu
- College of Communications Engineering, Jilin University, Changchun, China
| | - Yun Liu
- College of Communications Engineering, Jilin University, Changchun, China
| | - Qing Yu Zou
- College of Communications Engineering, Jilin University, Changchun, China
| | - Xiao Zhang
- College of Communications Engineering, Jilin University, Changchun, China
| | - Ke Wang
- College of Communications Engineering, Jilin University, Changchun, China
| |
|
46
|
Vinh LV, Lang TV, Binh LT, Hoai TV. A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads. Algorithms Mol Biol 2015; 10:2. [PMID: 25648210 PMCID: PMC4304631 DOI: 10.1186/s13015-014-0030-4] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 07/10/2014] [Accepted: 10/20/2014] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Metagenomics is the study of genetic material derived directly from complex microbial samples, instead of from culture. One of the crucial steps in metagenomic analysis, referred to as "binning", is to separate reads into clusters that represent genomes from closely related organisms. Among the existing binning methods, unsupervised methods base the classification on features extracted from the reads, which is especially advantageous when the availability of reference databases is limited. However, their performance, under various aspects, is still being investigated by recent theoretical and empirical studies. The work presented in this paper is among those efforts to enhance the accuracy of the classification. RESULTS This paper presents an unsupervised algorithm, called BiMeta, for binning of reads from different species in a metagenomic dataset. The algorithm consists of two phases. In the first phase, reads are grouped based on overlap information between the reads. The second phase merges the groups by using an observation on the l-mer frequency distribution of sets of non-overlapping reads. The experimental results on simulated and real datasets showed that BiMeta outperforms three state-of-the-art binning algorithms on both short-read and long-read (≥700 bp) datasets. CONCLUSIONS This paper developed a novel and efficient algorithm for binning of metagenomic reads, which does not require any reference database. The software implementing the algorithm and all test datasets mentioned in this paper can be downloaded at http://it.hcmute.edu.vn/bioinfo/bimeta/index.htm.
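The first phase described in this abstract, grouping reads by overlap information, can be illustrated with a small Python sketch. It is not BiMeta's algorithm: approximating "overlap" by counting shared l-mers, the `min_shared` threshold, and the union-find grouping are assumptions chosen for illustration, and the second phase (merging groups by l-mer frequency distribution) is not shown.

```python
from collections import defaultdict


def group_reads_by_shared_lmers(reads, l, min_shared=1):
    """Group read indices so that reads sharing >= min_shared distinct l-mers
    end up in the same group (connected components via union-find)."""
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Record which reads contain each distinct l-mer.
    owners = defaultdict(set)
    for idx, read in enumerate(reads):
        for i in range(len(read) - l + 1):
            owners[read[i:i + l]].add(idx)

    # Count distinct shared l-mers for every pair of reads.
    shared = defaultdict(int)
    for idxs in owners.values():
        ordered = sorted(idxs)
        for pos, a in enumerate(ordered):
            for b in ordered[pos + 1:]:
                shared[(a, b)] += 1

    for (a, b), count in shared.items():
        if count >= min_shared:
            union(a, b)

    groups = defaultdict(list)
    for idx in range(len(reads)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Toy usage with made-up reads: the first two share the l-mers TACG and ACGG.
reads = ["ACGTACGG", "TACGGTTA", "CCCCCCCC"]
print(group_reads_by_shared_lmers(reads, l=4))
```

The pairwise counting here is quadratic in the number of reads sharing an l-mer, so it only suits small examples; a practical tool would need a more economical overlap test.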
Affiliation(s)
- Le Van Vinh
- Faculty of Computer Science and Engineering, HCMC University of Technology, 268 Ly Thuong Kiet, Q10, Ho Chi Minh City, Vietnam
| | - Tran Van Lang
- Institute of Applied Mechanics and Informatics, Vietnam Academy of Science and Technology (VAST), 01 Mac Dinh Chi, Q1, Ho Chi Minh City, Vietnam
- Faculty of Information Technology, Lac Hong University, 10 Huynh Van Nghe, Bien Hoa, Dong Nai, Vietnam
| | - Le Thanh Binh
- Institute of Biotechnology, Vietnam Academy of Science and Technology (VAST), 18 Hoang Quoc Viet, Cau Giay, Ha Noi, Vietnam
| | - Tran Van Hoai
- Faculty of Computer Science and Engineering, HCMC University of Technology, 268 Ly Thuong Kiet, Q10, Ho Chi Minh City, Vietnam
| |
|
47
|
Abstract
Traditionally, microbial genome sequencing has been restricted to the small number of species that can be grown in pure culture. The progressive development of culture-independent methods over the last 15 years now allows researchers to sequence microbial communities directly from environmental samples. This approach is commonly referred to as "metagenomics" or "community genomics". However, the term metagenomics is applied liberally in the literature to describe any culture-independent analysis of microbial communities. Here, we define metagenomics as shotgun ("random") sequencing of the genomic DNA of a sample taken directly from the environment. The metagenome can be thought of as a sampling of the collective genome of the microbial community. We outline the considerations and analyses that should be undertaken to ensure the success of a metagenomic sequencing project, including the choice of sequencing platform and methods for assembly, binning, annotation, and comparative analysis.
Affiliation(s)
- Lauren Bragg
- Advanced Water Management Centre, The University of Queensland, St. Lucia, QLD, Australia
| | | |
|
48
|
Ahn TH, Chai J, Pan C. Sigma: strain-level inference of genomes from metagenomic analysis for biosurveillance. Bioinformatics 2014; 31:170-7. [PMID: 25266224 PMCID: PMC4287953 DOI: 10.1093/bioinformatics/btu641] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Indexed: 12/01/2022]
Abstract
Motivation: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. Results: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic reads to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. The algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. Availability and Implementation: Sigma was implemented in C++, with source code and binaries freely available at http://sigma.omicsbio.org. Contact: panc@ornl.gov Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tae-Hyuk Ahn
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
| | - Juanjuan Chai
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
| | - Chongle Pan
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
| |
|
49
|
Deusch O, O’Flynn C, Colyer A, Morris P, Allaway D, Jones PG, Swanson KS. Deep Illumina-based shotgun sequencing reveals dietary effects on the structure and function of the fecal microbiome of growing kittens. PLoS One 2014; 9:e101021. [PMID: 25010839 PMCID: PMC4091873 DOI: 10.1371/journal.pone.0101021] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 10/15/2013] [Accepted: 06/02/2014] [Indexed: 12/22/2022] Open
Abstract
Background Previously, we demonstrated that dietary protein:carbohydrate ratio dramatically affects the fecal microbial taxonomic structure of kittens using targeted 16S gene sequencing. The present study, using the same fecal samples, applied deep Illumina shotgun sequencing to identify the diet-associated functional potential and analyze taxonomic changes of the feline fecal microbiome. Methodology & Principal Findings Fecal samples from kittens fed one of two diets differing in protein and carbohydrate content (high–protein, low–carbohydrate, HPLC; and moderate-protein, moderate-carbohydrate, MPMC) were collected at 8, 12 and 16 weeks of age (n = 6 per group). A total of 345.3 gigabases of sequence were generated from 36 samples, with 99.75% of annotated sequences identified as bacterial. At the genus level, 26% and 39% of reads were annotated for HPLC- and MPMC-fed kittens, respectively, with HPLC-fed cats showing greater species richness and microbial diversity. Two phyla, ten families and fifteen genera were responsible for more than 80% of the sequences at each taxonomic level for both diet groups, consistent with the previous taxonomic study. Significantly different abundances between diet groups were observed for 324 genera (56% of all genera identified), demonstrating widespread diet-induced changes in microbial taxonomic structure. Diversity was not affected over time. Functional analysis identified 2,013 putative enzyme function groups that differed (p<0.000007) between the two dietary groups and were associated with 194 pathways, which formed five discrete clusters based on average relative abundance. Of those, ten contained more (p<0.022) enzyme functions with significant diet effects than expected by chance. Six pathways were related to amino acid biosynthesis and metabolism, linking changes in dietary protein with functional differences of the gut microbiome.
Conclusions These data indicate that feline feces-derived microbiomes have large structural and functional differences relating to the dietary protein:carbohydrate ratio and highlight the impact of diet early in life.
Affiliation(s)
- Oliver Deusch
- WALTHAM Centre for Pet Nutrition, Waltham-on-the-Wolds, Leicestershire, United Kingdom
| | - Ciaran O’Flynn
- WALTHAM Centre for Pet Nutrition, Waltham-on-the-Wolds, Leicestershire, United Kingdom
| | - Alison Colyer
- WALTHAM Centre for Pet Nutrition, Waltham-on-the-Wolds, Leicestershire, United Kingdom
| | - Penelope Morris
- WALTHAM Centre for Pet Nutrition, Waltham-on-the-Wolds, Leicestershire, United Kingdom
| | - David Allaway
- WALTHAM Centre for Pet Nutrition, Waltham-on-the-Wolds, Leicestershire, United Kingdom
| | - Paul G. Jones
- WALTHAM Centre for Pet Nutrition, Waltham-on-the-Wolds, Leicestershire, United Kingdom
| | - Kelly S. Swanson
- Department of Animal Sciences, University of Illinois, Urbana, Illinois, United States of America
- Division of Nutritional Sciences, University of Illinois, Urbana, Illinois, United States of America
- Department of Veterinary Clinical Medicine, University of Illinois, Urbana, Illinois, United States of America
| |
|
50
|
Darling AE, Jospin G, Lowe E, Matsen FA, Bik HM, Eisen JA. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2014; 2:e243. [PMID: 24482762 PMCID: PMC3897386 DOI: 10.7717/peerj.243] [Citation(s) in RCA: 412] [Impact Index Per Article: 41.2] [Received: 03/21/2013] [Accepted: 12/19/2013] [Indexed: 12/13/2022] Open
Abstract
Like all organisms on the planet, environmental microbes are subject to the forces of molecular evolution. Metagenomic sequencing provides a means to access the DNA sequence of uncultured microbes. By combining DNA sequencing of microbial communities with evolutionary modeling and phylogenetic analysis we might obtain new insights into microbiology and also provide a basis for practical tools such as forensic pathogen detection. In this work we present an approach to leverage phylogenetic analysis of metagenomic sequence data to conduct several types of analysis. First, we present a method to conduct phylogeny-driven Bayesian hypothesis tests for the presence of an organism in a sample. Second, we present a means to compare community structure across a collection of many samples and develop direct associations between the abundance of certain organisms and sample metadata. Third, we apply new tools to analyze the phylogenetic diversity of microbial communities and again demonstrate how this can be associated to sample metadata. These analyses are implemented in an open source software pipeline called PhyloSift. As a pipeline, PhyloSift incorporates several other programs including LAST, HMMER, and pplacer to automate phylogenetic analysis of protein coding and RNA sequences in metagenomic datasets generated by modern sequencing platforms (e.g., Illumina, 454).
Affiliation(s)
- Aaron E Darling
- ithree institute, University of Technology Sydney, Sydney, Australia; Genome Center, University of California, Davis, CA, United States of America
| | - Guillaume Jospin
- Genome Center, University of California, Davis, CA, United States of America
| | - Eric Lowe
- Genome Center, University of California, Davis, CA, United States of America
| | - Frederick A Matsen
- Fred Hutchinson Cancer Research Center, Seattle, WA, United States of America
| | - Holly M Bik
- Genome Center, University of California, Davis, CA, United States of America
| | - Jonathan A Eisen
- Department of Evolution and Ecology, University of California, Davis, CA, United States of America; Department of Medical Microbiology and Immunology, University of California, Davis, CA, United States of America
| |
|