1
Lai S, Pan S, Sun C, Coelho LP, Chen WH, Zhao XM. metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies. Genome Biol 2022; 23:242. [PMID: 36376928 PMCID: PMC9661791 DOI: 10.1186/s13059-022-02810-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/10/2021] [Accepted: 11/01/2022] [Indexed: 11/16/2022] [Open Access]
Abstract
Evaluating the quality of metagenomic assemblies is important for constructing reliable metagenome-assembled genomes and downstream analyses. Here, we present metaMIC ( https://github.com/ZhaoXM-Lab/metaMIC ), a machine learning-based tool for identifying and correcting misassemblies in metagenomic assemblies. Benchmarking results on both simulated and real datasets demonstrate that metaMIC outperforms existing tools when identifying misassembled contigs. Furthermore, metaMIC is able to localize the misassembly breakpoints, and the correction of misassemblies by splitting at misassembly breakpoints can improve downstream scaffolding and binning results.
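The correction step described in the abstract, splitting a contig at its misassembly breakpoints, can be illustrated with a minimal sketch. This is not metaMIC's implementation: the function name `split_contig`, the `min_length` filter, and the convention that breakpoints are 0-based cut positions are all assumptions made for illustration.

```python
def split_contig(sequence, breakpoints, min_length=1):
    """Split a contig at predicted breakpoints (hypothetical helper).

    Assumption: a breakpoint b is a 0-based position, and the cut falls
    between sequence[b - 1] and sequence[b].
    """
    # Keep only cut positions strictly inside the contig, sorted and unique.
    cuts = sorted({b for b in breakpoints if 0 < b < len(sequence)})
    # Consecutive boundaries delimit the resulting fragments.
    bounds = [0] + cuts + [len(sequence)]
    fragments = [sequence[start:end] for start, end in zip(bounds, bounds[1:])]
    # Drop fragments shorter than the (assumed) minimum length.
    return [fragment for fragment in fragments if len(fragment) >= min_length]


# Toy example: a 10 bp contig with breakpoints at positions 4 and 7.
fragments = split_contig("ACGTACGTAC", [4, 7])
print(fragments)  # ['ACGT', 'ACG', 'TAC']
```

In practice the resulting fragments would be passed on to scaffolding or binning in place of the original contig.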
Affiliation(s)
- Senying Lai
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Shaojun Pan
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Chuqing Sun
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Luis Pedro Coelho
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- Wei-Hua Chen
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
  - College of Life Science, Henan Normal University, Xinxiang, Henan, China
- Xing-Ming Zhao
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
  - State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
  - International Human Phenome Institutes (Shanghai), Shanghai, China
  - Zhangjiang Fudan International Innovation Center, Shanghai, China
2
MacDonald ML, Lee KH. EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality. BMC Bioinformatics 2021; 22:570. [PMID: 34837948 PMCID: PMC8627028 DOI: 10.1186/s12859-021-04480-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2020] [Accepted: 11/15/2021] [Indexed: 11/16/2022] [Open Access]
Abstract
Background: To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scoring of genome assemblies and does not require an existing reference genome for accuracy assessment.
Results: EvalDNA calculates a list of quality metrics from an assembled sequence and applies a model created from supervised machine learning methods to integrate various metrics into a comprehensive quality score. A well-tested, accurate model for scoring mammalian genome sequences is provided as part of EvalDNA. This random forest regression model evaluates an assembled sequence based on continuity, completeness, and accuracy, and was able to explain 86% of the variation in reference-based quality scores within the testing data. EvalDNA was applied to human chromosome 14 assemblies from the GAGE study to rank genome assemblers and to compare EvalDNA to two other quality evaluation tools. In addition, EvalDNA was used to evaluate several genome assemblies of the Chinese hamster genome to help establish a better reference genome for the biopharmaceutical manufacturing community. EvalDNA was also used to assess more recent human assemblies from the QUAST-LG study completed in 2018, and its ability to score bacterial genomes was examined through application on bacterial assemblies from the GAGE-B study.
Conclusions: EvalDNA enables scientists to easily identify the best available genome assembly for their organism of interest without requiring a reference assembly. EvalDNA sets itself apart from other quality assessment tools by producing a quality score that enables direct comparison among assemblies from different species.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04480-2.
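The phrase "explain 86% of the variation" is most naturally read as a coefficient of determination (R²) of 0.86 between the model's predictions and the reference-based scores. The abstract does not name the statistic, so this reading is an assumption. A minimal sketch of how such a value is computed follows; the function name `r_squared` and the toy scores are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    # Residual sum of squares: variation the model fails to explain.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: variation of the observed scores around their mean.
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy data (not from the paper): reference-based scores vs. model predictions.
reference_scores = [1.0, 2.0, 3.0, 4.0]
model_scores = [1.1, 1.9, 3.2, 3.8]
print(r_squared(reference_scores, model_scores))
```

A value of 1.0 means the predictions reproduce the reference scores exactly, and 0.0 means they do no better than always predicting the mean.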
Affiliation(s)
- Madolyn L MacDonald
  - Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, 19711, USA
  - Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave., Newark, 19716, USA
  - Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, 19711, USA
- Kelvin H Lee
  - Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, 19711, USA
  - Department of Chemical and Biomolecular Engineering, University of Delaware, 150 Academy Street, Newark, 19716, USA
3
Kayani MUR, Huang W, Feng R, Chen L. Genome-resolved metagenomics using environmental and clinical samples. Brief Bioinform 2021; 22:bbab030. [PMID: 33758906 PMCID: PMC8425419 DOI: 10.1093/bib/bbab030] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/21/2020] [Revised: 11/29/2020] [Accepted: 01/20/2021] [Indexed: 12/25/2022] [Open Access]
Abstract
Recent advances in high-throughput sequencing technologies and computational methods have added a new dimension to metagenomic data analysis, i.e., genome-resolved metagenomics. In general terms, it refers to the recovery of draft or high-quality microbial genomes and their taxonomic classification and functional annotation. In recent years, several studies have utilized the genome-resolved metagenome analysis approach and identified previously unknown microbial species from human and environmental metagenomes. In this review, we describe genome-resolved metagenome analysis as a series of four necessary steps: (i) preprocessing of the sequencing reads, (ii) de novo metagenome assembly, (iii) genome binning and (iv) taxonomic and functional analysis of the recovered genomes. For each of these four steps, we discuss the most commonly used tools and the currently available pipelines to guide the scientific community in the recovery and subsequent analyses of genomes from any metagenome sample. Furthermore, we also discuss the tools required for validation of assembly quality as well as for improving quality of the recovered genomes. We also highlight the currently available pipelines that can be used to automate the whole analysis without requiring advanced bioinformatics knowledge. Finally, we highlight the most widely adopted and actively maintained tools and pipelines that can help the scientific community in decision making before they commence the analysis.
Affiliation(s)
- Masood ur Rehman Kayani
  - Center for Microbiota and Immunological Diseases, Shanghai General Hospital, Shanghai Institute of Immunology, Shanghai Jiao Tong University, School of Medicine, Shanghai 200025, China
- Wanqiu Huang
  - Shanghai Institute of Immunology, Shanghai Jiao Tong University, School of Medicine, Shanghai 200000, China
- Ru Feng
  - Center for Microbiota and Immunological Diseases, Shanghai General Hospital, Shanghai Institute of Immunology, Shanghai Jiao Tong University, School of Medicine, Shanghai 200025, China
- Lei Chen
  - Center for Microbiota and Immunological Diseases, Shanghai General Hospital, Shanghai Institute of Immunology, Shanghai Jiao Tong University, School of Medicine, Shanghai 200025, China
4
Meyer F, Lesker TR, Koslicki D, Fritz A, Gurevich A, Darling AE, Sczyrba A, Bremges A, McHardy AC. Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit. Nat Protoc 2021; 16:1785-1801. [PMID: 33649565 DOI: 10.1038/s41596-020-00480-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 03/10/2020] [Accepted: 11/26/2020] [Indexed: 01/31/2023]
Abstract
Computational methods are key in microbiome research, and obtaining a quantitative and unbiased performance estimate is important for method developers and applied researchers. For meaningful comparisons between methods, to identify best practices and common use cases, and to reduce overhead in benchmarking, it is necessary to have standardized datasets, procedures and metrics for evaluation. In this tutorial, we describe emerging standards in computational meta-omics benchmarking derived and agreed upon by a larger community of researchers. Specifically, we outline recent efforts by the Critical Assessment of Metagenome Interpretation (CAMI) initiative, which supplies method developers and applied researchers with exhaustive quantitative data about software performance in realistic scenarios and organizes community-driven benchmarking challenges. We explain the most relevant evaluation metrics for assessing metagenome assembly, binning and profiling results, and provide step-by-step instructions on how to generate them. The instructions use simulated mouse gut metagenome data released in preparation for the second round of CAMI challenges and showcase the use of a repository of tool results for CAMI datasets. This tutorial will serve as a reference for the community and facilitate informative and reproducible benchmarking in microbiome research.
Affiliation(s)
- Fernando Meyer
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Till-Robin Lesker
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Center for Infection Research (DZIF), Braunschweig, Germany
- David Koslicki
  - Computer Science and Engineering, Biology, and The Huck Institutes of the Life Sciences, Penn State University, State College, PA, USA
- Adrian Fritz
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Alexey Gurevich
  - Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Aaron E Darling
  - The ithree institute, University of Technology Sydney, Sydney, Australia
- Alexander Sczyrba
  - Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Andreas Bremges
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - German Center for Infection Research (DZIF), Braunschweig, Germany
- Alice C McHardy
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
5
Mineeva O, Rojas-Carulla M, Ley RE, Schölkopf B, Youngblut ND. DeepMAsED: evaluating the quality of metagenomic assemblies. Bioinformatics 2020; 36:3011-3017. [PMID: 32096824 DOI: 10.1093/bioinformatics/btaa124] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 09/09/2019] [Revised: 01/19/2020] [Accepted: 02/18/2020] [Indexed: 11/13/2022] [Open Access]
Abstract
Motivation: Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies.
Results: We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications.
Conclusions: DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is re-training the model with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects.
Availability and implementation: DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Olga Mineeva
  - Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen 72076, Germany
  - Department of Computer Science, ETH Zürich, Zürich 8092, Switzerland
- Mateo Rojas-Carulla
  - Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen 72076, Germany
- Ruth E Ley
  - Department of Microbiome Science, Max Planck Institute for Developmental Biology, Tübingen 72076, Germany
- Bernhard Schölkopf
  - Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen 72076, Germany
- Nicholas D Youngblut
  - Department of Microbiome Science, Max Planck Institute for Developmental Biology, Tübingen 72076, Germany
6
Padovani de Souza K, Setubal JC, Ponce de Leon F de Carvalho AC, Oliveira G, Chateau A, Alves R. Machine learning meets genome assembly. Brief Bioinform 2020; 20:2116-2129. [PMID: 30137230 DOI: 10.1093/bib/bby072] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/18/2018] [Revised: 07/11/2018] [Accepted: 07/22/2018] [Indexed: 12/23/2022] [Open Access]
Abstract
Motivation: With the recent advances in DNA sequencing technologies, the study of the genetic composition of living organisms has become more accessible for researchers. Several advances have been achieved because of it, especially in the health sciences. However, many challenges which emerge from the complexity of sequencing projects remain unsolved. Among them is the task of assembling DNA fragments from previously unsequenced organisms, which is classified as an NP-hard (nondeterministic polynomial time hard) problem, for which no efficient computational solution with reasonable execution time exists. However, several tools that produce approximate solutions have been used with results that have facilitated scientific discoveries, although there is ample room for improvement. As with other NP-hard problems, machine learning algorithms have been one of the approaches used in recent years in an attempt to find better solutions to the DNA fragment assembly problem, although still at a low scale.
Results: This paper presents a broad review of pioneering literature comprising artificial intelligence-based DNA assemblers, particularly the ones that use machine learning, to provide an overview of state-of-the-art approaches and to serve as a starting point for further study in this field.
Affiliation(s)
- João Carlos Setubal
  - University of São Paulo, Brazil
  - Department of Computer Science, University of São Paulo, Brazil
- Annie Chateau
  - Vale Technology Institute-Sustainable Development, Brazil
- Ronnie Alves
  - Federal University of Pará, Brazil
  - University of Montpellier, LIRMM, France
7
Trappe K, Marschall T, Renard BY. Detecting horizontal gene transfer by mapping sequencing reads across species boundaries. Bioinformatics 2016; 32:i595-i604. [DOI: 10.1093/bioinformatics/btw423] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Indexed: 02/06/2023] [Open Access]
8
Huang BFF, Boutros PC. The parameter sensitivity of random forests. BMC Bioinformatics 2016; 17:331. [PMID: 27586051 PMCID: PMC5009551 DOI: 10.1186/s12859-016-1228-x] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 12/15/2015] [Accepted: 08/26/2016] [Indexed: 02/07/2023] [Open Access]
Abstract
Background: The Random Forest (RF) algorithm for supervised machine learning is an ensemble learning method widely used in science and many other fields. Its popularity has been increasing, but relatively few studies address the parameter selection process: a critical step in model fitting. Due to numerous assertions regarding the performance reliability of the default parameters, many RF models are fit using these values. However, there has not yet been a thorough examination of the parameter sensitivity of RFs in computational genomic studies. We address this gap here.
Results: We examined the effects of parameter selection on classification performance using the RF machine learning algorithm on two biological datasets with distinct p/n ratios: sequencing summary statistics (low p/n) and microarray-derived data (high p/n). Here, p refers to the number of variables and n to the number of samples. Our findings demonstrate that parameterization is highly correlated with prediction accuracy and variable importance measures (VIMs). Further, we demonstrate that different parameters are critical in tuning different datasets, and that parameter optimization significantly improves upon the default parameters.
Conclusions: Parameter performance demonstrated wide variability on both low and high p/n data. Therefore, there is significant benefit to be gained by tuning RF models away from their default parameter settings.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1228-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Barbara F F Huang
  - Informatics and Bio-computing Program, Ontario Institute for Cancer Research, Toronto, Canada
- Paul C Boutros
  - Informatics and Bio-computing Program, Ontario Institute for Cancer Research, Toronto, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, Canada
  - Department of Pharmacology and Toxicology, University of Toronto, Toronto, Canada
  - MaRS Centre, 661 University Avenue, Suite 510, Toronto, Ontario, M5G 0A3, Canada