1
Joe H, Kim HG. Multi-label classification with XGBoost for metabolic pathway prediction. BMC Bioinformatics 2024; 25:52. [PMID: 38297220 PMCID: PMC10832249 DOI: 10.1186/s12859-024-05666-0] [Received: 07/14/2023] [Accepted: 01/22/2024] [Indexed: 02/02/2024]
Abstract
BACKGROUND Metabolic pathway prediction is one approach to the systems biology problem of reconstructing an organism's metabolic network from its genome sequence. Recent studies of machine learning-based pathway prediction methods conclude that these methods perform similarly to the most widely used method, PathoLogic, which is rule-based. However, those studies evaluated PathoLogic without taxonomic pruning, which decreases its performance. RESULTS In this study, we update the evaluation results from previous studies to demonstrate that PathoLogic with taxonomic pruning outperforms previous machine learning-based approaches, and that these approaches need further performance improvements to be competitive. Furthermore, we introduce mlXGPR, an XGBoost-based metabolic pathway prediction method built on the multi-label classification framework introduced by mlLGPR. We also improve on this multi-label framework by exploiting correlations between labels using classifier chains. We propose a ranking method that determines the order of the chain so that lower-performing classifiers are placed later in the chain, where they can make more use of the correlations between labels. We evaluate mlXGPR with and without classifier chains on single-organism and multi-organism benchmarks. Our results indicate that mlXGPR outperforms previous pathway prediction methods, including PathoLogic with taxonomic pruning, in terms of Hamming loss, precision and F1 score on single-organism benchmarks. CONCLUSIONS The results of our study indicate that the performance of machine learning-based pathway prediction methods can be substantially improved and can even exceed that of PathoLogic with taxonomic pruning.
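The chain-ordering idea and the Hamming loss metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which score drives the ranking, so per-label F1 from independent classifiers is assumed here, and the pathway names, scores, and function names are hypothetical.

```python
from typing import Dict, List, Sequence


def chain_order(label_scores: Dict[str, float]) -> List[str]:
    """Order labels so that weaker baseline classifiers come later in the chain."""
    # Sort by descending score; ties are broken by label name for determinism.
    return sorted(label_scores, key=lambda label: (-label_scores[label], label))


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of (sample, label) cells where prediction and truth disagree."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total


# Hypothetical per-pathway F1 scores from independent (non-chained) classifiers.
baseline_f1 = {"urea_cycle": 0.62, "glycolysis": 0.91, "tca_cycle": 0.84}

# The strongest classifier goes first; the weakest goes last, so that it can
# use the predictions of every earlier label as extra input features.
order = chain_order(baseline_f1)

# Toy ground truth and predictions: 2 samples, 3 labels each.
loss = hamming_loss([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 1]])
```

In a classifier chain, each classifier receives the original features plus the predictions for all labels earlier in the order, which is why placing weak classifiers late gives them the most additional information.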
Affiliation(s)
- Hyunwhan Joe
  - Biomedical Knowledge Engineering Lab., Seoul National University, Seoul, Republic of Korea
- Hong-Gee Kim
  - Biomedical Knowledge Engineering Lab., Seoul National University, Seoul, Republic of Korea
  - School of Dentistry and Dental Research Institute, Seoul National University, Seoul, Republic of Korea
2
Anstett J, Plominsky AM, DeLong EF, Kiesser A, Jürgens K, Morgan-Lang C, Stepanauskas R, Stewart FJ, Ulloa O, Woyke T, Malmstrom R, Hallam SJ. A compendium of bacterial and archaeal single-cell amplified genomes from oxygen deficient marine waters. Sci Data 2023; 10:332. [PMID: 37244914 DOI: 10.1038/s41597-023-02222-y] [Received: 08/26/2022] [Accepted: 05/10/2023] [Indexed: 05/29/2023]
Abstract
Oxygen-deficient marine waters, referred to as oxygen minimum zones (OMZs) or anoxic marine zones (AMZs), are common oceanographic features. They host both cosmopolitan and endemic microorganisms adapted to low-oxygen conditions. Microbial metabolic interactions within OMZs and AMZs drive coupled biogeochemical cycles, resulting in nitrogen loss and in the production and consumption of climate-active trace gases. Global warming is causing oxygen-deficient waters to expand and intensify. Therefore, studies focused on microbial communities inhabiting oxygen-deficient regions are necessary to both monitor and model the impacts of climate change on marine ecosystem functions and services. Here we present a compendium of 5,129 single-cell amplified genomes (SAGs) from marine environments encompassing representative OMZ and AMZ geochemical profiles. Of these, 3,570 SAGs have been sequenced to different levels of completion, providing a strain-resolved perspective on the genomic content and potential metabolic interactions within OMZ and AMZ microbiomes. Hierarchical clustering confirmed that samples from similar oxygen concentrations and geographic regions also had analogous taxonomic compositions, providing a coherent framework for comparative community analysis.
Affiliation(s)
- Julia Anstett
  - Graduate Program in Genome Sciences and Technology, Genome Sciences Centre, University of British Columbia, Vancouver, British Columbia, Canada
  - Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, V6T 1Z3, Canada
- Alvaro M Plominsky
  - Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, V6T 1Z3, Canada
  - Marine Biology Research Division, Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, 92037, USA
- Edward F DeLong
  - Daniel K. Inouye Center for Microbial Oceanography: Research and Education, University of Hawaii, Manoa, Honolulu, HI, 96822, USA
- Alyse Kiesser
  - School of Engineering, The University of British Columbia, Kelowna, BC, Canada
- Klaus Jürgens
  - Leibniz Institute for Baltic Sea Research, Warnemünde, Germany
- Connor Morgan-Lang
  - Graduate Program in Bioinformatics, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Frank J Stewart
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
  - Center for Microbial Dynamics and Infection, Georgia Institute of Technology, Atlanta, GA, USA
  - Department of Microbiology and Cell Biology, Montana State University, Bozeman, MT, USA
- Osvaldo Ulloa
  - Departamento de Oceanografía, Universidad de Concepción, Casilla 160-C, 4070386, Concepción, Chile
  - Instituto Milenio de Oceanografía, Casilla 1313, 4070386, Concepción, Chile
- Tanja Woyke
  - Department of Energy Joint Genome Institute, Berkeley, CA, USA
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Rex Malmstrom
  - Department of Energy Joint Genome Institute, Berkeley, CA, USA
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Steven J Hallam
  - Graduate Program in Genome Sciences and Technology, Genome Sciences Centre, University of British Columbia, Vancouver, British Columbia, Canada
  - Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, V6T 1Z3, Canada
  - Graduate Program in Bioinformatics, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
  - Life Sciences Institute, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada
  - ECOSCOPE Training Program, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada