1
|
Pu L, Shamir R. 4CAC: 4-class classifier of metagenome contigs using machine learning and assembly graphs. Nucleic Acids Res 2024; 52:e94. [PMID: 39287139 PMCID: PMC11514454 DOI: 10.1093/nar/gkae799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 07/13/2024] [Accepted: 09/02/2024] [Indexed: 09/19/2024] Open
Abstract
Microbial communities usually harbor a mix of bacteria, archaea, plasmids, viruses and microeukaryotes. Within these communities, viruses, plasmids, and microeukaryotes coexist in relatively low abundance, yet they engage in intricate interactions with bacteria. Moreover, viruses and plasmids, as mobile genetic elements, play important roles in horizontal gene transfer and the development of antibiotic resistance within microbial populations. However, due to the difficulty of identifying viruses, plasmids, and microeukaryotes in microbial communities, our understanding of these minor classes lags behind that of bacteria and archaea. Recently, several classifiers have been developed to separate one or more minor classes from bacteria and archaea in metagenome assemblies. However, these classifiers often overlook the issue of class imbalance, leading to low precision in identifying the minor classes. Here, we developed a classifier called 4CAC that is able to identify viruses, plasmids, microeukaryotes, and prokaryotes simultaneously from metagenome assemblies. 4CAC generates an initial four-way classification using several sequence length-adjusted XGBoost models and further improves the classification using the assembly graph. Evaluation on simulated and real metagenome datasets demonstrates that 4CAC substantially outperforms existing classifiers and combinations thereof on short reads. On long reads, it also shows an advantage unless the abundance of the minor classes is very low. 4CAC runs 1-2 orders of magnitude faster than the other classifiers. The 4CAC software is available at https://github.com/Shamir-Lab/4CAC.
Collapse
Affiliation(s)
- Lianrong Pu
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- School of Computer Science and Technology, Shandong University, Qingdao, China
| | - Ron Shamir
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| |
Collapse
|
2
|
Rojas-López AG, Rodríguez-Molina A, Uriarte-Arcia AV, Villarreal-Cervantes MG. Vertebral Column Pathology Diagnosis Using Ensemble Strategies Based on Supervised Machine Learning Techniques. Healthcare (Basel) 2024; 12:1324. [PMID: 38998860 PMCID: PMC11241707 DOI: 10.3390/healthcare12131324] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2024] [Revised: 06/25/2024] [Accepted: 06/28/2024] [Indexed: 07/14/2024] Open
Abstract
One expanding area of bioinformatics is medical diagnosis through the categorization of biomedical characteristics. Automatic medical strategies to boost the diagnostic through machine learning (ML) methods are challenging. They require a formal examination of their performance to identify the best conditions that enhance the ML method. This work proposes variants of the Voting and Stacking (VC and SC) ensemble strategies based on diverse auto-tuning supervised machine learning techniques to increase the efficacy of traditional baseline classifiers for the automatic diagnosis of vertebral column orthopedic illnesses. The ensemble strategies are created by first combining a complete set of auto-tuned baseline classifiers based on different processes, such as geometric, probabilistic, logic, and optimization. Next, the three most promising classifiers are selected among k-Nearest Neighbors (kNN), Naïve Bayes (NB), Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Support Vector Machine (SVM), Artificial Neural Networks (ANN), and Decision Tree (DT). The grid-search K-Fold cross-validation strategy is applied to auto-tune the baseline classifier hyperparameters. The performances of the proposed ensemble strategies are independently compared with the auto-tuned baseline classifiers. A concise analysis evaluates accuracy, precision, recall, F1-score, and ROC-ACU metrics. The analysis also examines the misclassified disease elements to find the most and least reliable classifiers for this specific medical problem. The results show that the VC ensemble strategy provides an improvement comparable to that of the best baseline classifier (the kNN). Meanwhile, when all baseline classifiers are included in the SC ensemble, this strategy surpasses 95% in all the evaluated metrics, standing out as the most suitable option for classifying vertebral column diseases.
Collapse
Affiliation(s)
- Alam Gabriel Rojas-López
- Optimal Mechatronic Design Laboratory, Postgraduate Department, Instituto Politécnico Nacional—Centro de Innovación y Desarrollo Tecnológico en Cómputo, Mexico City 07700, Mexico; (A.G.R.-L.); (A.V.U.-A.)
| | | | - Abril Valeria Uriarte-Arcia
- Optimal Mechatronic Design Laboratory, Postgraduate Department, Instituto Politécnico Nacional—Centro de Innovación y Desarrollo Tecnológico en Cómputo, Mexico City 07700, Mexico; (A.G.R.-L.); (A.V.U.-A.)
| | - Miguel Gabriel Villarreal-Cervantes
- Optimal Mechatronic Design Laboratory, Postgraduate Department, Instituto Politécnico Nacional—Centro de Innovación y Desarrollo Tecnológico en Cómputo, Mexico City 07700, Mexico; (A.G.R.-L.); (A.V.U.-A.)
| |
Collapse
|
3
|
Sielemann J, Sielemann K, Brejová B, Vinař T, Chauve C. plASgraph2: using graph neural networks to detect plasmid contigs from an assembly graph. Front Microbiol 2023; 14:1267695. [PMID: 37869681 PMCID: PMC10587606 DOI: 10.3389/fmicb.2023.1267695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Accepted: 09/08/2023] [Indexed: 10/24/2023] Open
Abstract
Identification of plasmids from sequencing data is an important and challenging problem related to antimicrobial resistance spread and other One-Health issues. We provide a new architecture for identifying plasmid contigs in fragmented genome assemblies built from short-read data. We employ graph neural networks (GNNs) and the assembly graph to propagate the information from nearby nodes, which leads to more accurate classification, especially for short contigs that are difficult to classify based on sequence features or database searches alone. We trained plASgraph2 on a data set of samples from the ESKAPEE group of pathogens. plASgraph2 either outperforms or performs on par with a wide range of state-of-the-art methods on testing sets of independent ESKAPEE samples and samples from related pathogens. On one hand, our study provides a new accurate and easy to use tool for contig classification in bacterial isolates; on the other hand, it serves as a proof-of-concept for the use of GNNs in genomics. Our software is available at https://github.com/cchauve/plasgraph2 and the training and testing data sets are available at https://github.com/fmfi-compbio/plasgraph2-datasets.
Collapse
Affiliation(s)
- Janik Sielemann
- Computational Biology, Faculty of Biology, Center for Biotechnology & Graduate School Digital Infrastructures for the Life Sciences (DILS), Bielefeld Institute for Bioinformatics Infrastructure, Bielefeld University, Bielefeld, Germany
| | - Katharina Sielemann
- Genetics and Genomics of Plants, Faculty of Biology, Center for Biotechnology & Graduate School Digital Infrastructures for the Life Sciences (DILS), Bielefeld Institute for Bioinformatics Infrastructure, Bielefeld University, Bielefeld, Germany
| | - Broňa Brejová
- Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
| | - Tomáš Vinař
- Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
| | - Cedric Chauve
- Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
| |
Collapse
|
4
|
Schackart KE, Graham JB, Ponsero AJ, Hurwitz BL. Evaluation of computational phage detection tools for metagenomic datasets. Front Microbiol 2023; 14:1078760. [PMID: 36760501 PMCID: PMC9902911 DOI: 10.3389/fmicb.2023.1078760] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2022] [Accepted: 01/09/2023] [Indexed: 01/25/2023] Open
Abstract
Introduction As new computational tools for detecting phage in metagenomes are being rapidly developed, a critical need has emerged to develop systematic benchmarks. Methods In this study, we surveyed 19 metagenomic phage detection tools, 9 of which could be installed and run at scale. Those 9 tools were assessed on several benchmark challenges. Fragmented reference genomes are used to assess the effects of fragment length, low viral content, phage taxonomy, robustness to eukaryotic contamination, and computational resource usage. Simulated metagenomes are used to assess the effects of sequencing and assembly quality on the tool performances. Finally, real human gut metagenomes and viromes are used to assess the differences and similarities in the phage communities predicted by the tools. Results We find that the various tools yield strikingly different results. Generally, tools that use a homology approach (VirSorter, MARVEL, viralVerify, VIBRANT, and VirSorter2) demonstrate low false positive rates and robustness to eukaryotic contamination. Conversely, tools that use a sequence composition approach (VirFinder, DeepVirFinder, Seeker), and MetaPhinder, have higher sensitivity, including to phages with less representation in reference databases. These differences led to widely differing predicted phage communities in human gut metagenomes, with nearly 80% of contigs being marked as phage by at least one tool and a maximum overlap of 38.8% between any two tools. While the results were more consistent among the tools on viromes, the differences in results were still significant, with a maximum overlap of 60.65%. Discussion: Importantly, the benchmark datasets developed in this study are publicly available and reusable to enable the future comparability of new tools developed.
Collapse
Affiliation(s)
- Kenneth E. Schackart
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
| | - Jessica B. Graham
- BIO5 Institute, The University of Arizona, Tucson, AZ, United States
| | - Alise J. Ponsero
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
- BIO5 Institute, The University of Arizona, Tucson, AZ, United States
- Human Microbiome Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
| | - Bonnie L. Hurwitz
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
- BIO5 Institute, The University of Arizona, Tucson, AZ, United States
| |
Collapse
|
5
|
Zou X, Nguyen M, Overbeek J, Cao B, Davis JJ. Classification of bacterial plasmid and chromosome derived sequences using machine learning. PLoS One 2022; 17:e0279280. [PMID: 36525447 PMCID: PMC9757591 DOI: 10.1371/journal.pone.0279280] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2022] [Accepted: 12/02/2022] [Indexed: 12/23/2022] Open
Abstract
Plasmids are important genetic elements that facilitate horizonal gene transfer between bacteria and contribute to the spread of virulence and antimicrobial resistance. Most bacterial genome sequences in the public archives exist in draft form with many contigs, making it difficult to determine if a contig is of chromosomal or plasmid origin. Using a training set of contigs comprising 10,584 chromosomes and 10,654 plasmids from the PATRIC database, we evaluated several machine learning models including random forest, logistic regression, XGBoost, and a neural network for their ability to classify chromosomal and plasmid sequences using nucleotide k-mers as features. Based on the methods tested, a neural network model that used nucleotide 6-mers as features that was trained on randomly selected chromosomal and plasmid subsequences 5kb in length achieved the best performance, outperforming existing out-of-the-box methods, with an average accuracy of 89.38% ± 2.16% over a 10-fold cross validation. The model accuracy can be improved to 92.08% by using a voting strategy when classifying holdout sequences. In both plasmids and chromosomes, subsequences encoding functions involved in horizontal gene transfer-including hypothetical proteins, transporters, phage, mobile elements, and CRISPR elements-were most likely to be misclassified by the model. This study provides a straightforward approach for identifying plasmid-encoding sequences in short read assemblies without the need for sequence alignment-based tools.
Collapse
Affiliation(s)
- Xiaohui Zou
- Laboratory of Clinical Microbiology and Infectious Diseases, Department of Pulmonary and Critical Care Medicine, Center for Respiratory Diseases, China-Japan Friendship Hospital, National Clinical Research Centre for Respiratory Disease, Beijing, China
| | - Marcus Nguyen
- Data Science and Learning Division, Computing Environment and Life Sciences Directorate, Argonne National Laboratory, Lemont, IL, United States of America
- Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, United States of America
| | - Jamie Overbeek
- Data Science and Learning Division, Computing Environment and Life Sciences Directorate, Argonne National Laboratory, Lemont, IL, United States of America
- Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, United States of America
| | - Bin Cao
- Laboratory of Clinical Microbiology and Infectious Diseases, Department of Pulmonary and Critical Care Medicine, Center for Respiratory Diseases, China-Japan Friendship Hospital, National Clinical Research Centre for Respiratory Disease, Beijing, China
- * E-mail: (JJD); (BC)
| | - James J. Davis
- Data Science and Learning Division, Computing Environment and Life Sciences Directorate, Argonne National Laboratory, Lemont, IL, United States of America
- Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, United States of America
- * E-mail: (JJD); (BC)
| |
Collapse
|