1
Mechtersheimer D, Ding W, Xu X, Kim S, Sue C, Cao Y, Yang J. IMPACT: interpretable microbial phenotype analysis via microbial characteristic traits. Bioinformatics 2024; 41:btae702. [PMID: 39658259 DOI: 10.1093/bioinformatics/btae702] [Received: 12/07/2023] [Revised: 08/23/2024] [Accepted: 12/09/2024] [Indexed: 12/12/2024] Open
Abstract
MOTIVATION: The human gut microbiome, consisting of trillions of bacteria, significantly impacts health and disease. High-throughput profiling through the advancement of modern technology provides the potential to enhance our understanding of the link between the microbiome and complex disease outcomes. However, there remains an open challenge where current microbiome models lack interpretability of microbial features, limiting a deeper understanding of the role of the gut microbiome in disease. To address this, we present a framework that combines a feature engineering step to transform tabular abundance data to image format using functional microbial annotation databases, with a residual spatial attention transformer block architecture for phenotype classification. RESULTS: Our model, IMPACT, delivers improved predictive accuracy across multiclass classification compared to similar methods. More importantly, our approach provides interpretable feature importance through image classification saliency methods. This enables the extraction of taxa markers (features) associated with a disease outcome and also their associated functional microbial traits and metabolites. AVAILABILITY AND IMPLEMENTATION: IMPACT is available at https://github.com/SydneyBioX/IMPACT. We provide direct installation of IMPACT via pip.
Affiliation(s)
- Daniel Mechtersheimer
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- Wenze Ding
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- Xiangnan Xu
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Neuroscience Research Australia, Randwick, NSW 2031, Australia
- Sanghyun Kim
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
- Carolyn Sue
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Neuroscience Research Australia, Randwick, NSW 2031, Australia
- Yue Cao
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), New Territories, Hong Kong SAR, China
- Jean Yang
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), New Territories, Hong Kong SAR, China
2
Monshizadeh M, Hong Y, Ye Y. Multitask knowledge-primed neural network for predicting missing metadata and host phenotype based on human microbiome. BIOINFORMATICS ADVANCES 2024; 5:vbae203. [PMID: 39735577 PMCID: PMC11676323 DOI: 10.1093/bioadv/vbae203] [Received: 09/25/2024] [Revised: 11/27/2024] [Accepted: 12/11/2024] [Indexed: 12/31/2024]
Abstract
Motivation: Microbial signatures in the human microbiome are closely associated with various human diseases, driving the development of machine learning models for microbiome-based disease prediction. Despite progress, challenges remain in enhancing prediction accuracy, generalizability, and interpretability. Confounding factors, such as host's gender, age, and body mass index, significantly influence the human microbiome, complicating microbiome-based predictions. Results: To address these challenges, we developed MicroKPNN-MT, a unified model for predicting human phenotype based on microbiome data, as well as additional metadata like age and gender. This model builds on our earlier MicroKPNN framework, which incorporates prior knowledge of microbial species into neural networks to enhance prediction accuracy and interpretability. In MicroKPNN-MT, metadata, when available, serves as additional input features for prediction. Otherwise, the model predicts metadata from microbiome data using additional decoders. We applied MicroKPNN-MT to microbiome data collected in mBodyMap, covering healthy individuals and 25 different diseases, and demonstrated its potential as a predictive tool for multiple diseases, which at the same time provided predictions for the missing metadata. Our results showed that incorporating real or predicted metadata helped improve the accuracy of disease predictions, and more importantly, helped improve the generalizability of the predictive models. Availability and implementation: https://github.com/mgtools/MicroKPNN-MT.
Affiliation(s)
- Mahsa Monshizadeh
  - Computer Science Department, Indiana University, Bloomington, IN 47408, United States
- Yuhui Hong
  - Computer Science Department, Indiana University, Bloomington, IN 47408, United States
- Yuzhen Ye
  - Computer Science Department, Indiana University, Bloomington, IN 47408, United States
3
Wang B, Shen Y, Fang J, Su X, Xu ZZ. DeepPhylo: Phylogeny-Aware Microbial Embeddings Enhanced Predictive Accuracy in Human Microbiome Data Analysis. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2404277. [PMID: 39403892 PMCID: PMC11615782 DOI: 10.1002/advs.202404277] [Received: 04/22/2024] [Revised: 09/24/2024] [Indexed: 12/06/2024]
Abstract
Microbial data analysis poses significant challenges due to its high dimensionality, sparsity, and compositionality. Recent advances have shown that integrating abundance and phylogenetic information is an effective strategy for uncovering robust patterns and enhancing the predictive performance in microbiome studies. However, existing methods primarily focus on the hierarchical structure of phylogenetic trees, overlooking the evolutionary distances embedded within them. This study introduces DeepPhylo, a novel method that employs phylogeny-aware amplicon embeddings to effectively integrate abundance and phylogenetic information. DeepPhylo improves both the unsupervised discriminatory power and supervised predictive accuracy of microbiome data analysis. Compared to the existing methods, DeepPhylo demonstrates superiority in informing biologically relevant insights across five real-world microbiome use cases, including clustering of skin microbiomes, prediction of host chronological age and gender, diagnosis of inflammatory bowel disease (IBD) across 15 studies, and multilabel disease classification.
Affiliation(s)
- Bin Wang
  - School of Mathematics and Computer Sciences, Nanchang University, Nanchang 330031, China
- Yulong Shen
  - School of Information Engineering, Nanchang University, Nanchang 330031, China
- Jingyan Fang
  - School of Mathematics and Computer Sciences, Nanchang University, Nanchang 330031, China
- Xiaoquan Su
  - College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
- Zhenjiang Zech Xu
  - School of Mathematics and Computer Sciences, Nanchang University, Nanchang 330031, China
  - State Key Laboratory of Food Science and Technology, Nanchang University, Nanchang 330077, China
4
Bašić-Čičak D, Hasić Telalović J, Pašić L. Utilizing Artificial Intelligence for Microbiome Decision-Making: Autism Spectrum Disorder in Children from Bosnia and Herzegovina. Diagnostics (Basel) 2024; 14:2536. [PMID: 39594202 PMCID: PMC11592508 DOI: 10.3390/diagnostics14222536] [Received: 09/02/2024] [Revised: 10/17/2024] [Accepted: 10/21/2024] [Indexed: 11/28/2024] Open
Abstract
BACKGROUND/OBJECTIVES: The study of microbiome composition shows positive indications for application in the diagnosis and treatment of many conditions and diseases. One such condition is autism spectrum disorder (ASD). We aimed to analyze gut microbiome samples from children in Bosnia and Herzegovina to identify microbial differences between neurotypical children and those with ASD. Additionally, we developed machine learning classifiers to differentiate between the two groups using microbial abundance and predicted functional pathways. METHODS: A total of 60 gut microbiome samples (16S rRNA sequences) were analyzed, with 44 from children with ASD and 16 from neurotypical children. Four machine learning algorithms (Random Forest, Support Vector Classification, Gradient Boosting, and Extremely Randomized Tree Classifier) were applied to create eight classification models based on bacterial abundance at the genus level and KEGG pathways. Model accuracy was evaluated, and an external dataset was introduced to test model generalizability. RESULTS: The highest classification accuracy (80%) was achieved with Random Forest and Extremely Randomized Tree Classifier using genus-level taxa. The Random Forest model also performed well (78%) with KEGG pathways. When tested on an independent dataset, the model maintained high accuracy (79%), confirming its generalizability. CONCLUSIONS: This study identified significant microbial differences between neurotypical children and children with ASD. Machine learning classifiers, particularly Random Forest and Extremely Randomized Tree Classifier, achieved strong accuracy. Validation with external data demonstrated that the models could generalize across different datasets, highlighting their potential use.
Affiliation(s)
- Džana Bašić-Čičak
  - Computer Science Department, University Sarajevo School of Science and Technology, Hrasnička cesta 3a, 71000 Sarajevo, Bosnia and Herzegovina
- Jasminka Hasić Telalović
  - Computer Science Department, University Sarajevo School of Science and Technology, Hrasnička cesta 3a, 71000 Sarajevo, Bosnia and Herzegovina
- Lejla Pašić
  - Sarajevo Medical School, University Sarajevo School of Science and Technology, Hrasnička cesta 3a, 71000 Sarajevo, Bosnia and Herzegovina
5
Liu Z, Sun Y, Li Y, Ma A, Willaims NF, Jahanbahkshi S, Hoyd R, Wang X, Zhang S, Zhu J, Xu D, Spakowicz D, Ma Q, Liu B. An Explainable Graph Neural Framework to Identify Cancer-Associated Intratumoral Microbial Communities. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2403393. [PMID: 39225619 PMCID: PMC11538693 DOI: 10.1002/advs.202403393] [Received: 04/01/2024] [Revised: 06/26/2024] [Indexed: 09/04/2024]
Abstract
Microbes are extensively present among various cancer tissues and play critical roles in carcinogenesis and treatment responses. However, the underlying relationships between intratumoral microbes and tumors remain poorly understood. Here, a MIcrobial Cancer-association Analysis using a Heterogeneous graph transformer (MICAH) to identify intratumoral cancer-associated microbial communities is presented. MICAH integrates metabolic and phylogenetic relationships among microbes into a heterogeneous graph representation. It uses a graph transformer to holistically capture relationships between intratumoral microbes and cancer tissues, which improves the explainability of the associations between identified microbial communities and cancers. MICAH is applied to intratumoral bacterial data across 5 cancer types and 5 fungi datasets, and its generalizability and reproducibility are demonstrated. After experimentally testing a representative observation using a mouse model of tumor-microbe-immune interactions, a result consistent with MICAH's identified relationship is observed. Source tracking analysis reveals that the primary known contributor to a cancer-associated microbial community is the organs affected by the type of cancer. Overall, this graph neural network framework refines the number of microbes that can be used for follow-up experimental validation from thousands to tens, thereby helping to accelerate the understanding of the relationship between tumors and intratumoral microbiomes.
Affiliation(s)
- Zhaoqian Liu
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
  - College of Sciences, Xi'an University of Science and Technology, Xi'an, Shanxi 710054, China
- Yuhan Sun
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Yingjie Li
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Anjun Ma
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The Ohio State University, Columbus, OH 43210, USA
- Nyelia F. Willaims
  - Department of Internal Medicine, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Shiva Jahanbahkshi
  - Department of Food Science and Technology, College of Food, Agricultural, and Environmental Sciences, The Ohio State University, Columbus, OH 43210, USA
- Rebecca Hoyd
  - Department of Internal Medicine, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Xiaoying Wang
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The Ohio State University, Columbus, OH 43210, USA
- Shiqi Zhang
  - Department of Human Sciences, College of Education and Human Ecology, The Ohio State University, Columbus, OH 43210, USA
- Jiangjiang Zhu
  - Department of Human Sciences, College of Education and Human Ecology, The Ohio State University, Columbus, OH 43210, USA
- Dong Xu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65201, USA
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65201, USA
- Daniel Spakowicz
  - Pelotonia Institute for Immuno-Oncology, The Ohio State University, Columbus, OH 43210, USA
  - Department of Internal Medicine, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Qin Ma
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The Ohio State University, Columbus, OH 43210, USA
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
  - Shandong National Center for Applied Mathematics, Jinan, Shandong 250199, China
6
Shi K, Liu Q, Ji Q, He Q, Zhao XM. MicroHDF: predicting host phenotypes with metagenomic data using a deep forest-based framework. Brief Bioinform 2024; 25:bbae530. [PMID: 39446191 PMCID: PMC11500453 DOI: 10.1093/bib/bbae530] [Received: 03/13/2024] [Revised: 09/25/2024] [Accepted: 10/07/2024] [Indexed: 10/25/2024] Open
Abstract
The gut microbiota plays a vital role in human health, and significant effort has been made to predict human phenotypes, especially diseases, with machine learning (ML) methods that use the microbiota as a promising indicator or predictor. However, accuracy is affected by many factors when predicting host phenotypes from metagenomic data, e.g. small sample size, class imbalance, and high-dimensional features. To address these challenges, we propose MicroHDF, an interpretable deep learning framework to predict host phenotypes, in which cascaded layers of deep forest units are designed to handle sample class imbalance and high-dimensional features. The experimental results show that the performance of MicroHDF is competitive with that of existing state-of-the-art methods on 13 publicly available datasets of six different diseases. In particular, it performs best with an area under the receiver operating characteristic curve of 0.9182 ± 0.0098 and 0.9469 ± 0.0076 for inflammatory bowel disease (IBD) and liver cirrhosis, respectively. MicroHDF also shows better performance and robustness in cross-study validation. Furthermore, MicroHDF is applied to two high-risk diseases, IBD and autism spectrum disorder, as case studies to identify potential biomarkers. In conclusion, our method provides an effective and reliable prediction of the host phenotype and discovers informative features with biological insights.
Affiliation(s)
- Kai Shi
  - College of Computer Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China
  - Guangxi Key Laboratory of Embedded Technology and Intelligent Systems, Guilin University of Technology, Guilin, Guangxi 541004, China
- Qiaohui Liu
  - College of Computer Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China
- Qingrong Ji
  - College of Computer Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China
- Qisheng He
  - College of Computer Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China
- Xing-Ming Zhao
  - Huzhou Central Hospital, Affiliated Central Hospital Huzhou University, Huzhou, Zhejiang 313000, China
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
7
Jiang C, Yang J, Peng X, Li X. A permutable MLP-like architecture for disease prediction from gut metagenomic data. BMC Bioinformatics 2024; 25:246. [PMID: 39048979 PMCID: PMC11270793 DOI: 10.1186/s12859-024-05856-w] [Received: 09/14/2023] [Accepted: 07/05/2024] [Indexed: 07/27/2024] Open
Abstract
Metagenomic data plays a crucial role in analyzing the relationship between microbes and diseases. However, the limited number of samples, high dimensionality, and sparsity of metagenomic data pose significant challenges for the application of deep learning in data classification and prediction. Previous studies have shown that utilizing the phylogenetic tree structure to transform metagenomic abundance data into a 2D matrix input for convolutional neural networks (CNNs) improves classification performance. Inspired by the success of a Permutable MLP-like architecture in visual recognition, we propose Metagenomic Permutator (MetaP), which applies the Permutable MLP-like network structure to capture the phylogenetic information of microbes within the 2D matrix formed by the phylogenetic tree. Our experiments demonstrate that our model achieves competitive performance compared to other deep neural networks and traditional machine learning, and has good prospects for multi-classification and large sample sizes. Furthermore, we utilize the SHAP (SHapley Additive exPlanations) method to interpret our model predictions, identifying the microbial features that are associated with diseases.
Affiliation(s)
- Cong Jiang
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
  - National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China
- Jian Yang
  - Beijing Key Laboratory of Mental Disorders, National Clinical Research Center for Mental Disorders and National Center for Mental Disorders, Beijing Anding Hospital, Capital Medical University, Beijing, China
  - Advanced Innovation Center for Human Brain Protection, Capital Medical University, Beijing, China
- Xiaogang Peng
  - National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China
- Xiaozheng Li
  - College of Life Sciences and Oceanography, Shenzhen University, Shenzhen, China
  - JCY Biotech Ltd., Pingshan Translational Medicine Center, Shenzhen Bay Laboratory, Shenzhen, China
8
Teixeira M, Silva F, Ferreira RM, Pereira T, Figueiredo C, Oliveira HP. A review of machine learning methods for cancer characterization from microbiome data. NPJ Precis Oncol 2024; 8:123. [PMID: 38816569 PMCID: PMC11139966 DOI: 10.1038/s41698-024-00617-7] [Received: 01/15/2024] [Accepted: 05/17/2024] [Indexed: 06/01/2024] Open
Abstract
Recent studies have shown that the microbiome can impact cancer development, progression, and response to therapies, suggesting microbiome-based approaches for cancer characterization. As cancer-related signatures are complex and implicate many taxa, their discovery often requires Machine Learning (ML) approaches. This review discusses ML methods for cancer characterization from microbiome data. It focuses on the implications of choices made during sample collection, feature selection and pre-processing. It also discusses ML model selection, offering guidance on how to choose an ML model, and model validation. Finally, it enumerates current limitations and how these may be overcome. Proposed methods, often based on Random Forests, show promising results that are nevertheless insufficient for widespread clinical usage. Studies often report conflicting results, mainly due to ML models with poor generalizability. We expect that evaluating models with expanded, hold-out datasets, removing technical artifacts, exploring representations of the microbiome other than taxonomical profiles, leveraging advances in deep learning, and developing ML models better adapted to the characteristics of microbiome data will improve the performance and generalizability of models and enable their usage in the clinic.
Affiliation(s)
- Marco Teixeira
  - Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal
  - Faculty of Engineering, University of Porto, Porto, Portugal
- Francisco Silva
  - Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal
  - Faculty of Science, University of Porto, Porto, Portugal
- Rui M Ferreira
  - Ipatimup - Institute of Molecular Pathology and Immunology of the University of Porto, Porto, Portugal
  - Instituto de Investigação e Inovação em Saúde, University of Porto, Porto, Portugal
- Tania Pereira
  - Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal
  - Faculty of Sciences and Technology, University of Coimbra, Coimbra, Portugal
- Ceu Figueiredo
  - Ipatimup - Institute of Molecular Pathology and Immunology of the University of Porto, Porto, Portugal
  - Instituto de Investigação e Inovação em Saúde, University of Porto, Porto, Portugal
  - Faculty of Medicine, University of Porto, Porto, Portugal
- Hélder P Oliveira
  - Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal
  - Faculty of Science, University of Porto, Porto, Portugal
9
Shtossel O, Finkelstein S, Louzoun Y. mi-Mic: a novel multi-layer statistical test for microbiota-disease associations. Genome Biol 2024; 25:113. [PMID: 38693546 PMCID: PMC11064322 DOI: 10.1186/s13059-024-03256-0] [Received: 12/06/2023] [Accepted: 04/22/2024] [Indexed: 05/03/2024] Open
Abstract
mi-Mic, a novel approach for microbiome differential abundance analysis, tackles the key challenges of such statistical tests: a large number of tests, sparsity, varying abundance scales, and taxonomic relationships. mi-Mic first converts microbial counts to a cladogram of means. It then applies a priori tests on the upper levels of the cladogram to detect overall relationships. Finally, it performs a Mann-Whitney test on paths that are consistently significant along the cladogram or on the leaves. mi-Mic has much higher true-to-false-positive ratios than existing tests, as measured by a new real-to-shuffle positive score.
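As a rough illustration of the first step named in the abstract above (converting microbial counts to a cladogram of means), the following Python sketch averages leaf values up a taxonomy for a single sample. It is not the authors' implementation: the function name `cladogram_of_means`, the semicolon-separated lineage strings, the log10 transform, and the pseudocount of 1 are all assumptions made for this example.

```python
from collections import defaultdict
from math import log10


def cladogram_of_means(sample_counts, pseudocount=1.0):
    """Return {taxonomic node: mean log-abundance of the leaves beneath it}.

    sample_counts maps a full lineage string (levels separated by ';',
    a hypothetical input format) to a raw count for one sample.
    The log10 transform and the pseudocount are illustrative assumptions.
    """
    leaves_under = defaultdict(list)
    for lineage, count in sample_counts.items():
        value = log10(count + pseudocount)
        levels = lineage.split(";")
        # Register this leaf's value at every ancestor and at the leaf itself.
        for depth in range(1, len(levels) + 1):
            node = ";".join(levels[:depth])
            leaves_under[node].append(value)
    # Each node's value is the plain mean of the leaf values below it.
    return {node: sum(vals) / len(vals) for node, vals in leaves_under.items()}


if __name__ == "__main__":
    counts = {
        "Firmicutes;Clostridia": 9,
        "Firmicutes;Bacilli": 99,
        "Bacteroidetes;Bacteroidia": 0,
    }
    for node, mean_value in sorted(cladogram_of_means(counts).items()):
        print(node, round(mean_value, 3))
```

Under these assumptions, an internal node such as `Firmicutes` carries the average of its descendant leaves, so tests at upper levels of the tree see a denser, less sparse signal than tests on individual leaves.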
Affiliation(s)
- Oshrit Shtossel
  - Department of Mathematics, Bar-Ilan University, Ramat Gan, 52900, Israel
- Shani Finkelstein
  - Department of Mathematics, Bar-Ilan University, Ramat Gan, 52900, Israel
- Yoram Louzoun
  - Department of Mathematics, Bar-Ilan University, Ramat Gan, 52900, Israel
10
Yi X, He Y, Gao S, Li M. A review of the application of deep learning in obesity: From early prediction aid to advanced management assistance. Diabetes Metab Syndr 2024; 18:103000. [PMID: 38604060 DOI: 10.1016/j.dsx.2024.103000] [Received: 05/08/2022] [Revised: 01/23/2024] [Accepted: 03/29/2024] [Indexed: 04/13/2024]
Abstract
BACKGROUND AND AIMS: Obesity is a chronic disease which can cause severe metabolic disorders. Machine learning (ML) techniques, especially deep learning (DL), have proven to be useful in obesity research. However, there is a dearth of systematic reviews of DL applications in obesity. This article aims to summarize the current trend of DL usage in obesity research. METHODS: An extensive literature review was carried out across multiple databases, including PubMed, Embase, Web of Science, Scopus, and Medline, to collate relevant studies published from January 2018 to September 2023. The focus was on research detailing the application of DL in the context of obesity. We have distilled critical insights pertaining to the utilized learning models, encompassing aspects of their development, principal results, and foundational methodologies. RESULTS: Our analysis culminated in the synthesis of new knowledge regarding the application of DL in the context of obesity. Finally, 40 research articles were included. These studies can be divided into three categories: obesity prediction (n = 16); obesity management (n = 13); and body fat estimation (n = 11). CONCLUSIONS: This is the first review to examine DL applications in obesity. It reveals DL's superiority in obesity prediction over traditional ML methods, showing promise for multi-omics research. DL also innovates in obesity management through diet, fitness, and environmental analyses. Additionally, DL improves body fat estimation, offering affordable and precise monitoring tools. The study is registered with PROSPERO (ID: CRD42023475159).
Affiliation(s)
- Xinghao Yi
  - Department of Endocrinology, NHC Key Laboratory of Endocrinology, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing 100730, China
- Yangzhige He
  - Department of Medical Research Center, Peking Union Medical College Hospital, Chinese Academy of Medical Science & Peking Union Medical College, Beijing 100730, China
- Shan Gao
  - Department of Endocrinology, Xuan Wu Hospital, Capital Medical University, Beijing 10053, China
- Ming Li
  - Department of Endocrinology, NHC Key Laboratory of Endocrinology, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing 100730, China
11
Roy G, Prifti E, Belda E, Zucker JD. Deep learning methods in metagenomics: a review. Microb Genom 2024; 10:001231. [PMID: 38630611 PMCID: PMC11092122 DOI: 10.1099/mgen.0.001231] [Received: 12/20/2023] [Accepted: 03/27/2024] [Indexed: 04/19/2024] Open
Abstract
The ever-decreasing cost of sequencing and the growing potential applications of metagenomics have led to an unprecedented surge in data generation. One of the most prevalent applications of metagenomics is the study of microbial environments, such as the human gut. The gut microbiome plays a crucial role in human health, providing vital information for patient diagnosis and prognosis. However, analysing metagenomic data remains challenging due to several factors, including reference catalogues, sparsity and compositionality. Deep learning (DL) enables novel and promising approaches that complement state-of-the-art microbiome pipelines. DL-based methods can address almost all aspects of microbiome analysis, including novel pathogen detection, sequence classification, patient stratification and disease prediction. Beyond generating predictive models, a key aspect of these methods is also their interpretability. This article reviews DL approaches in metagenomics, including convolutional networks, autoencoders and attention-based models. These methods aggregate contextualized data and pave the way for improved patient care and a better understanding of the microbiome's key role in our health.
Affiliation(s)
- Gaspar Roy
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
- Edi Prifti
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
- Eugeni Belda
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
- Jean-Daniel Zucker
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
12
Shtossel O, Koren O, Shai I, Rinott E, Louzoun Y. Gut microbiome-metabolome interactions predict host condition. MICROBIOME 2024; 12:24. [PMID: 38336867 PMCID: PMC10858481 DOI: 10.1186/s40168-023-01737-1] [Received: 01/04/2023] [Accepted: 12/10/2023] [Indexed: 02/12/2024]
Abstract
BACKGROUND: The effect of microbes on their human host is often mediated through changes in metabolite concentrations. As such, multiple tools have been proposed to predict metabolite concentrations from microbial taxa frequencies. Such tools typically fail to capture the dependence of the microbiome-metabolite relation on the environment. RESULTS: We propose to treat the microbiome-metabolome relation as the equilibrium of a complex interaction and to relate the host condition to a latent representation of the interaction between the log concentration of the metabolome and the log frequencies of the microbiome. We develop LOCATE (Latent variables Of miCrobiome And meTabolites rElations), a machine learning tool to predict the metabolite concentration from the microbiome composition and produce a latent representation of the interaction. This representation is then used to predict the host condition. LOCATE's accuracy in predicting the metabolome is higher than that of all current predictors. The metabolite concentration prediction accuracy decreases significantly across datasets and across conditions, especially in 16S data. LOCATE's latent representation predicts the host condition better than either the microbiome or the metabolome. This representation is strongly correlated with host demographics. A significant improvement in accuracy (0.793 vs. 0.724 average accuracy) is obtained even with a small number of metabolite samples ([Formula: see text]). CONCLUSION: These results suggest that a latent representation of the microbiome-metabolome interaction leads to a better association with the host condition than either of the two separately or a simple combination of the two.
Affiliation(s)
- Oshrit Shtossel
- Department of Mathematics, Bar-Ilan University, Ramat Gan, 52900, Israel
- Omry Koren
- The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Iris Shai
- Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Ehud Rinott
- Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Yoram Louzoun
- Department of Mathematics, Bar-Ilan University, Ramat Gan, 52900, Israel.
13
Wang C, Ma A, Li Y, McNutt ME, Zhang S, Zhu J, Hoyd R, Wheeler CE, Robinson LA, Chan CH, Zakharia Y, Dodd RD, Ulrich CM, Hardikar S, Churchman ML, Tarhini AA, Singer EA, Ikeguchi AP, McCarter MD, Denko N, Tinoco G, Husain M, Jin N, Osman AE, Eljilany I, Tan AC, Coleman SS, Denko L, Riedlinger G, Schneider BP, Spakowicz D, Ma Q. A Bioinformatics Tool for Identifying Intratumoral Microbes from the ORIEN Dataset. Cancer Research Communications 2024; 4:293-302. [PMID: 38259095 PMCID: PMC10840455 DOI: 10.1158/2767-9764.crc-23-0213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 09/26/2023] [Accepted: 01/04/2024] [Indexed: 01/24/2024]
Abstract
Evidence supports significant interactions among microbes, immune cells, and tumor cells in at least 10%-20% of human cancers, emphasizing the importance of further investigating these complex relationships. However, the implications and significance of tumor-related microbes remain largely unknown. Studies have demonstrated the critical roles of host microbes in cancer prevention and treatment responses. Understanding interactions between host microbes and cancer can drive cancer diagnosis and microbial therapeutics (bugs as drugs). Computational identification of cancer-specific microbes and their associations is still challenging for two reasons. First, intratumoral microbiome data are high-dimensional and highly sparse, so large datasets containing sufficient event observations are required to identify relationships. Second, interactions within microbial communities, heterogeneity in microbial composition, and other confounding effects can lead to spurious associations. To address these issues, we present a bioinformatics tool, microbial graph attention (MEGA), to identify the microbes most strongly associated with 12 cancer types. We demonstrate its utility on a dataset from a consortium of nine cancer centers in the Oncology Research Information Exchange Network. This package has three unique features: species-sample relations are represented in a heterogeneous graph and learned by a graph attention network; it incorporates metabolic and phylogenetic information to reflect intricate relationships within microbial communities; and it provides multiple functionalities for association interpretations and visualizations. We analyzed 2,704 tumor RNA sequencing samples and MEGA interpreted the tissue-resident microbial signatures of each of 12 cancer types. MEGA can effectively identify cancer-associated microbial signatures and refine their interactions with tumors.
SIGNIFICANCE Studying the tumor microbiome in high-throughput sequencing data is challenging because of the extremely sparse data matrices, heterogeneity, and high likelihood of contamination. We present a new deep learning tool, MEGA, to refine the organisms that interact with tumors.
Affiliation(s)
- Cankun Wang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, Ohio
- Anjun Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, Ohio
- Pelotonia Institute for Immuno-Oncology, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Yingjie Li
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, Ohio
- Megan E. McNutt
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, Ohio
- Shiqi Zhang
- Department of Human Sciences, College of Education and Human Ecology, The Ohio State University, Columbus, Ohio
- Jiangjiang Zhu
- Department of Human Sciences, College of Education and Human Ecology, The Ohio State University, Columbus, Ohio
- Rebecca Hoyd
- Division of Medical Oncology, Department of Internal Medicine, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Caroline E. Wheeler
- Division of Medical Oncology, Department of Internal Medicine, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Lary A. Robinson
- Department of Thoracic Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida
- Carlos H.F. Chan
- University of Iowa, Holden Comprehensive Cancer Center, Iowa City, Iowa
- Yousef Zakharia
- Division of Oncology, Hematology and Blood & Marrow Transplantation, University of Iowa, Holden Comprehensive Cancer Center, Iowa City, Iowa
- Rebecca D. Dodd
- Department of Internal Medicine, University of Iowa, Iowa City, Iowa
- Cornelia M. Ulrich
- Department of Population Health Sciences, Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah
- Sheetal Hardikar
- Department of Population Health Sciences, Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah
- Ahmad A. Tarhini
- Departments of Cutaneous Oncology and Immunology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida
- Eric A. Singer
- Department of Urologic Oncology, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Alexandra P. Ikeguchi
- Department of Hematology/Oncology, Stephenson Cancer Center of University of Oklahoma, Oklahoma City, Oklahoma
- Martin D. McCarter
- Department of Surgery, University of Colorado School of Medicine, Aurora, Colorado
- Nicholas Denko
- Department of Radiation Oncology, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Gabriel Tinoco
- Division of Medical Oncology, Department of Internal Medicine, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Marium Husain
- Division of Medical Oncology, Department of Internal Medicine, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Ning Jin
- Division of Medical Oncology, Department of Internal Medicine, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Afaf E.G. Osman
- Department of Internal Medicine, University of Utah, Salt Lake City, Utah
- Islam Eljilany
- Clinical Science Lab – Cutaneous Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida
- Aik Choon Tan
- Departments of Oncological Science and Biomedical Informatics, Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah
- Samuel S. Coleman
- Departments of Oncological Science and Biomedical Informatics, Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah
- Louis Denko
- Pelotonia Institute for Immuno-Oncology, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Division of Medical Oncology, Department of Internal Medicine, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Gregory Riedlinger
- Department of Precision Medicine, Rutgers Cancer Institute of New Jersey, New Brunswick, New Jersey
- Bryan P. Schneider
- Indiana University Simon Comprehensive Cancer Center, Indianapolis, Indiana
- Daniel Spakowicz
- Pelotonia Institute for Immuno-Oncology, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Division of Medical Oncology, Department of Internal Medicine, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
- Qin Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, Ohio
- Pelotonia Institute for Immuno-Oncology, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio
14
Xu H, Wang T, Miao Y, Qian M, Yang Y, Wang S. MK-BMC: a Multi-Kernel framework with Boosted distance metrics for Microbiome data for Classification. Bioinformatics 2024; 40:btad757. [PMID: 38200571 PMCID: PMC10789312 DOI: 10.1093/bioinformatics/btad757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 10/30/2023] [Accepted: 01/09/2024] [Indexed: 01/12/2024] Open
Abstract
MOTIVATION Research on the human microbiome has suggested associations with human health, opening opportunities to predict health outcomes using the microbiome. Studies have also suggested that diverse forms of taxa, such as rare taxa that are evolutionarily related and abundant taxa that are evolutionarily unrelated, could be associated with or predictive of a health outcome. Although prediction models have been developed for microbiome data, no prediction models currently exist that use multiple forms of microbiome-outcome associations. RESULTS We developed MK-BMC, a Multi-Kernel framework with Boosted distance Metrics for Classification using microbiome data. We propose to first boost widely used distance metrics for microbiome data using taxon-level association signal strengths to up-weight taxa that are potentially associated with an outcome of interest. We then propose a multi-kernel prediction model in which each kernel captures one form of association between taxa and the outcome. Each kernel measures similarities of microbiome compositions between pairs of samples and is obtained by transforming a proposed boosted distance metric. Through extensive simulations, we demonstrated superior prediction performance of (i) boosted distance metrics for microbiome data over the original ones and (ii) MK-BMC over competing methods. We applied MK-BMC to predict thyroid, obesity, and inflammatory bowel disease status using gut microbiome data from the American Gut Project and observed much-improved prediction performance over that of competing methods. The learned kernel weights help us understand the contributions of individual microbiome signal forms. AVAILABILITY AND IMPLEMENTATION Source code together with a sample input dataset is available at https://github.com/HXu06/MK-BMC.
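As a rough illustration of what up-weighting taxa inside a distance metric can look like, the sketch below computes a weighted Bray-Curtis dissimilarity in which each taxon's contribution is scaled by a nonnegative weight. The weighting scheme, the function name `weighted_bray_curtis`, and the toy abundance table are assumptions made for illustration only; they are not taken from the MK-BMC paper or its repository.

```python
import numpy as np

def weighted_bray_curtis(X, w):
    """Pairwise weighted Bray-Curtis dissimilarity.

    X: (n_samples, n_taxa) nonnegative abundances.
    w: (n_taxa,) nonnegative taxon weights (hypothetical association strengths).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    diff = np.abs(X[:, None, :] - X[None, :, :])  # |x_ik - x_jk| per taxon
    tot = X[:, None, :] + X[None, :, :]           # x_ik + x_jk per taxon
    num = (diff * w).sum(axis=2)
    den = (tot * w).sum(axis=2)
    # Define the dissimilarity as 0 where both samples carry no weighted mass.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Toy abundance table: 3 samples x 3 taxa (made-up numbers).
X = np.array([[10, 0, 5], [8, 2, 5], [0, 9, 1]])
D_plain = weighted_bray_curtis(X, np.ones(3))                  # ordinary Bray-Curtis
D_boost = weighted_bray_curtis(X, np.array([0.1, 2.0, 0.1]))   # second taxon up-weighted
```

With equal weights the function reduces to ordinary Bray-Curtis. Up-weighting the second taxon makes samples that differ mainly in that taxon look further apart.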
Affiliation(s)
- Huang Xu
- Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Tian Wang
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, United States
- Yuqi Miao
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, United States
- Min Qian
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, United States
- Yaning Yang
- Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Shuang Wang
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, United States
15
Monshizadeh M, Ye Y. Incorporating metabolic activity, taxonomy and community structure to improve microbiome-based predictive models for host phenotype prediction. Gut Microbes 2024; 16:2302076. [PMID: 38214657 PMCID: PMC10793686 DOI: 10.1080/19490976.2024.2302076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 01/02/2024] [Indexed: 01/13/2024] Open
Abstract
We developed MicroKPNN, a prior-knowledge guided interpretable neural network for microbiome-based human host phenotype prediction. The prior knowledge used in MicroKPNN includes the metabolic activities of different bacterial species, phylogenetic relationships, and bacterial community structure, all in a shallow neural network. Application of MicroKPNN to seven gut microbiome datasets (involving five different human diseases including inflammatory bowel disease, type 2 diabetes, liver cirrhosis, colorectal cancer, and obesity) shows that incorporation of the prior knowledge helped improve the microbiome-based host phenotype prediction. MicroKPNN outperformed fully connected neural network-based approaches in all seven cases, with the largest improvement in accuracy in the prediction of type 2 diabetes. MicroKPNN also outperformed DeepMicro, a recently developed deep learning-based approach that selects the best combination of autoencoder and machine learning approach to make predictions, in all seven cases. Importantly, we showed that MicroKPNN provides a way to interpret the predictive models. Using importance scores estimated for the hidden nodes, MicroKPNN could provide explanations for prior research findings by highlighting the roles of specific microbiome components in phenotype predictions. In addition, it may suggest potential future research directions for studying the impacts of the microbiome on host health and diseases. MicroKPNN is publicly available at https://github.com/mgtools/MicroKPNN.
Affiliation(s)
- Mahsa Monshizadeh
- Computer Science Department, Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN, USA
- Yuzhen Ye
- Computer Science Department, Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN, USA
16
Liao H, Shang J, Sun Y. GDmicro: classifying host disease status with GCN and deep adaptation network based on the human gut microbiome data. Bioinformatics 2023; 39:btad747. [PMID: 38085234 PMCID: PMC10749762 DOI: 10.1093/bioinformatics/btad747] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Revised: 11/16/2023] [Accepted: 12/11/2023] [Indexed: 12/27/2023] Open
Abstract
MOTIVATION With advances in metagenomic sequencing technologies, a growing number of studies are revealing associations between the human gut microbiome and human diseases. These associations shed light on using gut microbiome data to distinguish case and control samples of a specific disease, a task also called host disease status classification. Importantly, using learning-based models to distinguish disease and control samples is expected to identify important biomarkers more accurately than abundance-based statistical analysis. However, available tools have not fully addressed two challenges associated with this task: limited labeled microbiome data and decreased accuracy across studies. Confounding factors, such as diet and technical biases in sample collection and sequencing across different studies or cohorts, often jeopardize the generalization of the learning model. RESULTS To address these challenges, we develop a new tool, GDmicro, which combines semi-supervised learning and domain adaptation to achieve a more generalized model using limited labeled samples. We evaluated GDmicro on human gut microbiome data from 11 cohorts covering 5 different diseases. The results show that GDmicro has better performance and robustness than state-of-the-art tools. In particular, it improves the AUC from 0.783 to 0.949 in identifying inflammatory bowel disease. Furthermore, GDmicro can identify potential biomarkers with greater accuracy than abundance-based statistical analysis methods. It also reveals the contribution of these biomarkers to the host's disease status. AVAILABILITY AND IMPLEMENTATION https://github.com/liaoherui/GDmicro.
Affiliation(s)
- Herui Liao
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), 518057, China
- Jiayu Shang
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), 518057, China
- Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), 518057, China
17
Angelova IY, Kovtun AS, Averina OV, Koshenko TA, Danilenko VN. Unveiling the Connection between Microbiota and Depressive Disorder through Machine Learning. Int J Mol Sci 2023; 24:16459. [PMID: 38003647 PMCID: PMC10671666 DOI: 10.3390/ijms242216459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2023] [Revised: 11/13/2023] [Accepted: 11/15/2023] [Indexed: 11/26/2023] Open
Abstract
In the last few years, investigation of the gut-brain axis and the connection between the gut microbiota and the human nervous system and mental health has become one of the most popular topics. Correlations between the taxonomic and functional changes in gut microbiota and major depressive disorder have been shown in several studies. Machine learning provides a promising approach to analyze large-scale metagenomic data and identify biomarkers associated with depression. In this work, machine learning algorithms, such as random forest, elastic net, and You Only Look Once (YOLO), were utilized to detect significant features in microbiome samples and classify individuals based on their disorder status. The analysis was conducted on metagenomic data obtained during the study of gut microbiota of healthy people and patients with major depressive disorder. The YOLO method showed the greatest effectiveness in the analysis of the metagenomic samples and confirmed the experimental results on the critical importance of a reduction in the amount of Faecalibacterium prausnitzii for the manifestation of depression. These findings could contribute to a better understanding of the role of the gut microbiota in major depressive disorder and potentially lead the way for novel diagnostic and therapeutic strategies.
Affiliation(s)
- Irina Y. Angelova
- Vavilov Institute of General Genetics, Russian Academy of Sciences (RAS), 119333 Moscow, Russia; (A.S.K.); (O.V.A.); (V.N.D.)
18
Park H, Lim SJ, Cosme J, O'Connell K, Sandeep J, Gayanilo F, Cutter Jr. GR, Montes E, Nitikitpaiboon C, Fisher S, Moustahfid H, Thompson LR. Investigation of machine learning algorithms for taxonomic classification of marine metagenomes. Microbiol Spectr 2023; 11:e0523722. [PMID: 37695074 PMCID: PMC10580933 DOI: 10.1128/spectrum.05237-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Accepted: 06/30/2023] [Indexed: 09/12/2023] Open
Abstract
IMPORTANCE Taxonomic profiling of microbial communities is essential to model microbial interactions and inform habitat conservation. This work develops approaches in constructing training/testing data sets from publicly available marine metagenomes and evaluates the performance of machine learning (ML) approaches in read-based taxonomic classification of marine metagenomes. Predictions from two models are used to test accuracy in metagenomic classification and to guide improvements in ML approaches. Our study provides insights on the methods, results, and challenges of deep learning on marine microbial metagenomic data sets. Future machine learning approaches can be improved by rectifying genome coverage and class imbalance in the training data sets, developing alternative models, and increasing the accessibility of computational resources for model training and refinement.
Affiliation(s)
- Helen Park
- Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing, China
- EPSRC/BBSRC Future Biomanufacturing Research Hub, EPSRC Synthetic Biology Research Centre SYNBIOCHEM Manchester Institute of Biotechnology and School of Chemistry, The University of Manchester, Manchester, United Kingdom
- Shen Jean Lim
- Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine, Atmospheric, and Earth Science, University of Miami, Miami, Florida, USA
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
- College of Marine Science, University of South Florida, St Petersburg, Florida, USA
- Kyle O'Connell
- Deloitte Consulting LLP, Biomedical Data Science Team, Arlington, Virginia, USA
- Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Northwest, Washington, DC, USA
- Jilla Sandeep
- Harte Research Institute, Texas A&M University-Corpus Christi, Corpus Christi, Texas, USA
- Felimon Gayanilo
- Harte Research Institute, Texas A&M University-Corpus Christi, Corpus Christi, Texas, USA
- George R. Cutter Jr.
- Southwest Fisheries Science Center, Antarctic Ecosystem Research Division, National Oceanic and Atmospheric Administration, La Jolla, California, USA
- Enrique Montes
- Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine, Atmospheric, and Earth Science, University of Miami, Miami, Florida, USA
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
- Chotinan Nitikitpaiboon
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
- Sam Fisher
- Deloitte Consulting LLP, Biomedical Data Science Team, Arlington, Virginia, USA
- Hassan Moustahfid
- NOAA/US Integrated Ocean Observing System (IOOS), Silver Spring, Maryland, USA
- Luke R. Thompson
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
- Northern Gulf Institute, Mississippi State University, Mississippi, USA
19
Cui Z, Wu Y, Zhang QH, Wang SG, He Y, Huang DS. MV-CVIB: a microbiome-based multi-view convolutional variational information bottleneck for predicting metastatic colorectal cancer. Front Microbiol 2023; 14:1238199. [PMID: 37675425 PMCID: PMC10477591 DOI: 10.3389/fmicb.2023.1238199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2023] [Accepted: 08/02/2023] [Indexed: 09/08/2023] Open
Abstract
Introduction Imbalances in gut microbes have been implicated in many human diseases, including colorectal cancer (CRC), inflammatory bowel disease, type 2 diabetes, obesity, autism, and Alzheimer's disease. Compared with other human diseases, CRC is a gastrointestinal malignancy with high mortality and a high probability of metastasis. However, current studies mainly focus on the prediction of colorectal cancer while neglecting the more serious malignancy of metastatic colorectal cancer (mCRC). In addition, the high dimensionality and small sample sizes of gut microbial data make them complex, which increases the difficulty for traditional machine learning models. Methods To address these challenges, we collected and processed 16S rRNA data and calculated abundance data from patients with non-metastatic colorectal cancer (non-mCRC) and mCRC. In contrast to the traditional health-disease classification strategy, we adopted a novel disease-disease classification strategy and proposed a microbiome-based multi-view convolutional variational information bottleneck (MV-CVIB). Results The experimental results show that MV-CVIB can effectively predict mCRC. The model achieves AUC values above 0.9 in comparisons with other state-of-the-art models. In addition, MV-CVIB achieved satisfactory predictive performance on multiple published CRC gut microbiome datasets. Discussion Finally, multiple gut microbiota analyses were used to elucidate communities and differences between mCRC and non-mCRC, and the metastatic properties of CRC were assessed by patient age and microbiota expression.
Affiliation(s)
- Zhen Cui
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, China
- Yan Wu
- College of Electronics and Information Engineering, Tongji University, Shanghai, China
- Qin-Hu Zhang
- EIT Institute for Advanced Study, Ningbo, Zhejiang, China
- Si-Guo Wang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, China
- Ying He
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, China
20
Liu Y, Zhang YZ, Imoto S. Microbial Gene Ontology informed deep neural network for microbe functionality discovery in human diseases. PLoS One 2023; 18:e0290307. [PMID: 37603579 PMCID: PMC10441785 DOI: 10.1371/journal.pone.0290307] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/27/2023] [Accepted: 08/04/2023] [Indexed: 08/23/2023] Open
Abstract
The human microbiome plays a crucial role in human health and is associated with a number of human diseases. Determining microbiome functional roles in human diseases remains a biological challenge due to the high dimensionality of metagenome gene features. However, existing models are limited in providing biological interpretability, leaving the functional role of microbes in human diseases unexplored. Here we propose a neural network-based model that incorporates the Gene Ontology (GO) relationship network to discover microbe functionality in human diseases. We use four benchmark datasets, covering diabetes, liver cirrhosis, inflammatory bowel disease, and colorectal cancer, to explore microbe functionality in these diseases. Our model discovered and visualized novel candidate microbiome genes and their functions by calculating the importance score of each gene and GO term in the network. Furthermore, we demonstrate that our model achieves competitive performance in predicting disease compared with other models that are not informed by Gene Ontology. The discovered candidate microbiome genes and their functions provide novel insights into microbe functional contribution.
Affiliation(s)
- Yunjie Liu
- Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Yao-zhong Zhang
- Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Seiya Imoto
- Institute of Medical Science, The University of Tokyo, Tokyo, Japan
21
Li B, Wang T, Qian M, Wang S. MKMR: a multi-kernel machine regression model to predict health outcomes using human microbiome data. Brief Bioinform 2023; 24:7142722. [PMID: 37099694 DOI: 10.1093/bib/bbad158] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/28/2022] [Revised: 03/24/2023] [Accepted: 04/03/2023] [Indexed: 04/28/2023] Open
Abstract
Studies have found that the human microbiome is associated with and predictive of human health and diseases. Many statistical methods developed for microbiome data focus on different distance metrics that can capture various information in microbiomes. Prediction models have also been developed for microbiome data, including deep learning methods with convolutional neural networks that consider both taxa abundance profiles and taxonomic relationships among microbial taxa from a phylogenetic tree. Studies have also suggested that a health outcome could be associated with multiple forms of microbiome profiles. In addition to the abundance of some taxa that are associated with a health outcome, the presence/absence of some taxa is also associated with and predictive of the same health outcome. Moreover, associated taxa may be close to each other on a phylogenetic tree or spread apart on it. No prediction models currently exist that use multiple forms of microbiome-outcome associations. To address this, we propose a multi-kernel machine regression (MKMR) method that is able to capture various types of microbiome signals when making predictions. MKMR utilizes multiple forms of microbiome signals through multiple kernels transformed from multiple distance metrics for microbiomes and learns an optimal conic combination of these kernels, with kernel weights helping us understand the contributions of individual microbiome signal types. Simulation studies suggest much-improved prediction performance over competing methods when microbiome signals are mixed. Real data applications to predict multiple health outcomes using throat and gut microbiome data also suggest better prediction by MKMR than by competing methods.
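The idea of turning several distance metrics into kernels and combining them with nonnegative weights can be sketched as follows. This is a minimal illustration, not the MKMR implementation: the Gower-centering transform, the fixed weights (MKMR learns them), the two toy distances, and the helper names `distance_to_kernel` and `combine_kernels` are all assumptions introduced here.

```python
import numpy as np

def distance_to_kernel(D):
    """Turn a distance matrix into a kernel by Gower centering, then clip
    negative eigenvalues so the result is positive semidefinite."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    K = -0.5 * J @ (D ** 2) @ J
    K = (K + K.T) / 2                        # remove numerical asymmetry
    vals, vecs = np.linalg.eigh(K)
    return (vecs * np.clip(vals, 0, None)) @ vecs.T

def combine_kernels(kernels, weights):
    """Conic combination: nonnegative weights that need not sum to one."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("conic combination requires nonnegative weights")
    return sum(w * K for w, K in zip(weights, kernels))

# Toy data: 6 samples x 4 taxa of random "abundances" (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((6, 4))
# One abundance-based distance and one presence/absence-based distance.
D_abund = np.linalg.norm(X[:, None] - X[None, :], axis=2)
P = (X > 0.5).astype(float)
D_presence = np.abs(P[:, None] - P[None, :]).sum(axis=2)

K = combine_kernels(
    [distance_to_kernel(D_abund), distance_to_kernel(D_presence)],
    weights=[0.7, 0.3],  # fixed here for illustration; a real method would learn these
)
```

Because each input kernel is positive semidefinite and the weights are nonnegative, the combined matrix is also positive semidefinite, so it can be passed to any kernel method. The relative size of the weights then indicates how much each signal type contributes.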
Affiliation(s)
- Bing Li
- Department of Biostatistics, School of Public Health, Brown University, Providence, Rhode Island, U.S.A
- Tian Wang
- Department of Biostatistics, Mailman School of Public Health, Columbia University, 722 West 168th Street, New York, New York, 10032 U.S.A
- Min Qian
- Department of Biostatistics, Mailman School of Public Health, Columbia University, 722 West 168th Street, New York, New York, 10032 U.S.A
- Shuang Wang
- Department of Biostatistics, School of Public Health, Brown University, Providence, Rhode Island, U.S.A
22
Syama K, Jothi JAA, Khanna N. Automatic disease prediction from human gut metagenomic data using boosting GraphSAGE. BMC Bioinformatics 2023; 24:126. [PMID: 37003965 PMCID: PMC10067187 DOI: 10.1186/s12859-023-05251-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2023] [Accepted: 03/23/2023] [Indexed: 04/03/2023] Open
Abstract
BACKGROUND The human microbiome plays a critical role in maintaining human health. Due to recent advances in high-throughput sequencing technologies, human microbiome profiles have become publicly available, and many studies have analyzed them. These studies have found that healthy and sick individuals carry different microbiome profiles across a range of diseases. Recently, several computational methods have utilized microbiome profiles to automatically diagnose and classify the host phenotype. RESULTS In this work, a novel deep learning framework based on boosting GraphSAGE is proposed for automatic prediction of diseases from metagenomic data. The proposed framework has two main components: (a) a Metagenomic Disease graph (MD-graph) construction module and (b) a Disease Prediction Network (DP-Net) module. The graph construction module constructs a graph by treating each metagenomic sample as a node. The graph captures the relationship between the samples using a proximity measure. The DP-Net consists of a boosting GraphSAGE model that predicts the status of a sample as sick or healthy. The effectiveness of the proposed method is verified using real and synthetic datasets corresponding to diseases such as inflammatory bowel disease and colorectal cancer. The proposed model achieved a highest AUC of 93%, accuracy of 95%, F1-score of 95%, and AUPRC of 95% for the real inflammatory bowel disease dataset, and a best AUC of 90%, accuracy of 91%, F1-score of 87%, and AUPRC of 93% for the real colorectal cancer dataset. CONCLUSION The proposed framework outperforms other machine learning and deep learning models in terms of classification accuracy, AUC, F1-score, and AUPRC for both synthetic and real metagenomic data.
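The graph construction step described above, where every sample is a node and edges reflect proximity between samples, could look roughly like the sketch below. The abstract does not name the proximity measure or the edge rule, so the Euclidean distance, the k-nearest-neighbour rule, the function name `knn_sample_graph`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def knn_sample_graph(X, k=2):
    """Symmetric k-nearest-neighbour adjacency matrix over samples.

    Each row of X is one sample; proximity here is Euclidean distance
    (an assumed choice, not one stated in the abstract).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)              # exclude self-loops
    nearest = np.argsort(D, axis=1)[:, :k]   # indices of the k closest samples
    A = np.zeros((n, n), dtype=int)
    A[np.repeat(np.arange(n), k), nearest.ravel()] = 1
    return np.maximum(A, A.T)                # make the graph undirected

# Toy profiles: two well-separated groups of samples (made-up numbers).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]])
A = knn_sample_graph(X, k=1)
```

The resulting adjacency matrix is the kind of input a graph neural network such as GraphSAGE aggregates over. Here the two groups end up in separate connected components.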
Affiliation(s)
- K Syama: Department of Computer Science, Birla Institute of Technology and Science Pilani Dubai Campus, Dubai International Academic City, Dubai, UAE
- J Angel Arul Jothi: Department of Computer Science, Birla Institute of Technology and Science Pilani Dubai Campus, Dubai International Academic City, Dubai, UAE
|
23
|
Busato S, Gordon M, Chaudhari M, Jensen I, Akyol T, Andersen S, Williams C. Compositionality, sparsity, spurious heterogeneity, and other data-driven challenges for machine learning algorithms within plant microbiome studies. CURRENT OPINION IN PLANT BIOLOGY 2023; 71:102326. [PMID: 36538837 PMCID: PMC9925409 DOI: 10.1016/j.pbi.2022.102326] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/31/2022] [Revised: 11/08/2022] [Accepted: 11/21/2022] [Indexed: 06/17/2023]
Abstract
The plant-associated microbiome is a key component of plant systems, contributing to their health, growth, and productivity. The application of machine learning (ML) in this field promises to help untangle the relationships involved. However, measurements of microbial communities by high-throughput sequencing pose challenges for ML. Noise from low sample sizes, soil heterogeneity, and technical factors can impact the performance of ML. Additionally, the compositional and sparse nature of these datasets can impact the predictive accuracy of ML. We review recent literature from plant studies to illustrate that these properties often go unmentioned. We expand our analysis to other fields to quantify the degree to which mitigation approaches improve the performance of ML and describe the mathematical basis for this. With the advent of accessible analytical packages for microbiome data including learning models, researchers must be familiar with the nature of their datasets.
Affiliation(s)
- Sebastiano Busato: Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA
- Max Gordon: Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA
- Meenal Chaudhari: Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA
- Ib Jensen: Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark
- Turgut Akyol: Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark
- Stig Andersen: Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark
- Cranos Williams: Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA; Department of Plant and Microbial Biology, North Carolina State University, Raleigh, USA
|
24
|
Microbiome Alterations in Alcohol Use Disorder and Alcoholic Liver Disease. Int J Mol Sci 2023; 24:ijms24032461. [PMID: 36768785 PMCID: PMC9916746 DOI: 10.3390/ijms24032461] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 11/20/2022] [Revised: 01/22/2023] [Accepted: 01/25/2023] [Indexed: 01/30/2023] Open
Abstract
Microbiome alterations are emerging as one of the most important factors that influence the course of alcohol use disorder (AUD). Recent advances in bioinformatics enable more robust and accurate characterization of changes in the composition of the microbiome. In this study, our objective was to provide the most comprehensive and up-to-date evaluation of microbiome alterations associated with AUD and alcoholic liver disease (ALD). To achieve this, we applied a consistent, state-of-the-art bioinformatic workflow to raw reads from multiple 16S rRNA sequencing datasets. The study population consisted of 122 patients with AUD, 75 with ALD, 54 with non-alcoholic liver diseases, and 260 healthy controls (HC). We found several microbiome alterations that were consistent across multiple datasets. The most consistent changes included a significantly lower abundance of multiple butyrate-producing families, including Ruminococcaceae, Lachnospiraceae, and Oscillospiraceae, in AUD compared with HC, and a further reduction of these families in ALD compared with AUD. Other important results include an increase in endotoxin-producing Proteobacteria in AUD, with the ALD group having the largest increase. All of these alterations can potentially contribute to the increased intestinal permeability and inflammation associated with AUD and ALD.
|
25
|
Shtossel O, Isakov H, Turjeman S, Koren O, Louzoun Y. Ordering taxa in image convolution networks improves microbiome-based machine learning accuracy. Gut Microbes 2023; 15:2224474. [PMID: 37345233 PMCID: PMC10288916 DOI: 10.1080/19490976.2023.2224474] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/15/2022] [Accepted: 06/08/2023] [Indexed: 06/23/2023] Open
Abstract
The human gut microbiome is associated with a large number of disease etiologies. As such, it is a natural candidate for machine-learning-based biomarker development for multiple diseases and conditions. The microbiome is often analyzed using 16S rRNA gene sequencing or shotgun metagenomics. However, several properties of microbial sequence-based studies hinder machine learning (ML), including non-uniform representation, a small number of samples compared with the dimension of each sample, and sparsity of the data, with the majority of taxa present in a small subset of samples. We show here using a graph representation that the cladogram structure is as informative as the taxa frequency. We then suggest a novel method to combine information from different taxa and improve data representation for ML using microbial taxonomy. iMic (image microbiome) translates the microbiome to images through an iterative ordering scheme, and applies convolutional neural networks to the resulting image. We show that iMic has a higher precision in static microbiome gene sequence-based ML than state-of-the-art methods. iMic also facilitates the interpretation of the classifiers: an explainable artificial intelligence (AI) algorithm is applied to iMic to detect taxa relevant to each condition. iMic is then extended to dynamic microbiome samples by translating them to movies.
Affiliation(s)
- Oshrit Shtossel: Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
- Haim Isakov: Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
- Sondra Turjeman: The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Omry Koren: The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Yoram Louzoun: Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
|
26
|
Shen WX, Liang SR, Jiang YY, Chen YZ. Enhanced metagenomic deep learning for disease prediction and consistent signature recognition by restructured microbiome 2D representations. PATTERNS (NEW YORK, N.Y.) 2022; 4:100658. [PMID: 36699735 PMCID: PMC9868677 DOI: 10.1016/j.patter.2022.100658] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/01/2022] [Revised: 07/15/2022] [Accepted: 11/15/2022] [Indexed: 12/23/2022]
Abstract
Metagenomic analysis has been explored for disease diagnosis and biomarker discovery. Low sample sizes, high dimensionality, and sparsity of metagenomic data challenge metagenomic investigations. Here, an unsupervised microbial embedding, grouping, and mapping algorithm (MEGMA) was developed to transform metagenomic data into individualized multichannel microbiome 2D representation by manifold learning and clustering of microbial profiles (e.g., composition, abundance, hierarchy, and taxonomy). These 2D representations enable enhanced disease prediction by established ConvNet-based AggMapNet models, outperforming the commonly used machine learning and deep learning models in metagenomic benchmark datasets. These 2D representations combined with AggMapNet explainable module robustly identified more reliable and replicable disease-prediction microbes (biomarkers). Employing the MEGMA-AggMapNet pipeline for biomarker identification from 5 disease datasets, 84% of the identified biomarkers have been described in over 74 distinct works as important for these diseases. Moreover, the method also discovered highly consistent sets of biomarkers in cross-cohort colorectal cancer (CRC) patients and microbial shifts in different CRC stages.
Affiliation(s)
- Wan Xiang Shen: The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China; Bioinformatics and Drug Design Group, Department of Pharmacy, and Center for Computational Science and Engineering, National University of Singapore, Singapore 117543, Singapore
- Shu Ran Liang: The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
- Yu Yang Jiang (corresponding author): The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
- Yu Zong Chen (corresponding author): The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China; Shenzhen Bay Laboratory, Shenzhen 518000, China
|
27
|
Li P, Luo H, Ji B, Nielsen J. Machine learning for data integration in human gut microbiome. Microb Cell Fact 2022; 21:241. [PMID: 36419034 PMCID: PMC9685977 DOI: 10.1186/s12934-022-01973-4] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 01/05/2022] [Accepted: 11/15/2022] [Indexed: 11/25/2022] Open
Abstract
Recent studies have demonstrated that gut microbiota plays critical roles in various human diseases. High-throughput technology has been widely applied to characterize the microbial ecosystems, which led to an explosion of different types of molecular profiling data, such as metagenomics, metatranscriptomics and metabolomics. For analysis of such data, machine learning algorithms have shown to be useful for identifying key molecular signatures, discovering potential patient stratifications, and particularly for generating models that can accurately predict phenotypes. In this review, we first discuss how dysbiosis of the intestinal microbiota is linked to human disease development and how potential modulation strategies of the gut microbial ecosystem can be used for disease treatment. In addition, we introduce categories and workflows of different machine learning approaches, and how they can be used to perform integrative analysis of multi-omics data. Finally, we review advances of machine learning in gut microbiome applications and discuss related challenges. Based on this we conclude that machine learning is very well suited for analysis of gut microbiome and that these approaches can be useful for development of gut microbe-targeted therapies, which ultimately can help in achieving personalized and precision medicine.
Affiliation(s)
- Peishun Li: Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Hao Luo: Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Boyang Ji: Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden; BioInnovation Institute, Ole Maaløes Vej 3, DK2200 Copenhagen, Denmark
- Jens Nielsen: Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden; BioInnovation Institute, Ole Maaløes Vej 3, DK2200 Copenhagen, Denmark
|
28
|
Hernández Medina R, Kutuzova S, Nielsen KN, Johansen J, Hansen LH, Nielsen M, Rasmussen S. Machine learning and deep learning applications in microbiome research. ISME COMMUNICATIONS 2022; 2:98. [PMID: 37938690 PMCID: PMC9723725 DOI: 10.1038/s43705-022-00182-9] [Citation(s) in RCA: 66] [Impact Index Per Article: 22.0] [Received: 03/24/2022] [Revised: 09/12/2022] [Accepted: 09/16/2022] [Indexed: 05/27/2023]
Abstract
The many microbial communities around us form interactive and dynamic ecosystems called microbiomes. Though concealed from the naked eye, microbiomes govern and influence macroscopic systems including human health, plant resilience, and biogeochemical cycling. Such feats have attracted interest from the scientific community, which has recently turned to machine learning and deep learning methods to interrogate the microbiome and elucidate the relationships between its composition and function. Here, we provide an overview of how the latest microbiome studies harness the inductive prowess of artificial intelligence methods. We start by highlighting that microbiome data - being compositional, sparse, and high-dimensional - necessitates special treatment. We then introduce traditional and novel methods and discuss their strengths and applications. Finally, we discuss the outlook of machine and deep learning pipelines, focusing on bottlenecks and considerations to address them.
Affiliation(s)
- Ricardo Hernández Medina: Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Svetlana Kutuzova: Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark; Department of Computer Science, University of Copenhagen, DK-2100, Copenhagen Ø, Denmark
- Knud Nor Nielsen: Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark; Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871, Frederiksberg, Denmark
- Joachim Johansen: Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Lars Hestbjerg Hansen: Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871, Frederiksberg, Denmark
- Mads Nielsen: Department of Computer Science, University of Copenhagen, DK-2100, Copenhagen Ø, Denmark
- Simon Rasmussen: Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
|
29
|
Li B, Zhong D, Qiao J, Jiang X. GNPI: Graph normalization to integrate phylogenetic information for metagenomic host phenotype prediction. Methods 2022; 205:11-17. [PMID: 35636652 DOI: 10.1016/j.ymeth.2022.05.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 05/17/2022] [Accepted: 05/26/2022] [Indexed: 11/24/2022] Open
Abstract
Microorganisms play important roles in our lives, especially in metabolism and disease. Determining the probability that a person suffers from a specific disease, and the severity of that disease, from microbial genes is crucial for understanding the relationship between microbes and diseases. Previous work could extract the topological information of phylogenetic trees and integrate it into metagenomic datasets, enabling classifiers to learn more from limited datasets and thus improving model performance. In this paper, we propose the GNPI model to better learn the structure of phylogenetic trees. GNPI maintains the original vector format of metagenomic datasets, whereas previous research had to change the input form to matrices. The vector-like form of the input data can be easily adopted in baseline machine learning models and is also usable by deep learning models. Datasets processed with GNPI help enhance the accuracy of machine learning and deep learning models on three different datasets. GNPI is an interpretable data processing method for host phenotype prediction and other bioinformatics tasks.
Affiliation(s)
- Bojing Li: Hubei Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China; School of Computer, Central China Normal University, Wuhan, China
- Duo Zhong: Hubei Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China; School of Computer, Central China Normal University, Wuhan, China
- Jimei Qiao: Mathematics and Science College, Shanghai Normal University, Shanghai, China
- Xingpeng Jiang: Hubei Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China; School of Computer, Central China Normal University, Wuhan, China; National Language Resources Monitoring & Research Center for Network Media, Central China Normal University, Wuhan, China
|
30
|
Mreyoud Y, Song M, Lim J, Ahn TH. MegaD: Deep Learning for Rapid and Accurate Disease Status Prediction of Metagenomic Samples. Life (Basel) 2022; 12:life12050669. [PMID: 35629336 PMCID: PMC9143510 DOI: 10.3390/life12050669] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/01/2022] [Revised: 04/25/2022] [Accepted: 04/26/2022] [Indexed: 12/23/2022] Open
Abstract
The diversity within different microbiome communities that drive biogeochemical processes influences many different phenotypes. Analyses of these communities and their diversity by countless microbiome projects have revealed an important role of metagenomics in understanding the complex relation between microbes and their environments. This relationship can be understood in the context of microbiome composition of specific known environments. These compositions can then be used as a template for predicting the status of similar environments. Machine learning has been applied as a key component to this predictive task. Several analysis tools have already been published utilizing machine learning methods for metagenomic analysis. Despite the previously proposed machine learning models, the performance of deep neural networks is still under-researched. Given the nature of metagenomic data, deep neural networks could provide a strong boost to growth in the prediction accuracy in metagenomic analysis applications. To meet this urgent demand, we present a deep learning based tool that utilizes a deep neural network implementation for phenotypic prediction of unknown metagenomic samples. (1) First, our tool takes as input taxonomic profiles from 16S or WGS sequencing data. (2) Second, given the samples, our tool builds a model based on a deep neural network by computing multi-level classification. (3) Lastly, given the model, our tool classifies an unknown sample with its unlabeled taxonomic profile. In the benchmark experiments, we deduced that an analysis method facilitating a deep neural network such as our tool can show promising results in increasing the prediction accuracy on several samples compared to other machine learning models.
Affiliation(s)
- Yassin Mreyoud: Program in Bioinformatics and Computational Biology, Saint Louis University, Saint Louis, MO 63104, USA
- Myoungkyu Song: Department of Computer Science, University of Nebraska Omaha, Omaha, NE 68182, USA
- Jihun Lim: Saint Paul Preparatory, Seoul 06593, Korea
- Tae-Hyuk Ahn: Program in Bioinformatics and Computational Biology, Saint Louis University, Saint Louis, MO 63104, USA; Department of Computer Science, Saint Louis University, Saint Louis, MO 63104, USA
|
31
|
Chen X, Zhu Z, Zhang W, Wang Y, Wang F, Yang J, Wong KC. Human disease prediction from microbiome data by multiple feature fusion and deep learning. iScience 2022; 25:104081. [PMID: 35372808 PMCID: PMC8971930 DOI: 10.1016/j.isci.2022.104081] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/11/2021] [Revised: 09/16/2021] [Accepted: 03/13/2022] [Indexed: 10/29/2022] Open
Abstract
Human disease prediction from microbiome data has broad implications in metagenomics. It is rare for the existing methods to consider abundance profiles from both known and unknown microbial organisms, or capture the taxonomic relationships among microbial taxa, leading to significant information loss. On the other hand, deep learning has shown unprecedented advantages in classification tasks for its feature-learning ability. However, it encounters the opposite situation in metagenome-based disease prediction since high-dimensional low-sample-size metagenomic datasets can lead to severe overfitting; and black-box model fails in providing biological explanations. To circumvent the related problems, we developed MetaDR, a comprehensive machine learning-based framework that integrates various information and deep learning to predict human diseases. Experimental results indicate that MetaDR achieves competitive prediction performance with a reduction in running time, and effectively discovers the informative features with biological insights.
Affiliation(s)
- Xingjian Chen: Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Zifan Zhu: Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
- Weitong Zhang: Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Yuchen Wang: Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Fuzhou Wang: Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Jianyi Yang: School of Mathematical Sciences, Nankai University, Tianjin, China
- Ka-Chun Wong: Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR; Hong Kong Institute for Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
|
32
|
Grazioli F, Siarheyeu R, Alqassem I, Henschel A, Pileggi G, Meiser A. Microbiome-based disease prediction with multimodal variational information bottlenecks. PLoS Comput Biol 2022; 18:e1010050. [PMID: 35404958 PMCID: PMC9022840 DOI: 10.1371/journal.pcbi.1010050] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 06/02/2021] [Revised: 04/21/2022] [Accepted: 03/22/2022] [Indexed: 01/12/2023] Open
Abstract
Scientific research is shedding light on the interaction of the gut microbiome with the human host and on its role in human health. Existing machine learning methods have shown great potential in discriminating healthy from diseased microbiome states. Most of them leverage shotgun metagenomic sequencing to extract gut microbial species-relative abundances or strain-level markers. Each of these gut microbial profiling modalities showed diagnostic potential when tested separately; however, no existing approach combines them in a single predictive framework. Here, we propose the Multimodal Variational Information Bottleneck (MVIB), a novel deep learning model capable of learning a joint representation of multiple heterogeneous data modalities. MVIB achieves competitive classification performance while being faster than existing methods. Additionally, MVIB offers interpretable results. Our model adopts an information theoretic interpretation of deep neural networks and computes a joint stochastic encoding of different input data modalities. We use MVIB to predict whether human hosts are affected by a certain disease by jointly analysing gut microbial species-relative abundances and strain-level markers. MVIB is evaluated on human gut metagenomic samples from 11 publicly available disease cohorts covering 6 different diseases. We achieve high performance (0.80 < ROC AUC < 0.95) on 5 cohorts and at least medium performance on the remaining ones. We adopt a saliency technique to interpret the output of MVIB and identify the most relevant microbial species and strain-level markers to the model’s predictions. We also perform cross-study generalisation experiments, where we train and test MVIB on different cohorts of the same disease, and overall we achieve comparable results to the baseline approach, i.e. the Random Forest. Further, we evaluate our model by adding metabolomic data derived from mass spectrometry as a third input modality. Our method is scalable with respect to input data modalities and has an average training time of < 1.4 seconds. The source code and the datasets used in this work are publicly available.
The gut microbiome can be an indicator of various diseases due to its interaction with the human system. Our main objective is to improve on the current state of the art in microbiome classification for diagnostic purposes. A rich body of literature evidences the clinical value of microbiome predictive models. Here, we propose the Multimodal Variational Information Bottleneck (MVIB), a novel deep learning model for microbiome-based disease prediction. MVIB learns a joint stochastic encoding of different input data modalities to predict the output class. We use MVIB to predict whether human hosts are affected by a certain disease by jointly analysing gut microbial species-relative abundance and strain-level marker profiles. Both of these gut microbial features showed diagnostic potential when tested separately in previous studies; however, no research has combined them in a single predictive tool. We evaluate MVIB on various human gut metagenomic samples from 11 publicly available disease cohorts. MVIB achieves competitive performance compared to state-of-the-art methods. Additionally, we evaluate our model by adding metabolomic data as a third input modality and we show that MVIB is scalable with respect to input feature modalities. Further, we adopt a saliency technique to interpret the output of MVIB and identify the most relevant microbial species and strain-level markers to our model predictions.
Affiliation(s)
- Andreas Henschel: Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, UAE; Research and Data Intelligence Support Center, Khalifa University, Abu Dhabi, UAE
|
33
|
Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa. PLoS Comput Biol 2022; 18:e1010066. [PMID: 35446845 PMCID: PMC9064115 DOI: 10.1371/journal.pcbi.1010066] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/13/2021] [Revised: 05/03/2022] [Accepted: 03/29/2022] [Indexed: 12/14/2022] Open
Abstract
Machine learning-based classification approaches are widely used to predict host phenotypes from microbiome data. Classifiers are typically employed by considering operational taxonomic units or relative abundance profiles as input features. Such types of data are intrinsically sparse, which opens the opportunity to make predictions from the presence/absence rather than the relative abundance of microbial taxa. This also poses the question whether it is the presence rather than the abundance of particular taxa to be relevant for discrimination purposes, an aspect that has been so far overlooked in the literature. In this paper, we aim at filling this gap by performing a meta-analysis on 4,128 publicly available metagenomes associated with multiple case-control studies. At species-level taxonomic resolution, we show that it is the presence rather than the relative abundance of specific microbial taxa to be important when building classification models. Such findings are robust to the choice of the classifier and confirmed by statistical tests applied to identifying differentially abundant/present taxa. Results are further confirmed at coarser taxonomic resolutions and validated on 4,026 additional 16S rRNA samples coming from 30 public case-control studies.
The composition of the human microbiome has been linked to a large number of different diseases. In this context, classification methodologies based on machine learning approaches have represented a promising tool for diagnostic purposes from metagenomics data. The link between microbial population composition and host phenotypes has been usually performed by considering taxonomic profiles represented by relative abundances of microbial species. In this study, we show that it is more the presence rather than the relative abundance of microbial taxa to be relevant to maximize classification accuracy. This is accomplished by conducting a meta-analysis on more than 4,000 shotgun metagenomes coming from 25 case-control studies and in which original relative abundance data are degraded to presence/absence profiles. Findings are also extended to 16S rRNA data and advance the research field in building prediction models directly from human microbiome data.
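Degrading relative abundances to presence/absence profiles, as this abstract describes, amounts to thresholding each taxon's abundance. The sketch below is illustrative only: the function name `to_presence_absence`, the default threshold of zero, and the toy taxa values are assumptions, and the study's actual detection threshold is not stated in the abstract.

```python
from typing import Dict


def to_presence_absence(
    profile: Dict[str, float], threshold: float = 0.0
) -> Dict[str, int]:
    """Map each taxon's relative abundance to 1 (present) or 0 (absent).

    A taxon counts as present when its abundance is strictly above
    `threshold` (zero here, which is an assumed cut-off).
    """
    return {taxon: int(abundance > threshold) for taxon, abundance in profile.items()}


# Toy species-level relative abundances for one sample (made-up values).
sample = {
    "Bacteroides_fragilis": 0.42,
    "Escherichia_coli": 0.0,
    "Faecalibacterium_prausnitzii": 0.003,
}
binary = to_presence_absence(sample)
```

The resulting 0/1 vectors can be fed to the same classifiers as the original abundance vectors, which is what makes the two representations directly comparable.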
|
34
|
Ratiner K, Abdeen SK, Goldenberg K, Elinav E. Utilization of Host and Microbiome Features in Determination of Biological Aging. Microorganisms 2022; 10:668. [PMID: 35336242 PMCID: PMC8950177 DOI: 10.3390/microorganisms10030668] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 01/17/2022] [Revised: 03/08/2022] [Accepted: 03/18/2022] [Indexed: 12/13/2022] Open
Abstract
The term 'old age' generally refers to a period characterized by profound changes in human physiological functions and susceptibility to disease that accompanies the final years of a person's life. Despite the conventional definition of old age as exceeding the age of 65 years old, quantifying aging as a function of life years does not necessarily reflect how the human body ages. In contrast, characterizing biological (or physiological) aging based on functional parameters may better reflect a person's temporal physiological status and associated disease susceptibility state. As such, differentiating 'chronological aging' from 'biological aging' holds the key to identifying individuals featuring accelerated aging processes despite having a young chronological age and stratifying them to tailored surveillance, diagnosis, prevention, and treatment. Emerging evidence suggests that the gut microbiome changes along with physiological aging and may play a pivotal role in a variety of age-related diseases, in a manner that does not necessarily correlate with chronological age. Harnessing of individualized gut microbiome data and integration of host and microbiome parameters using artificial intelligence and machine learning pipelines may enable us to more accurately define aging clocks. Such holobiont-based estimates of a person's physiological age may facilitate prediction of age-related physiological status and risk of development of age-associated diseases.
Affiliation(s)
- Karina Ratiner
  - Immunology Department, Weizmann Institute of Science, 234 Herzl Street, Rehovot 7610001, Israel
- Suhaib K. Abdeen
  - Immunology Department, Weizmann Institute of Science, 234 Herzl Street, Rehovot 7610001, Israel
- Kim Goldenberg
  - Immunology Department, Weizmann Institute of Science, 234 Herzl Street, Rehovot 7610001, Israel
- Eran Elinav
  - Immunology Department, Weizmann Institute of Science, 234 Herzl Street, Rehovot 7610001, Israel
  - Division of Cancer-Microbiome Research, Deutsches Krebsforschungszentrum (DKFZ), Neuenheimer Feld 280, 69120 Heidelberg, Germany
35
Ling W, Qi Y, Hua X, Wu MC. Deep ensemble learning over the microbial phylogenetic tree (DeepEn-Phy). Proceedings. IEEE International Conference on Bioinformatics and Biomedicine 2021; 2021:470-477. [PMID: 36704639 PMCID: PMC9875567 DOI: 10.1109/bibm52615.2021.9669654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
Successful prediction of clinical outcomes facilitates tailored diagnosis and treatment. The microbiome has been shown to be an important biomarker to predict host clinical outcomes. Further, the incorporation of microbial phylogeny, the evolutionary relationship among microbes, has been demonstrated to improve prediction accuracy. We propose a phylogeny-driven deep neural network (PhyNN) and develop an ensemble method, DeepEn-Phy, for host clinical outcome prediction. The method is designed to optimally extract features from phylogeny, thereby taking full advantage of the information in phylogeny while harnessing the core principles of phylogeny (in contrast to taxonomy). We apply DeepEn-Phy to a large real microbiome data set to predict both categorical and continuous clinical outcomes. DeepEn-Phy demonstrates superior prediction performance to existing machine learning and deep learning approaches. Overall, DeepEn-Phy provides a new strategy for designing deep neural network architectures within the context of phylogeny-constrained microbiome data.
Affiliation(s)
- Wodan Ling
  - Fred Hutchinson Cancer Research Center, Seattle, USA
- Xing Hua
  - Fred Hutchinson Cancer Research Center, Seattle, USA
- Michael C. Wu
  - Fred Hutchinson Cancer Research Center, Seattle, USA
36
Deng Z, Zhang J, Li J, Zhang X. Application of Deep Learning in Plant-Microbiota Association Analysis. Front Genet 2021; 12:697090. [PMID: 34691142 PMCID: PMC8531731 DOI: 10.3389/fgene.2021.697090] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 04/19/2021] [Accepted: 08/31/2021] [Indexed: 01/04/2023] Open
Abstract
Unraveling the association between the microbiome and plant phenotype can illustrate the effect of the microbiome on its host and thereby guide agricultural management. Adequate identification of species and appropriate choice of models are two challenges in microbiome data analysis. Computational models of microbiome data can help in association analysis between the microbiome and the plant host. Deep learning methods have been widely used to learn from microbiome data because of their strength in handling complex, sparse, noisy, and high-dimensional data. Here, we review the analytic strategies in microbiome data analysis and describe the applications of deep learning models for plant–microbiome correlation studies. We also introduce application cases of different models in plant–microbiome correlation analysis and discuss how to adapt the models to the critical steps in data processing. In terms of data processing, model structure, and operating principle, most deep learning models are suitable for plant microbiome data analysis. The ability to represent features and recognize patterns is the advantage of deep learning methods in modeling and interpretation for association analysis. Based on published computational experiments, convolutional neural networks and graph neural networks can be recommended for plant microbiome analysis.
Affiliation(s)
- Zhiyu Deng
  - Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China; Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China; University of Chinese Academy of Sciences, Beijing, China
- Jinming Zhang
  - Department of Infectious Diseases, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China
- Junya Li
  - Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China; Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China; University of Chinese Academy of Sciences, Beijing, China
- Xiujun Zhang
  - Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China; Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China
37
Zhao Z, Woloszynek S, Agbavor F, Mell JC, Sokhansanj BA, Rosen GL. Learning, visualizing and exploring 16S rRNA structure using an attention-based deep neural network. PLoS Comput Biol 2021; 17:e1009345. [PMID: 34550967 PMCID: PMC8496832 DOI: 10.1371/journal.pcbi.1009345] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 09/14/2020] [Revised: 10/07/2021] [Accepted: 08/12/2021] [Indexed: 01/04/2023] Open
Abstract
Recurrent neural networks with memory and attention mechanisms are widely used in natural language processing because they can capture short and long term sequential information for diverse tasks. We propose an integrated deep learning model for microbial DNA sequence data, which exploits convolutional neural networks, recurrent neural networks, and attention mechanisms to predict taxonomic classifications and sample-associated attributes, such as the relationship between the microbiome and host phenotype, on the read/sequence level. In this paper, we develop this novel deep learning approach and evaluate its application to amplicon sequences. We apply our approach to short DNA reads and full sequences of 16S ribosomal RNA (rRNA) marker genes, which identify the heterogeneity of a microbial community sample. We demonstrate that our implementation of a novel attention-based deep network architecture, Read2Pheno, achieves read-level phenotypic prediction. Training Read2Pheno models will encode sequences (reads) into dense, meaningful representations: learned embedded vectors output from the intermediate layer of the network model, which can provide biological insight when visualized. The attention layer of Read2Pheno models can also automatically identify nucleotide regions in reads/sequences which are particularly informative for classification. As such, this novel approach can avoid pre/post-processing and manual interpretation required with conventional approaches to microbiome sequence classification. We further show, as proof-of-concept, that aggregating read-level information can robustly predict microbial community properties, host phenotype, and taxonomic classification, with performance at least comparable to conventional approaches. An implementation of the attention-based deep learning network is available at https://github.com/EESI/sequence_attention (a python package) and https://github.com/EESI/seq2att (a command line tool).
Affiliation(s)
- Zhengqiao Zhao
  - Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
- Stephen Woloszynek
  - Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America
  - Harvard Medical School, Boston, Massachusetts, United States of America
- Felix Agbavor
  - School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, Pennsylvania, United States of America
- Joshua Chang Mell
  - College of Medicine, Drexel University, Philadelphia, Pennsylvania, United States of America
- Bahrad A. Sokhansanj
  - Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
- Gail L. Rosen
  - Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
38
Zhang W, Chen X, Wong KC. Noninvasive early diagnosis of intestinal diseases based on artificial intelligence in genomics and microbiome. J Gastroenterol Hepatol 2021; 36:823-831. [PMID: 33880763 DOI: 10.1111/jgh.15500] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/10/2021] [Revised: 03/15/2021] [Accepted: 03/17/2021] [Indexed: 12/15/2022]
Abstract
The maturing development in artificial intelligence (AI) and genomics has propelled advances in the study of intestinal diseases, including intestinal cancer, inflammatory bowel disease (IBD), and irritable bowel syndrome (IBS). Colorectal cancer is the second most deadly and the third most common type of cancer in the world according to GLOBOCAN 2020 data, while the mechanisms behind IBD and IBS are still speculative. The conventional methods to identify colorectal cancer, IBD, and IBS are based on endoscopy or colonoscopy to identify lesions. However, these procedures are invasive, demanding, and time-consuming for early-stage intestinal diseases. To address those problems, new strategies based on blood and/or the human microbiome in the gut, colon, or even feces have been developed; those methods take advantage of high-throughput sequencing and machine learning approaches. In this review, we summarize the recent research and methods to diagnose intestinal diseases with machine learning technologies based on cell-free DNA and microbiome data generated by amplicon sequencing or whole-genome sequencing. Those methods will play an important role in not only intestinal disease diagnosis but also therapy development in the near future.
Affiliation(s)
- Weitong Zhang
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Xingjian Chen
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR; Hong Kong Institute for Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
39
Moreno-Indias I, Lahti L, Nedyalkova M, Elbere I, Roshchupkin G, Adilovic M, Aydemir O, Bakir-Gungor B, Santa Pau ECD, D’Elia D, Desai MS, Falquet L, Gundogdu A, Hron K, Klammsteiner T, Lopes MB, Marcos-Zambrano LJ, Marques C, Mason M, May P, Pašić L, Pio G, Pongor S, Promponas VJ, Przymus P, Saez-Rodriguez J, Sampri A, Shigdel R, Stres B, Suharoschi R, Truu J, Truică CO, Vilne B, Vlachakis D, Yilmaz E, Zeller G, Zomer AL, Gómez-Cabrero D, Claesson MJ. Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions. Front Microbiol 2021; 12:635781. [PMID: 33692771 PMCID: PMC7937616 DOI: 10.3389/fmicb.2021.635781] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 11/30/2020] [Accepted: 01/28/2021] [Indexed: 12/23/2022] Open
Abstract
The human microbiome has emerged as a central research topic in human biology and biomedicine. Current microbiome studies generate high-throughput omics data across different body sites, populations, and life stages. While many of the challenges in microbiome research are similar to those of other high-throughput studies, the quantitative analyses need to address the heterogeneity of data, specific statistical properties, and the remarkable variation in microbiome composition across individuals and body sites. This has led to a broad spectrum of statistical and machine learning challenges that range from study design, data processing, and standardization to analysis, modeling, cross-study comparison, prediction, data science ecosystems, and reproducible reporting. Nevertheless, although many statistics and machine learning approaches and tools have been developed, new techniques are needed to deal with emerging applications and the vast heterogeneity of microbiome data. We review and discuss emerging applications of statistical and machine learning techniques in human microbiome studies and introduce the COST Action CA18131 "ML4Microbiome" that brings together microbiome researchers and machine learning experts to address current challenges such as standardization of analysis pipelines for reproducibility of data analysis results, benchmarking, improvement, or development of existing and new tools and ontologies.
Affiliation(s)
- Isabel Moreno-Indias
  - Instituto de Investigación Biomédica de Málaga (IBIMA), Unidad de Gestión Clínica de Endocrinología y Nutrición, Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
  - Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
- Leo Lahti
  - Department of Computing, University of Turku, Turku, Finland
- Miroslava Nedyalkova
  - Human Genetics and Disease Mechanisms, Latvian Biomedical Research and Study Centre, Riga, Latvia
- Ilze Elbere
  - Latvian Biomedical Research and Study Centre, Riga, Latvia
- Muhamed Adilovic
  - Department of Genetics and Bioengineering, International University of Sarajevo, Sarajevo, Bosnia and Herzegovina
- Onder Aydemir
  - Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Turkey
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Domenica D’Elia
  - Department for Biomedical Sciences, Institute for Biomedical Technologies, National Research Council, Bari, Italy
- Mahesh S. Desai
  - Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
  - Odense Research Center for Anaphylaxis, Department of Dermatology and Allergy Center, Odense University Hospital, University of Southern Denmark, Odense, Denmark
- Laurent Falquet
  - Department of Biology, University of Fribourg, Fribourg, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Aycan Gundogdu
  - Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Turkey
  - Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Turkey
- Karel Hron
  - Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
- Marta B. Lopes
  - NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
  - Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Cláudia Marques
  - CINTESIS, NOVA Medical School, NMS, Universidade Nova de Lisboa, Lisbon, Portugal
- Michael Mason
  - Computational Oncology, Sage Bionetworks, Seattle, WA, United States
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Lejla Pašić
  - Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Gianvito Pio
  - Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Sándor Pongor
  - Faculty of Information Technology and Bionics, Pázmány University, Budapest, Hungary
- Vasilis J. Promponas
  - Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus
- Piotr Przymus
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Julio Saez-Rodriguez
  - Institute of Computational Biomedicine, Heidelberg University, Faculty of Medicine and Heidelberg University Hospital, Heidelberg, Germany
- Alexia Sampri
  - Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Blaz Stres
  - Jozef Stefan Institute, Ljubljana, Slovenia
  - Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
  - Faculty of Civil and Geodetic Engineering, University of Ljubljana, Ljubljana, Slovenia
- Ramona Suharoschi
  - Molecular Nutrition and Proteomics Lab, Faculty of the Food Science and Technology, Institute of Life Sciences, University of Agricultural Sciences and Veterinary Medicine of Cluj-Napoca, Cluj-Napoca, Romania
- Jaak Truu
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Ciprian-Octavian Truică
  - Department of Computer Science and Engineering, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Bucharest, Romania
- Baiba Vilne
  - Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
- Dimitrios Vlachakis
  - Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
- Ercument Yilmaz
  - Department of Computer Technologies, Karadeniz Technical University, Trabzon, Turkey
- Georg Zeller
  - European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg, Germany
- Aldert L. Zomer
  - Department of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, Netherlands
- David Gómez-Cabrero
  - Navarrabiomed, Complejo Hospitalario de Navarra (CHN), IdiSNA, Universidad Pública de Navarra (UPNA), Pamplona, Spain
- Marcus J. Claesson
  - School of Microbiology and APC Microbiome Ireland, University College Cork, Cork, Ireland