1
Fernández-Edreira D, Liñares-Blanco J, V.-del-Río P, Fernandez-Lozano C. VIBES: A consensus subtyping of the vaginal microbiota reveals novel classification criteria. Comput Struct Biotechnol J 2024; 23:148-156. [PMID: 38144944 PMCID: PMC10749217 DOI: 10.1016/j.csbj.2023.11.050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 11/16/2023] [Accepted: 11/27/2023] [Indexed: 12/26/2023] Open
Abstract
This study aimed to develop a robust classification scheme for stratifying patients based on the vaginal microbiome. By employing consensus clustering analysis, we identified four distinct clusters using a cohort that includes individuals diagnosed with Bacterial Vaginosis (BV) as well as control participants, each characterized by unique patterns of microbiome species abundances. Notably, the consistent distribution of these clusters was observed across multiple external cohorts, such as SRA022855, SRA051298, PRJNA208535, PRJNA797778, and PRJNA302078 obtained from public repositories, demonstrating the generalizability of our findings. We further trained an elastic net model to predict these clusters, and its performance was evaluated in various external cohorts. Moreover, we developed VIBES, a user-friendly R package that encapsulates the model for convenient implementation and enables easy predictions on new data. Remarkably, we explored the applicability of this new classification scheme in providing valuable insights into disease progression, treatment response, and potential clinical outcomes in BV patients. Specifically, we demonstrated that the combined output of VIBES and VALENCIA scores could effectively predict the response to metronidazole antibiotic treatment in BV patients. Therefore, this study's outcomes contribute to our understanding of BV heterogeneity and lay the groundwork for personalized approaches to BV management and treatment selection.
Affiliation(s)
- Diego Fernández-Edreira
- Department of Computer Science and Information Technologies, Faculty of Computer Science, CITIC-Research Center of Information and Communication Technologies, Universidade da Coruña, A Coruña, Spain
- Patricia V.-del-Río
- Servicio de Ginecología, Hospital Universitario Lucus Augusti (HULA). Servizo Galego de Saúde (SERGAS), Spain
- Carlos Fernandez-Lozano
- Department of Computer Science and Information Technologies, Faculty of Computer Science, CITIC-Research Center of Information and Communication Technologies, Universidade da Coruña, A Coruña, Spain
2
Gonia S, Heisel T, Miller N, Haapala J, Harnack L, Georgieff MK, Fields DA, Knights D, Jacobs K, Seburg E, Demerath EW, Gale CA, Swanson MH. Maternal oral probiotic use is associated with decreased breastmilk inflammatory markers, infant fecal microbiome variation, and altered recognition memory responses in infants-a pilot observational study. Front Nutr 2024; 11:1456111. [PMID: 39385777 PMCID: PMC11462058 DOI: 10.3389/fnut.2024.1456111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Accepted: 09/02/2024] [Indexed: 10/12/2024] Open
Abstract
Introduction: Early life gut microbiomes are important for brain and immune system development in animal models. Probiotic use has been proposed as a strategy to promote health via modulation of microbiomes. In this observational study, we explore whether early life exposure to probiotics via the mother during pregnancy and lactation is associated with decreased inflammation in breastmilk, maternal and infant microbiome variation, and altered infant neurodevelopmental features.
Methods: Exclusively breastfeeding mother-infant dyads were recruited as part of the "Mothers and Infants Linked for Healthy Growth (MILk) Study." Probiotic comparison groups were defined by exposure to maternal probiotics (NO/YES) and by timing of probiotic exposure (prenatal, postnatal, total). C-reactive protein (CRP) and IL-6 levels were determined in breastmilk by immunoassays, and microbiomes were characterized from 1-month milk and from 1- and 6-month infant feces by 16S rDNA sequencing. Infant brain function was profiled via electroencephalogram (EEG); we assessed recognition memory using event-related potential (ERP) responses to familiar and novel auditory (1 month) and visual (6 months) stimuli. Statistical comparisons of study outcomes between probiotic groups were performed using permutational analysis of variance (PERMANOVA) (microbiome) and linear models (all other study outcomes), including relevant covariables as indicated.
Results: We observed associations between probiotic exposure and lower breastmilk CRP and IL-6 levels, and infant gut microbiome variation at 1 and 6 months of age (including higher abundances of Bifidobacteria and Lactobacillus). In addition, maternal probiotic exposure was associated with differences in infant ERP features at 6 months of age. Specifically, infants who were exposed to postnatal maternal probiotics (between the 1- and 6-month study visits) via breastfeeding/breastmilk had larger differential responses between familiar and novel visual stimuli with respect to the late slow wave component of the EEG, which may indicate greater memory updating potential. The milk of mothers of this subgroup of infants had lower IL-6 levels, and the infants had different 6-month fecal microbiomes compared to those in the "NO" maternal probiotics group.
Discussion: These results support continued research into "Microbiota-Gut-Brain" connections during early life and the role of pre- and postnatal probiotics in mothers to promote healthy microbiome-associated outcomes in infants.
Affiliation(s)
- Sara Gonia
- Department of Pediatrics, University of Minnesota, Minneapolis, MN, United States
- Timothy Heisel
- Department of Pediatrics, University of Minnesota, Minneapolis, MN, United States
- Neely Miller
- Department of Psychology, University of Minnesota, Minneapolis, MN, United States
- Jacob Haapala
- Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, MN, United States
- Lisa Harnack
- Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, MN, United States
- Michael K. Georgieff
- Department of Pediatrics, University of Minnesota, Minneapolis, MN, United States
- David A. Fields
- Harold Hamm Diabetes Center, University of Oklahoma Health Sciences Center, Oklahoma City, OK, United States
- Dan Knights
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, United States
- Katherine Jacobs
- Division of Maternal-Fetal Medicine, University of Minnesota, Minneapolis, MN, United States
- Elisabeth Seburg
- Pregnancy and Child Health Research Center, HealthPartners Institute, Bloomington, MN, United States
- Ellen W. Demerath
- Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, MN, United States
- Cheryl A. Gale
- Department of Pediatrics, University of Minnesota, Minneapolis, MN, United States
- Marie H. Swanson
- Department of Pediatrics, University of Minnesota, Minneapolis, MN, United States
3
Maaskant A, Voermans B, Levin E, de Goffau MC, Plomp N, Schuren F, Remarque EJ, Smits A, Langermans JAM, Bakker J, Montijn R. Microbiome signature suggestive of lactose-intolerance in rhesus macaques (Macaca mulatta) with intermittent chronic diarrhea. Anim Microbiome 2024; 6:53. [PMID: 39313845 PMCID: PMC11421201 DOI: 10.1186/s42523-024-00338-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Accepted: 09/06/2024] [Indexed: 09/25/2024] Open
Abstract
BACKGROUND: Chronic diarrhea is a common cause of mortality and morbidity in captive rhesus macaques (Macaca mulatta). The exact etiology of chronic diarrhea in macaques remains unidentified. The occurrence of diarrhea is frequently linked to dysbiosis within the gut microbiome. Research into microbiome signatures correlated with diarrhea in macaques has predominantly been conducted with single sample collections. Our analysis was based on the metagenomic composition of longitudinally acquired fecal samples from rhesus macaques with chronic diarrhea and clinically healthy rhesus macaques that were obtained over the course of two years. We aimed to investigate potential relationships between the macaque gut microbiome, the presence of diarrhea, and diet interventions with a selection of commercially available monkey diets.
RESULTS: The microbiome signature of macaques with intermittent chronic diarrhea showed a significant increase in lactate-producing bacteria (e.g., lactobacilli) and an increase in fermenters of lactate and succinate. Strikingly, two lactose-free diets were associated with a lower incidence of diarrhea.
CONCLUSION: A lactose intolerance mechanism is suggested in these animals: the bloom of Lactobacillus in the presence of lactose results in an overproduction of intermediate fermentation products, which likely led to osmotically induced diarrhea. This study provides new insights into a suspected microbiome-lactose intolerance relationship in rhesus macaques with intermittent chronic diarrhea. The integration of machine learning with metagenomic data analysis holds potential for developing targeted dietary interventions and therapeutic strategies, thereby ensuring a healthier and more resilient primate population.
Affiliation(s)
- Annemiek Maaskant
- Biomedical Primate Research Centre, Lange Kleiweg 161, 2288 GJ, Rijswijk, The Netherlands
- Department Population Health Sciences, Animals in Science and Society, Faculty of Veterinary Medicine, Utrecht University, Heidelberglaan 8, 3584 CM, Utrecht, The Netherlands
- Bas Voermans
- HORAIZON Technology BV, Marshallaan 2, 2625 GZ, Delft, The Netherlands
- Department of Vascular Medicine, Amsterdam UMC, Meibergdreef 9, 1105 AZ, Amsterdam, The Netherlands
- Evgeni Levin
- HORAIZON Technology BV, Marshallaan 2, 2625 GZ, Delft, The Netherlands
- Marcus C de Goffau
- HORAIZON Technology BV, Marshallaan 2, 2625 GZ, Delft, The Netherlands
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Nicole Plomp
- Department of Microbiology and Systems Biology, Organization for Applied Scientific Research (TNO), Sylviusweg 71, 2333 BE, Leiden, The Netherlands
- Frank Schuren
- Department of Microbiology and Systems Biology, Organization for Applied Scientific Research (TNO), Sylviusweg 71, 2333 BE, Leiden, The Netherlands
- Edmond J Remarque
- Biomedical Primate Research Centre, Lange Kleiweg 161, 2288 GJ, Rijswijk, The Netherlands
- Antoine Smits
- Biomedical Primate Research Centre, Lange Kleiweg 161, 2288 GJ, Rijswijk, The Netherlands
- Jan A M Langermans
- Biomedical Primate Research Centre, Lange Kleiweg 161, 2288 GJ, Rijswijk, The Netherlands
- Department Population Health Sciences, Animals in Science and Society, Faculty of Veterinary Medicine, Utrecht University, Heidelberglaan 8, 3584 CM, Utrecht, The Netherlands
- Jaco Bakker
- Biomedical Primate Research Centre, Lange Kleiweg 161, 2288 GJ, Rijswijk, The Netherlands
- Roy Montijn
- Department of Microbiology and Systems Biology, Organization for Applied Scientific Research (TNO), Sylviusweg 71, 2333 BE, Leiden, The Netherlands
4
Gorman ED, Lladser ME. Interpretable metric learning in comparative metagenomics: The adaptive Haar-like distance. PLoS Comput Biol 2024; 20:e1011543. [PMID: 38768195 PMCID: PMC11142682 DOI: 10.1371/journal.pcbi.1011543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Revised: 05/31/2024] [Accepted: 04/25/2024] [Indexed: 05/22/2024] Open
Abstract
Random forests have emerged as a promising tool in comparative metagenomics because they can predict environmental characteristics based on microbial composition in datasets where β-diversity metrics fall short of revealing meaningful relationships between samples. Nevertheless, despite this efficacy, they lack biological insight in tandem with their predictions, potentially hindering scientific advancement. To overcome this limitation, we leverage a geometric characterization of random forests to introduce a data-driven phylogenetic β-diversity metric, the adaptive Haar-like distance. This new metric assigns a weight to each internal node (i.e., split or bifurcation) of a reference phylogeny, indicating the relative importance of that node in discerning environmental samples based on their microbial composition. Alongside this, a weighted nearest-neighbors classifier, constructed using the adaptive metric, can be used as a proxy for the random forest while maintaining accuracy on par with that of the original forest and another state-of-the-art classifier, CoDaCoRe. As shown in datasets from diverse microbial environments, however, the new metric and classifier significantly enhance the biological interpretability and visualization of high-dimensional metagenomic samples.
Affiliation(s)
- Evan D. Gorman
- Department of Applied Mathematics, University of Colorado, Boulder, Colorado, United States of America
- Manuel E. Lladser
- Department of Applied Mathematics, University of Colorado, Boulder, Colorado, United States of America
5
Quinn TP, Hess JL, Marshe VS, Barnett MM, Hauschild AC, Maciukiewicz M, Elsheikh SSM, Men X, Schwarz E, Trakadis YJ, Breen MS, Barnett EJ, Zhang-James Y, Ahsen ME, Cao H, Chen J, Hou J, Salekin A, Lin PI, Nicodemus KK, Meyer-Lindenberg A, Bichindaritz I, Faraone SV, Cairns MJ, Pandey G, Müller DJ, Glatt SJ. A primer on the use of machine learning to distil knowledge from data in biological psychiatry. Mol Psychiatry 2024; 29:387-401. [PMID: 38177352 PMCID: PMC11228968 DOI: 10.1038/s41380-023-02334-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 09/21/2023] [Accepted: 11/17/2023] [Indexed: 01/06/2024]
Abstract
Applications of machine learning in the biomedical sciences are growing rapidly. This growth has been spurred by diverse cross-institutional and interdisciplinary collaborations, public availability of large datasets, an increase in the accessibility of analytic routines, and the availability of powerful computing resources. With this increased access and exposure to machine learning comes a responsibility for education and a deeper understanding of its bases and bounds, borne equally by data scientists seeking to ply their analytic wares in medical research and by biomedical scientists seeking to harness such methods to glean knowledge from data. This article provides an accessible and critical review of machine learning for a biomedically informed audience, as well as its applications in psychiatry. The review covers definitions and expositions of commonly used machine learning methods, and historical trends of their use in psychiatry. We also provide a set of standards, namely Guidelines for REporting Machine Learning Investigations in Neuropsychiatry (GREMLIN), for designing and reporting studies that use machine learning as a primary data-analysis approach. Lastly, we propose the establishment of the Machine Learning in Psychiatry (MLPsych) Consortium, enumerate its objectives, and identify areas of opportunity for future applications of machine learning in biological psychiatry. This review serves as a cautiously optimistic primer on machine learning for those on the precipice as they prepare to dive into the field, either as methodological practitioners or well-informed consumers.
Affiliation(s)
- Thomas P Quinn
- Applied Artificial Intelligence Institute (A2I2), Burwood, VIC, 3125, Australia
- Jonathan L Hess
- Department of Psychiatry and Behavioral Sciences, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, 13210, USA
- Victoria S Marshe
- Institute of Medical Science, University of Toronto, Toronto, ON, M5S 1A1, Canada
- Pharmacogenetics Research Clinic, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, M5S 1A1, Canada
- Michelle M Barnett
- School of Biomedical Sciences and Pharmacy, The University of Newcastle, Callaghan, NSW, 2308, Australia
- Precision Medicine Research Program, Hunter Medical Research Institute, Newcastle, NSW, 2308, Australia
- Anne-Christin Hauschild
- Department of Medical Informatics, Medical University Center Göttingen, Göttingen, Lower Saxony, 37075, Germany
- Malgorzata Maciukiewicz
- Hospital Zurich, University of Zurich, Zurich, 8091, Switzerland
- Department of Rheumatology and Immunology, University Hospital Bern, Bern, 3010, Switzerland
- Department for Biomedical Research (DBMR), University of Bern, Bern, 3010, Switzerland
- Samar S M Elsheikh
- Pharmacogenetics Research Clinic, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, M5S 1A1, Canada
- Xiaoyu Men
- Pharmacogenetics Research Clinic, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, M5S 1A1, Canada
- Department of Pharmacology and Toxicology, University of Toronto, Toronto, ON, M5S 1A1, Canada
- Emanuel Schwarz
- Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Mannheim, Baden-Württemberg, J5 68159, Germany
- Yannis J Trakadis
- Department Human Genetics, McGill University Health Centre, Montreal, QC, H4A 3J1, Canada
- Michael S Breen
- Psychiatry, Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Eric J Barnett
- Department of Neuroscience and Physiology, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, 13210, USA
- Yanli Zhang-James
- Department of Psychiatry and Behavioral Sciences, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, 13210, USA
- Mehmet Eren Ahsen
- Department of Business Administration, Gies College of Business, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA
- Department of Biomedical and Translational Sciences, Carle-Illinois School of Medicine, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA
- Han Cao
- Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Mannheim, Baden-Württemberg, J5 68159, Germany
- Junfang Chen
- Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Mannheim, Baden-Württemberg, J5 68159, Germany
- Jiahui Hou
- Department of Psychiatry and Behavioral Sciences, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, 13210, USA
- Department of Neuroscience and Physiology, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, 13210, USA
- Asif Salekin
- Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, 13244, USA
- Ping-I Lin
- Discipline of Psychiatry and Mental Health, University of New South Wales, Sydney, NSW, 2052, Australia
- Mental Health Research Unit, South Western Sydney Local Health District, Liverpool, NSW, 2170, Australia
- Andreas Meyer-Lindenberg
- Clinical Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Mannheim, Baden-Württemberg, J5 68159, Germany
- Isabelle Bichindaritz
- Biomedical and Health Informatics/Computer Science Department, State University of New York at Oswego, Oswego, NY, 13126, USA
- Intelligent Bio Systems Lab, State University of New York at Oswego, Oswego, NY, 13126, USA
- Stephen V Faraone
- Department of Psychiatry and Behavioral Sciences, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, 13210, USA
- Department of Neuroscience and Physiology, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, 13210, USA
- Murray J Cairns
- School of Biomedical Sciences and Pharmacy, The University of Newcastle, Callaghan, NSW, 2308, Australia
- Precision Medicine Research Program, Hunter Medical Research Institute, Newcastle, NSW, 2308, Australia
- Gaurav Pandey
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Daniel J Müller
- Pharmacogenetics Research Clinic, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, M5S 1A1, Canada
- Department of Psychiatry, University of Toronto, Toronto, ON, M5S 1A1, Canada
- Department of Psychiatry, Psychosomatics and Psychotherapy, Center of Mental Health, University Hospital of Würzburg, Würzburg, 97080, Germany
- Stephen J Glatt
- Department of Psychiatry and Behavioral Sciences, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, 13210, USA
- Department of Neuroscience and Physiology, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, 13210, USA
- Department of Public Health and Preventive Medicine, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, 13210, USA
6
Ibrahimi E, Lopes MB, Dhamo X, Simeon A, Shigdel R, Hron K, Stres B, D’Elia D, Berland M, Marcos-Zambrano LJ. Overview of data preprocessing for machine learning applications in human microbiome research. Front Microbiol 2023; 14:1250909. [PMID: 37869650 PMCID: PMC10588656 DOI: 10.3389/fmicb.2023.1250909] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Accepted: 09/22/2023] [Indexed: 10/24/2023] Open
Abstract
Although metagenomic sequencing is now the preferred technique to study microbiome-host interactions, analyzing and interpreting microbiome sequencing data presents challenges primarily attributed to the statistical specificities of the data (e.g., sparse, over-dispersed, compositional, inter-variable dependency). This mini review explores preprocessing and transformation methods applied in recent human microbiome studies to address microbiome data analysis challenges. Our results indicate a limited adoption of transformation methods targeting the statistical characteristics of microbiome sequencing data. Instead, there is a prevalent usage of relative and normalization-based transformations that do not account for the specific attributes of microbiome data. The information on preprocessing and transformations applied to the data before analysis was incomplete or missing in many publications, leading to reproducibility concerns, comparability issues, and questionable results. We hope this mini review will provide researchers and newcomers to the field of human microbiome research with an up-to-date point of reference for various data transformation tools and assist them in choosing the most suitable transformation method based on their research questions, objectives, and data characteristics.
Affiliation(s)
- Eliana Ibrahimi
- Department of Biology, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Marta B. Lopes
- Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
- UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
- Xhilda Dhamo
- Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Andrea Simeon
- BioSense Institute, University of Novi Sad, Novi Sad, Serbia
- Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway
- Karel Hron
- Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacký University Olomouc, Olomouc, Czechia
- Blaž Stres
- Department of Catalysis and Chemical Reaction Engineering, National Institute of Chemistry, Ljubljana, Slovenia
- Faculty of Civil and Geodetic Engineering, Institute of Sanitary Engineering, Ljubljana, Slovenia
- Department of Automation, Biocybernetics and Robotics, Jožef Stefan Institute, Ljubljana, Slovenia
- Department of Animal Science, Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
- Domenica D’Elia
- Department of Biomedical Sciences, National Research Council, Institute for Biomedical Technologies, Bari, Italy
- Magali Berland
- INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France
- Laura Judith Marcos-Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
7
Deschênes T, Tohoundjona FWE, Plante PL, Di Marzo V, Raymond F. Gene-based microbiome representation enhances host phenotype classification. mSystems 2023; 8:e0053123. [PMID: 37404032 PMCID: PMC10469787 DOI: 10.1128/msystems.00531-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Accepted: 05/24/2023] [Indexed: 07/06/2023] Open
Abstract
With the concomitant advances in both the microbiome and machine learning fields, the gut microbiome has become of great interest for the potential discovery of biomarkers to be used in the classification of the host health status. Shotgun metagenomics data derived from the human microbiome is composed of a high-dimensional set of microbial features. The use of such complex data for the modeling of host-microbiome interactions remains a challenge as retaining de novo content yields a highly granular set of microbial features. In this study, we compared the prediction performances of machine learning approaches according to different types of data representations derived from shotgun metagenomics. These representations include commonly used taxonomic and functional profiles and the more granular gene cluster approach. For the five case-control datasets used in this study (Type 2 diabetes, obesity, liver cirrhosis, colorectal cancer, and inflammatory bowel disease), gene-based approaches, whether used alone or in combination with reference-based data types, yielded classification performance similar to or better than that of the taxonomic and functional profiles. In addition, we show that using subsets of gene families from specific functional categories of genes highlights the importance of these functions on the host phenotype. This study demonstrates that both reference-free microbiome representations and curated metagenomic annotations can provide relevant representations for machine learning based on metagenomic data.
IMPORTANCE: Data representation is an essential part of machine learning performance when using metagenomic data. In this work, we show that different microbiome representations provide varied host phenotype classification performance depending on the dataset. In classification tasks, untargeted microbiome gene content can provide similar or improved classification compared to taxonomical profiling. Feature selection based on biological function also improves classification performance for some pathologies. Function-based feature selection combined with interpretable machine learning algorithms can generate new hypotheses that can potentially be assayed mechanistically. This work thus proposes new approaches to represent microbiome data for machine learning that can potentiate the findings associated with metagenomic data.
Affiliation(s)
- Thomas Deschênes
- Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
- Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada
- Institut Intelligence et Données, Université Laval, Québec, Canada
- Fred Wilfried Elom Tohoundjona
- Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
- Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada
- Pier-Luc Plante
- Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
- Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada
- Institut Intelligence et Données, Université Laval, Québec, Canada
- Vincenzo Di Marzo
- Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
- Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada
- École de nutrition, Faculté des sciences de l’agriculture et de l’alimentation (FSAA), Université Laval, Québec, Canada
- Centre de recherche de l’Institut universitaire de cardiologie et de pneumologie de Québec (IUCPQ), Québec, Canada
- Département de médecine, Faculté de Médecine, Université Laval, Québec, Canada
- Joint International Unit on Chemical and Biomolecular Research on the Microbiome and its Impact on Metabolic Health and Nutrition (UMI-MicroMeNu), Quebec City, Canada
- Frédéric Raymond
- Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
- Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada
- Institut Intelligence et Données, Université Laval, Québec, Canada
- École de nutrition, Faculté des sciences de l’agriculture et de l’alimentation (FSAA), Université Laval, Québec, Canada
8
Shtossel O, Isakov H, Turjeman S, Koren O, Louzoun Y. Ordering taxa in image convolution networks improves microbiome-based machine learning accuracy. Gut Microbes 2023; 15:2224474. [PMID: 37345233 PMCID: PMC10288916 DOI: 10.1080/19490976.2023.2224474] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 12/15/2022] [Accepted: 06/08/2023] [Indexed: 06/23/2023] Open
Abstract
The human gut microbiome is associated with a large number of disease etiologies. As such, it is a natural candidate for machine-learning-based biomarker development for multiple diseases and conditions. The microbiome is often analyzed using 16S rRNA gene sequencing or shotgun metagenomics. However, several properties of microbial sequence-based studies hinder machine learning (ML), including non-uniform representation, a small number of samples compared with the dimension of each sample, and sparsity of the data, with the majority of taxa present in a small subset of samples. We show here using a graph representation that the cladogram structure is as informative as the taxa frequency. We then suggest a novel method to combine information from different taxa and improve data representation for ML using microbial taxonomy. iMic (image microbiome) translates the microbiome to images through an iterative ordering scheme, and applies convolutional neural networks to the resulting image. We show that iMic has a higher precision in static microbiome gene sequence-based ML than state-of-the-art methods. iMic also facilitates the interpretation of the classifiers by applying an explainable artificial intelligence (AI) algorithm to iMic to detect taxa relevant to each condition. iMic is then extended to dynamic microbiome samples by translating them to movies.
Affiliation(s)
- Oshrit Shtossel
  - Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
- Haim Isakov
  - Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
- Sondra Turjeman
  - The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Omry Koren
  - The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Yoram Louzoun
  - Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
|
9
|
Loganathan T, Priya Doss C G. The influence of machine learning technologies in gut microbiome research and cancer studies - A review. Life Sci 2022; 311:121118. [DOI: 10.1016/j.lfs.2022.121118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 10/19/2022] [Accepted: 10/19/2022] [Indexed: 11/18/2022]
|
10
|
Heisel T, Johnson AJ, Gonia S, Dillon A, Skalla E, Haapala J, Jacobs KM, Nagel E, Pierce S, Fields D, Demerath E, Knights D, Gale CA. Bacterial, fungal, and interkingdom microbiome features of exclusively breastfeeding dyads are associated with infant age, antibiotic exposure, and birth mode. Front Microbiol 2022; 13:1050574. [PMID: 36466688 PMCID: PMC9714262 DOI: 10.3389/fmicb.2022.1050574] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 09/21/2022] [Accepted: 10/26/2022] [Indexed: 11/19/2022]
Abstract
The composition and function of early life gut bacterial communities (microbiomes) have been proposed to modulate health for the long term. In addition to bacteria, fungi (mycobiomes) also colonize the early life gut and have been implicated in health disorders such as asthma and obesity. Despite the potential importance of mycobiomes in health, there has been a lack of study regarding fungi and their interkingdom interactions with bacteria during infancy. The goal of this study was to obtain a more complete understanding of microbial communities thought to be relevant for the early life programming of health. Breastmilk and infant feces were obtained from a unique cohort of healthy, exclusively breastfeeding dyads recruited as part of the Mothers and Infants Linked for Healthy Growth (MILk) study with microbial taxa characterized using amplicon-based sequencing approaches. Bacterial and fungal communities in breastmilk were both distinct from those of infant feces, consistent with niche-specific microbial community development. Nevertheless, overlap was observed among sample types (breastmilk, 1-month feces, 6-month feces) with respect to the taxa that were the most prevalent and abundant. Self-reported antibacterial antibiotic exposure was associated with micro- as well as mycobiome variation, which depended upon the subject receiving antibiotics (mother or infant), timing of exposure (prenatal, peri- or postpartum), and sample type. In addition, birth mode was associated with bacterial and fungal community variation in infant feces, but not breastmilk. Correlations between bacterial and fungal taxa abundances were identified in all sample types. For infant feces, congruency between bacterial and fungal communities was higher for older infants, consistent with the idea of co-maturation of bacterial and fungal gut communities. Interkingdom connectedness also tended to be higher in older infants. 
Additionally, higher interkingdom connectedness was associated with Cesarean section birth and with antibiotic exposure for microbial communities of both breastmilk and infant feces. Overall, these results implicate infant age, birth mode, and antibiotic exposure in bacterial, fungal and interkingdom relationship variation in early-life-relevant microbiomes, expanding the current literature beyond bacteria.
Affiliation(s)
- Timothy Heisel
  - Department of Pediatrics, University of Minnesota, Minneapolis, MN, United States
- Abigail J. Johnson
  - School of Public Health, University of Minnesota, Minneapolis, MN, United States
- Sara Gonia
  - Department of Pediatrics, University of Minnesota, Minneapolis, MN, United States
- Abrielle Dillon
  - Department of Pediatrics, University of Minnesota, Minneapolis, MN, United States
- Emily Skalla
  - Department of Pediatrics, University of Minnesota, Minneapolis, MN, United States
  - School of Public Health, University of Minnesota, Minneapolis, MN, United States
- Jacob Haapala
  - School of Public Health, University of Minnesota, Minneapolis, MN, United States
  - HealthPartners Institute, Minneapolis, MN, United States
- Katherine M. Jacobs
  - Department of Obstetrics, Gynecology, and Women’s Health, University of Minnesota, Minneapolis, MN, United States
- Emily Nagel
  - School of Public Health, University of Minnesota, Minneapolis, MN, United States
- Stephanie Pierce
  - College of Medicine, University of Oklahoma, Oklahoma City, OK, United States
- David Fields
  - College of Medicine, University of Oklahoma, Oklahoma City, OK, United States
- Ellen Demerath
  - School of Public Health, University of Minnesota, Minneapolis, MN, United States
- Dan Knights
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, United States
- Cheryl A. Gale
  - Department of Pediatrics, University of Minnesota, Minneapolis, MN, United States
  - *Correspondence: Cheryl A. Gale
|
11
|
Boyraz A, Pawlowsky-Glahn V, Egozcue JJ, Acar AC. Principal microbial groups: compositional alternative to phylogenetic grouping of microbiome data. Brief Bioinform 2022; 23:6675749. [PMID: 36007229 DOI: 10.1093/bib/bbac328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Revised: 07/19/2022] [Accepted: 07/20/2022] [Indexed: 11/13/2022]
Abstract
Statistical and machine learning techniques based on relative abundances have been used to predict health conditions and to identify microbial biomarkers. However, high dimensionality, sparsity and the compositional nature of microbiome data represent statistical challenges. On the other hand, the taxon grouping allows summarizing microbiome abundance with a coarser resolution in a lower dimension, but it presents new challenges when correlating taxa with a disease. In this work, we present a novel approach that groups Operational Taxonomical Units (OTUs) based only on relative abundances as an alternative to taxon grouping. The proposed procedure acknowledges the compositional nature of the data by making use of principal balances. The identified groups are called Principal Microbial Groups (PMGs). The procedure reduces the need for user-defined aggregation of OTUs and offers the possibility of working with coarse groups of OTUs, which are not present in a phylogenetic tree. PMGs can be used for two different goals: (1) as a dimensionality reduction method for compositional data, (2) as an aggregation procedure that provides an alternative to taxon grouping for construction of microbial balances afterward used for disease prediction. We illustrate the procedure with data from a cirrhosis study. PMGs provide a coherent data analysis for the search of biomarkers in human microbiota. The source code and demo data for PMGs are available at: https://github.com/asliboyraz/PMGs.
Affiliation(s)
- Aslı Boyraz
  - Department of Computer Programming, Recep Tayyip Erdoğan University, Ardeşen Vocational School, Rize, 53400, Turkey
- Vera Pawlowsky-Glahn
  - Department of Computer Sciences, Applied Mathematics and Statistics, University of Girona, Campus Montilivi, 17003 Girona, Spain
- Juan José Egozcue
  - Department of Civil and Environmental Engineering, Universitat Politécnica de Catalunya, Barcelona, 08034, Spain
- Aybar Can Acar
  - Department of Medical Informatics, Middle East Technical University, Ankara, Turkey
|
12
|
Leske M, Bottacini F, Afli H, Andrade BGN. BiGAMi: Bi-Objective Genetic Algorithm Fitness Function for Feature Selection on Microbiome Datasets. Methods Protoc 2022; 5:42. [PMID: 35645350 PMCID: PMC9149982 DOI: 10.3390/mps5030042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2022] [Revised: 05/16/2022] [Accepted: 05/18/2022] [Indexed: 11/23/2022]
Abstract
The relationship between the host and the microbiome, or the assemblage of microorganisms (including bacteria, archaea, fungi, and viruses), has been proven crucial for its health and disease development. The high dimensionality of microbiome datasets has often been addressed as a major difficulty for data analysis, such as the use of machine-learning (ML) and deep-learning (DL) models. Here, we present BiGAMi, a bi-objective genetic algorithm fitness function for feature selection in microbial datasets to train high-performing phenotype classifiers. The proposed fitness function allowed us to build classifiers that outperformed the baseline performance estimated by the original studies by using as few as 0.04% to 2.32% features of the original dataset. In 35 out of 42 performance comparisons between BiGAMi and other feature selection methods evaluated here (sequential forward selection, SelectKBest, and GARS), BiGAMi achieved its results by selecting 6-93% fewer features. This study showed that the application of a bi-objective GA fitness function against microbiome datasets succeeded in selecting small subsets of bacteria whose contribution to understood diseases and the host state was already experimentally proven. Applying this feature selection approach to novel diseases is expected to quickly reveal the microbes most relevant to a specific condition.
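The idea of scoring a candidate feature subset on two objectives at once (classifier performance and subset size) can be illustrated with a minimal Python sketch. This is not the BiGAMi fitness function itself, whose exact form the abstract does not give. The names `bi_objective_fitness`, `dominates`, and `toy_score` are hypothetical, and the scoring callback stands in for whatever cross-validated classifier metric would be used in practice.

```python
from typing import Callable, Sequence, Tuple

# (classifier score to maximize, number of selected features to minimize)
Fitness = Tuple[float, int]


def bi_objective_fitness(
    mask: Sequence[bool],
    score_fn: Callable[[Sequence[bool]], float],
) -> Fitness:
    """Evaluate one candidate feature subset on both objectives."""
    n_selected = sum(mask)
    if n_selected == 0:
        # Assumption: an empty subset gets the worst value on both objectives.
        return (0.0, len(mask))
    return (score_fn(mask), n_selected)


def dominates(a: Fitness, b: Fitness) -> bool:
    """True if a is at least as good as b on both objectives and better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def toy_score(mask: Sequence[bool]) -> float:
    """Hypothetical stand-in for a classifier metric: only feature 0 is informative."""
    return 0.9 if mask[0] else 0.5


small = bi_objective_fitness([True, False, False], toy_score)
large = bi_objective_fitness([True, True, True], toy_score)
print(small, large, dominates(small, large))
```

Here the one-feature subset reaches the same toy score as the three-feature subset, so it dominates it; a genetic algorithm using such a comparison would favor the smaller subset.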
Affiliation(s)
- Mike Leske
  - Department of Computer Sciences, Munster Technological University, MTU/ADAPT, T12 P928 Cork, Ireland
- Francesca Bottacini
  - Department of Biological Sciences, Munster Technological University, MTU, T12 P928 Cork, Ireland
- Haithem Afli
  - Department of Computer Sciences, Munster Technological University, MTU/ADAPT, T12 P928 Cork, Ireland
- Bruno G. N. Andrade
  - Department of Computer Sciences, Munster Technological University, MTU/ADAPT, T12 P928 Cork, Ireland
|
13
|
Agostinetto G, Bozzi D, Porro D, Casiraghi M, Labra M, Bruno A. SKIOME Project: a curated collection of skin microbiome datasets enriched with study-related metadata. Database (Oxford) 2022; 2022:6586378. [PMID: 35576001 PMCID: PMC9216470 DOI: 10.1093/database/baac033] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/08/2021] [Revised: 02/25/2022] [Accepted: 05/09/2022] [Indexed: 04/07/2023]
Abstract
Large amounts of data from microbiome-related studies have been (and are currently being) deposited on international public databases. These datasets represent a valuable resource for the microbiome research community and could serve future researchers interested in integrating multiple datasets into powerful meta-analyses. However, this huge amount of data lacks harmonization and it is far from being completely exploited in its full potential to build a foundation that places microbiome research at the nexus of many subdisciplines within and beyond biology. Thus, it urges the need for data accessibility and reusability, according to findable, accessible, interoperable and reusable (FAIR) principles, as supported by National Microbiome Data Collaborative and FAIR Microbiome. To tackle the challenge of accelerating discovery and advances in skin microbiome research, we collected, integrated and organized existing microbiome data resources from human skin 16S rRNA amplicon-sequencing experiments. We generated a comprehensive collection of datasets, enriched in metadata, and organized this information into data frames ready to be integrated into microbiome research projects and advanced post-processing analyses, such as data science applications (e.g. machine learning). Furthermore, we have created a data retrieval and curation framework built on three different stages to maximize the retrieval of datasets and metadata associated with them. Lastly, we highlighted some caveats regarding metadata retrieval and suggested ways to improve future metadata submissions. Overall, our work resulted in a curated skin microbiome datasets collection accompanied by a state-of-the-art analysis of the last 10 years of the skin microbiome field. Database URL: https://github.com/giuliaago/SKIOMEMetadataRetrieval.
Affiliation(s)
- Giulia Agostinetto
  - *Corresponding author: Giulia Agostinetto. E-mail: and Antonia Bruno. Tel: +0039 0264483413; E-mail:
- Danilo Porro
  - Department of Biotechnology and Biosciences, University of Milano-Bicocca, Piazza della Scienza, 2, Milan 20126, Italy
  - Institute of Molecular Bioimaging and Physiology (IBFM), National Research Council (CNR), via Fratelli Cervi, 93, Segrate (MI) 20054, Italy
- Maurizio Casiraghi
  - Department of Biotechnology and Biosciences, University of Milano-Bicocca, Piazza della Scienza, 2, Milan 20126, Italy
- Massimo Labra
  - Department of Biotechnology and Biosciences, University of Milano-Bicocca, Piazza della Scienza, 2, Milan 20126, Italy
- Antonia Bruno
  - *Corresponding author: Giulia Agostinetto. E-mail: and Antonia Bruno. Tel: +0039 0264483413; E-mail:
|
14
|
Chen X, Zhu Z, Zhang W, Wang Y, Wang F, Yang J, Wong KC. Human disease prediction from microbiome data by multiple feature fusion and deep learning. iScience 2022; 25:104081. [PMID: 35372808 PMCID: PMC8971930 DOI: 10.1016/j.isci.2022.104081] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/11/2021] [Revised: 09/16/2021] [Accepted: 03/13/2022] [Indexed: 10/29/2022]
Abstract
Human disease prediction from microbiome data has broad implications in metagenomics. It is rare for the existing methods to consider abundance profiles from both known and unknown microbial organisms, or capture the taxonomic relationships among microbial taxa, leading to significant information loss. On the other hand, deep learning has shown unprecedented advantages in classification tasks for its feature-learning ability. However, it encounters the opposite situation in metagenome-based disease prediction since high-dimensional low-sample-size metagenomic datasets can lead to severe overfitting; and black-box model fails in providing biological explanations. To circumvent the related problems, we developed MetaDR, a comprehensive machine learning-based framework that integrates various information and deep learning to predict human diseases. Experimental results indicate that MetaDR achieves competitive prediction performance with a reduction in running time, and effectively discovers the informative features with biological insights.
Affiliation(s)
- Xingjian Chen
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Zifan Zhu
  - Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
- Weitong Zhang
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Yuchen Wang
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Fuzhou Wang
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Jianyi Yang
  - School of Mathematical Sciences, Nankai University, Tianjin, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
  - Hong Kong Institute for Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
|
15
|
Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa. PLoS Comput Biol 2022; 18:e1010066. [PMID: 35446845 PMCID: PMC9064115 DOI: 10.1371/journal.pcbi.1010066] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/13/2021] [Revised: 05/03/2022] [Accepted: 03/29/2022] [Indexed: 12/14/2022]
Abstract
Machine learning-based classification approaches are widely used to predict host phenotypes from microbiome data. Classifiers are typically employed by considering operational taxonomic units or relative abundance profiles as input features. Such types of data are intrinsically sparse, which opens the opportunity to make predictions from the presence/absence rather than the relative abundance of microbial taxa. This also poses the question whether it is the presence rather than the abundance of particular taxa to be relevant for discrimination purposes, an aspect that has been so far overlooked in the literature. In this paper, we aim at filling this gap by performing a meta-analysis on 4,128 publicly available metagenomes associated with multiple case-control studies. At species-level taxonomic resolution, we show that it is the presence rather than the relative abundance of specific microbial taxa to be important when building classification models. Such findings are robust to the choice of the classifier and confirmed by statistical tests applied to identifying differentially abundant/present taxa. Results are further confirmed at coarser taxonomic resolutions and validated on 4,026 additional 16S rRNA samples coming from 30 public case-control studies. The composition of the human microbiome has been linked to a large number of different diseases. In this context, classification methodologies based on machine learning approaches have represented a promising tool for diagnostic purposes from metagenomics data. The link between microbial population composition and host phenotypes has been usually performed by considering taxonomic profiles represented by relative abundances of microbial species. In this study, we show that it is more the presence rather than the relative abundance of microbial taxa to be relevant to maximize classification accuracy. 
This is accomplished by conducting a meta-analysis on more than 4,000 shotgun metagenomes coming from 25 case-control studies and in which original relative abundance data are degraded to presence/absence profiles. Findings are also extended to 16S rRNA data and advance the research field in building prediction models directly from human microbiome data.
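The degradation of relative abundance data to presence/absence profiles described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `to_presence_absence`, the detection `threshold` parameter, and the toy taxa and values are all hypothetical, and the abstract does not say which cutoff the authors applied.

```python
from typing import Dict, List


def to_presence_absence(
    profile: Dict[str, float], threshold: float = 0.0
) -> Dict[str, int]:
    """Map each taxon's relative abundance to 1 (present) or 0 (absent).

    A taxon counts as present when its abundance is strictly above
    `threshold` (an assumed cutoff, 0.0 by default).
    """
    return {taxon: int(abundance > threshold) for taxon, abundance in profile.items()}


# Toy relative-abundance profiles for two samples (made-up values).
samples: List[Dict[str, float]] = [
    {"Bacteroides": 0.62, "Prevotella": 0.0, "Faecalibacterium": 0.38},
    {"Bacteroides": 0.05, "Prevotella": 0.90, "Faecalibacterium": 0.05},
]

binary_profiles = [to_presence_absence(sample) for sample in samples]
print(binary_profiles)
```

The binary profiles can then be fed to the same classifier as the original abundances, which is the kind of comparison the abstract reports.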
|
16
|
Chen Y, Li J, Zhang Y, Zhang M, Sun Z, Jing G, Huang S, Su X. Parallel-Meta Suite: Interactive and rapid microbiome data analysis on multiple platforms. iMeta 2022; 1:e1. [PMID: 38867729 PMCID: PMC10989749 DOI: 10.1002/imt2.1] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 11/30/2021] [Revised: 12/13/2021] [Accepted: 12/17/2021] [Indexed: 06/14/2024]
Abstract
Massive microbiome sequencing data has been generated, which elucidates associations between microbes and their environmental phenotypes such as host health or ecosystem status. Outstanding bioinformatic tools are the basis to decipher the biological information hidden under microbiome data. However, most approaches placed difficulties on the accessibility to nonprofessional users. On the other side, the computing throughput has become a significant bottleneck of many analytical pipelines in processing large-scale datasets. In this study, we introduce Parallel-Meta Suite (PMS), an interactive software package for fast and comprehensive microbiome data analysis, visualization, and interpretation. It covers a wide array of functions for data preprocessing, statistics, visualization by state-of-the-art algorithms in a user-friendly graphical interface, which is accessible to diverse users. To meet the rapidly increasing computational demands, the entire procedure of PMS has been optimized by a parallel computing scheme, enabling the rapid processing of thousands of samples. PMS is compatible with multiple platforms, and an installer has been integrated for full-automatic installation.
Affiliation(s)
- Yuzhu Chen
  - College of Computer Science and Technology, Qingdao University, Qingdao, Shandong, China
- Jian Li
  - College of Computer Science and Technology, Qingdao University, Qingdao, Shandong, China
- Yufeng Zhang
  - College of Computer Science and Technology, Qingdao University, Qingdao, Shandong, China
- Mingqian Zhang
  - College of Computer Science and Technology, Qingdao University, Qingdao, Shandong, China
- Zheng Sun
  - Single‐Cell Center, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong, China
- Gongchao Jing
  - Single‐Cell Center, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong, China
- Shi Huang
  - Faculty of Dentistry, The University of Hong Kong, Hong Kong, Hong Kong SAR, China
- Xiaoquan Su
  - College of Computer Science and Technology, Qingdao University, Qingdao, Shandong, China
  - Single‐Cell Center, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong, China
|
17
|
Liu B, Huang L, Liu Z, Pan X, Cui Z, Pan J, Xie L. EasyMicroPlot: An Efficient and Convenient R Package in Microbiome Downstream Analysis and Visualization for Clinical Study. Front Genet 2022; 12:803627. [PMID: 35058973 PMCID: PMC8764268 DOI: 10.3389/fgene.2021.803627] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/28/2021] [Accepted: 12/02/2021] [Indexed: 01/03/2023]
Abstract
Advances in next-generation sequencing (NGS) have revolutionized microbial studies in many fields, especially in clinical investigation. As the second human genome, microbiota has been recognized as a new approach and perspective to understand the biological and pathologic basis of various diseases. However, massive amounts of sequencing data remain a huge challenge to researchers, especially those who are unfamiliar with microbial data analysis. The mathematic algorithm and approaches introduced from another scientific field will bring a bewildering array of computational tools and acquire higher quality of script experience. Moreover, a large cohort research together with extensive meta-data including age, body mass index (BMI), gender, medical results, and others related to subjects also aggravate this situation. Thus, it is necessary to develop an efficient and convenient software for clinical microbiome data analysis. EasyMicroPlot (EMP) package aims to provide an easy-to-use microbial analysis tool based on R platform that accomplishes the core tasks of metagenomic downstream analysis, specially designed by incorporation of popular microbial analysis and visualization used in clinical microbial studies. To illustrate how EMP works, 694 bio-samples from Guangdong Gut Microbiome Project (GGMP) were selected and analyzed with EMP package. Our analysis demonstrated the influence of dietary style on gut microbiota and proved EMP package's powerful ability and excellent convenience to address problems for this field.
Affiliation(s)
- Bingdong Liu
  - The First Affiliated Hospital of Jinan University, Guangzhou, China
  - State Key Laboratory of Applied Microbiology Southern China, Guangdong Provincial Key Laboratory of Microbial Culture Collection and Application, Guangdong Open Laboratory of Applied Microbiology, Institute of Microbiology, Guangdong Academy of Sciences, Guangzhou, China
- Liujing Huang
  - State Key Laboratory of Applied Microbiology Southern China, Guangdong Provincial Key Laboratory of Microbial Culture Collection and Application, Guangdong Open Laboratory of Applied Microbiology, Institute of Microbiology, Guangdong Academy of Sciences, Guangzhou, China
  - Zhujiang Hospital, Southern Medical University, Guangzhou, China
- Zhihong Liu
  - State Key Laboratory of Applied Microbiology Southern China, Guangdong Provincial Key Laboratory of Microbial Culture Collection and Application, Guangdong Open Laboratory of Applied Microbiology, Institute of Microbiology, Guangdong Academy of Sciences, Guangzhou, China
- Xiaohan Pan
  - Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Zongbing Cui
  - State Key Laboratory of Applied Microbiology Southern China, Guangdong Provincial Key Laboratory of Microbial Culture Collection and Application, Guangdong Open Laboratory of Applied Microbiology, Institute of Microbiology, Guangdong Academy of Sciences, Guangzhou, China
- Jiyang Pan
  - The First Affiliated Hospital of Jinan University, Guangzhou, China
- Liwei Xie
  - State Key Laboratory of Applied Microbiology Southern China, Guangdong Provincial Key Laboratory of Microbial Culture Collection and Application, Guangdong Open Laboratory of Applied Microbiology, Institute of Microbiology, Guangdong Academy of Sciences, Guangzhou, China
  - Zhujiang Hospital, Southern Medical University, Guangzhou, China
  - School of Public Health, Xinxiang Medical University, Xinxiang, China
|
18
|
Zhao L, Cho WC, Nicolls MR. Colorectal Cancer-Associated Microbiome Patterns and Signatures. Front Genet 2022; 12:787176. [PMID: 35003221 PMCID: PMC8729777 DOI: 10.3389/fgene.2021.787176] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 09/30/2021] [Accepted: 12/07/2021] [Indexed: 01/02/2023]
Abstract
The gut microbiome is dynamic and shaped by diet, age, geography, and environment. The disruption of normal gut microbiota (dysbiosis) is closely related to colorectal cancer (CRC) risk and progression. To better identify and characterize CRC-associated dysbiosis, we collected six independent cohorts with matched normal pairs (when available) for comparison and exploration of the microbiota and their interactions with the host. Comparing the microbial community compositions between cancerous and adjacent noncancerous tissues, we found that more microbes were depleted than enriched in tumors. Despite taxonomic variations among cohorts, consistent depletion of normal microbiota (members of Clostridia and Bacteroidia) and significant enrichment of oral-originated pathogens (such as Fusobacterium nucleatum and Parvimonas micra) were observed in CRC compared to normal tissues. Sets of hub and hub-connecting microbes were subsequently identified to infer microbe-microbe interaction networks in CRC. Furthermore, biclustering was used for identifying coherent patterns between patients and microbes. Two patient-microbe interaction patterns, named P0 and P1, can be consistently identified among the investigated six CRC cohorts. Characterization of the microbial community composition of the two patterns revealed that patients in P0 and P1 differed significantly in microbial alpha and beta diversity, and CRC‐associated microbiota changes consist of continuous populations of widespread taxa rather than discrete enterotypes. In contrast to the P0, the patients in P1 have reduced microbial alpha diversity compared to the adjacent normal tissues, and P1 possesses more oral-related pathogens than P0 and controls. Collectively, our study investigated the CRC-associated microbiome changes, and identified reproducible microbial signatures across multiple independent cohorts. 
More importantly, we revealed that the CRC heterogeneity can be partially attributed to the variety and compositional differences of microbes and their interactions to humans.
Affiliation(s)
- Lan Zhao
  - Department of Medicine, Stanford University School of Medicine, Stanford, CA, United States
  - VA Palo Alto Health Care System, Palo Alto, CA, United States
- William C Cho
  - Department of Clinical Oncology, Queen Elizabeth Hospital, Hong Kong, China
- Mark R Nicolls
  - Department of Medicine, Stanford University School of Medicine, Stanford, CA, United States
  - VA Palo Alto Health Care System, Palo Alto, CA, United States
|
19
|
Giulia A, Anna S, Antonia B, Dario P, Maurizio C. Extending Association Rule Mining to Microbiome Pattern Analysis: Tools and Guidelines to Support Real Applications. Front Bioinform 2022; 1:794547. [PMID: 36303759 PMCID: PMC9580939 DOI: 10.3389/fbinf.2021.794547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Accepted: 12/07/2021] [Indexed: 11/24/2022]
Abstract
Boosted by the exponential growth of microbiome-based studies, analyzing microbiome patterns is now a hot-topic, finding different fields of application. In particular, the use of machine learning techniques is increasing in microbiome studies, providing deep insights into microbial community composition. In this context, in order to investigate microbial patterns from 16S rRNA metabarcoding data, we explored the effectiveness of Association Rule Mining (ARM) technique, a supervised-machine learning procedure, to extract patterns (in this work, intended as groups of species or taxa) from microbiome data. ARM can generate huge amounts of data, making spurious information removal and visualizing results challenging. Our work sheds light on the strengths and weaknesses of pattern mining strategy into the study of microbial patterns, in particular from 16S rRNA microbiome datasets, applying ARM on real case studies and providing guidelines for future usage. Our results highlighted issues related to the type of input and the use of metadata in microbial pattern extraction, identifying the key steps that must be considered to apply ARM consciously on 16S rRNA microbiome data. To promote the use of ARM and the visualization of microbiome patterns, specifically, we developed microFIM (microbial Frequent Itemset Mining), a versatile Python tool that facilitates the use of ARM integrating common microbiome outputs, such as taxa tables. microFIM implements interest measures to remove spurious information and merges the results of ARM analysis with the common microbiome outputs, providing similar microbiome strategies that help scientists to integrate ARM in microbiome applications. With this work, we aimed at creating a bridge between microbial ecology researchers and ARM technique, making researchers aware about the strength and weaknesses of association rule mining approach.
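The two basic quantities behind association rule mining, support and confidence, can be shown on a toy presence/absence table. The Python sketch below is a generic illustration and does not use or reproduce the microFIM API; the function names, the taxa labels, and the four example samples are hypothetical.

```python
from typing import List, Set


def support(itemset: Set[str], samples: List[Set[str]]) -> float:
    """Fraction of samples that contain every taxon in `itemset`."""
    return sum(1 for sample in samples if itemset <= sample) / len(samples)


def confidence(antecedent: Set[str], consequent: Set[str], samples: List[Set[str]]) -> float:
    """Estimated P(consequent present | antecedent present) for the rule A -> C."""
    antecedent_support = support(antecedent, samples)
    if antecedent_support == 0:
        return 0.0  # assumed convention: the rule never fires, so report zero
    return support(antecedent | consequent, samples) / antecedent_support


# Each sample is the set of taxa detected in it (made-up data).
samples = [
    {"taxon_A", "taxon_B"},
    {"taxon_A", "taxon_B", "taxon_C"},
    {"taxon_A"},
    {"taxon_C"},
]

pair_support = support({"taxon_A", "taxon_B"}, samples)
rule_confidence = confidence({"taxon_A"}, {"taxon_B"}, samples)
print(pair_support, rule_confidence)
```

Interest measures such as these are what allow spurious patterns to be filtered out, since the number of candidate itemsets grows quickly with the number of taxa.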
Affiliation(s)
- Agostinetto Giulia
- Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy
- *Correspondence: Agostinetto Giulia,
| | | | - Bruno Antonia
- Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy
| | - Pescini Dario
- Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy
| | - Casiraghi Maurizio
- Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy
20
Gordon-Rodriguez E, Quinn TP, Cunningham JP. Learning sparse log-ratios for high-throughput sequencing data. Bioinformatics 2021; 38:157-163. [PMID: 34498030 PMCID: PMC8696089 DOI: 10.1093/bioinformatics/btab645] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 06/06/2021] [Revised: 08/09/2021] [Accepted: 09/03/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets. RESULTS Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods. AVAILABILITY AND IMPLEMENTATION The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
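As a small illustration of what a log-ratio biomarker is, the sketch below scores one sample as the log of the ratio between the geometric means of two groups of features. This is only the scoring step under assumed inputs: the feature indices, the pseudocount and the function name `log_ratio` are hypothetical, and the sketch does not reproduce CoDaCoRe's learning procedure, which according to the abstract selects the groups through a continuous relaxation optimized by gradient descent.

```python
import math


def log_ratio(counts, numerator, denominator, pseudocount=1.0):
    """Log-ratio between the geometric means of two groups of features.

    `numerator` and `denominator` are lists of feature indices; the
    pseudocount guards against log(0) for unobserved features.
    """
    def mean_log(indices):
        return sum(math.log(counts[i] + pseudocount) for i in indices) / len(indices)

    return mean_log(numerator) - mean_log(denominator)


# Hypothetical read counts for five taxa in one sample.
sample = [9, 99, 0, 4, 24]
score = log_ratio(sample, numerator=[0, 1], denominator=[3, 4])

# Without a pseudocount the score is unchanged when sequencing depth changes,
# which is why log-ratios suit compositional data.
raw = log_ratio(sample, [0, 1], [3, 4], pseudocount=0.0)
deeper = log_ratio([10 * c for c in sample], [0, 1], [3, 4], pseudocount=0.0)
```

A single score of this kind can then be fed to an ordinary linear or logistic model; the hard part, which the abstract addresses, is choosing which features go in the numerator and the denominator.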
Affiliation(s)
- Thomas P Quinn
- Applied Artificial Intelligence Institute, Deakin University, Geelong, VIC 3126, Australia
- John P Cunningham
- Department of Statistics, Columbia University, New York, NY 10025, USA
21
Dupras C, Bunnik EM. Toward a Framework for Assessing Privacy Risks in Multi-Omic Research and Databases. The American Journal of Bioethics: AJOB 2021; 21:46-64. [PMID: 33433298 DOI: 10.1080/15265161.2020.1863516] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/12/2023]
Abstract
While the accumulation and increased circulation of genomic data have captured much attention over the past decade, privacy risks raised by the diversification and integration of omics have been largely overlooked. In this paper, we propose the outline of a framework for assessing privacy risks in multi-omic research and databases. Following a comparison of privacy risks associated with genomic and epigenomic data, we dissect ten privacy risk-impacting omic data properties that affect either the risk of re-identification of research participants, or the sensitivity of the information potentially conveyed by biological data. We then propose a three-step approach for the assessment of privacy risks in the multi-omic era. Thus, we lay grounds for a data property-based, 'pan-omic' approach that moves away from genetic exceptionalism. We conclude by inviting our peers to refine these theoretical foundations, put them to the test in their respective fields, and translate our approach into practical guidance.
22
Yang F, Zou Q. mAML: an automated machine learning pipeline with a microbiome repository for human disease classification. Database: The Journal of Biological Databases and Curation 2021; 2020:5862399. [PMID: 32588040 PMCID: PMC7316531 DOI: 10.1093/database/baaa050] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 02/11/2020] [Revised: 05/27/2020] [Accepted: 06/03/2020] [Indexed: 12/20/2022]
Abstract
Due to the concerted efforts to utilize the microbial features to improve disease prediction capabilities, automated machine learning (AutoML) systems aiming to get rid of the tediousness in manually performing ML tasks are in great demand. Here we developed mAML, an ML model-building pipeline, which can automatically and rapidly generate optimized and interpretable models for personalized microbiome-based classification tasks in a reproducible way. The pipeline is deployed on a web-based platform, while the server is user-friendly and flexible and has been designed to be scalable according to the specific requirements. This pipeline exhibits high performance for 13 benchmark datasets including both binary and multi-class classification tasks. In addition, to facilitate the application of mAML and expand the human disease-related microbiome learning repository, we developed GMrepo ML repository (GMrepo Microbiome Learning repository) from the GMrepo database. The repository involves 120 microbiome-based classification tasks for 85 human-disease phenotypes referring to 12 429 metagenomic samples and 38 643 amplicon samples. The mAML pipeline and the GMrepo ML repository are expected to be important resources for researches in microbiology and algorithm developments. Database URL: http://lab.malab.cn/soft/mAML
Affiliation(s)
- Fenglong Yang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 4, Section 2, North Jianshe Road, Chengdu 610054, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 4, Section 2, North Jianshe Road, Chengdu 610054, China
23
García-Jiménez B, Muñoz J, Cabello S, Medina J, Wilkinson MD. Predicting microbiomes through a deep latent space. Bioinformatics 2021; 37:1444-1451. [PMID: 33289510 PMCID: PMC8208755 DOI: 10.1093/bioinformatics/btaa971] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/17/2020] [Revised: 10/21/2020] [Accepted: 11/06/2020] [Indexed: 12/28/2022] Open
Abstract
Motivation Microbial communities influence their environment by modifying the availability of compounds, such as nutrients or chemical elicitors. Knowing the microbial composition of a site is therefore relevant to improve productivity or health. However, sequencing facilities are not always available, or may be prohibitively expensive in some cases. Thus, it would be desirable to computationally predict the microbial composition from more accessible, easily-measured features. Results Integrating deep learning techniques with microbiome data, we propose an artificial neural network architecture based on heterogeneous autoencoders to condense the long vector of microbial abundance values into a deep latent space representation. Then, we design a model to predict the deep latent space and, consequently, to predict the complete microbial composition using environmental features as input. The performance of our system is examined using the rhizosphere microbiome of Maize. We reconstruct the microbial composition (717 taxa) from the deep latent space (10 values) with high fidelity (>0.9 Pearson correlation). We then successfully predict microbial composition from environmental variables, such as plant age, temperature or precipitation (0.73 Pearson correlation, 0.42 Bray–Curtis). We extend this to predict microbiome composition under hypothetical scenarios, such as future climate change conditions. Finally, via transfer learning, we predict microbial composition in a distinct scenario with only 100 sequences, and distinct environmental features. We propose that our deep latent space may assist microbiome-engineering strategies when technical or financial resources are limited, through predicting current or future microbiome compositions. Availability and implementation Software, results and data are available at https://github.com/jorgemf/DeepLatentMicrobiome Supplementary information Supplementary data are available at Bioinformatics online.
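The two evaluation metrics quoted in this abstract, Pearson correlation and Bray–Curtis, can be sketched in a few lines. The abundance vectors below are invented for illustration, and the Bray–Curtis function uses the dissimilarity form (0 means identical profiles); whether the paper reports dissimilarity or similarity is not stated in the abstract, so that choice is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


# Hypothetical relative abundances of four taxa: measured vs. model output.
observed = [0.5, 0.3, 0.2, 0.0]
predicted = [0.4, 0.3, 0.2, 0.1]

r = pearson(observed, predicted)
bc = bray_curtis(observed, predicted)
```

The two metrics answer different questions: Pearson correlation rewards getting the relative ordering and shape of the profile right, while Bray–Curtis penalizes the absolute amount of abundance assigned to the wrong taxa.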
Affiliation(s)
- Beatriz García-Jiménez
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
- Jorge Muñoz
- Serendeepia Research, 28905 Getafe (Madrid), Spain
- Sara Cabello
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
- Joaquín Medina
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
- Mark D Wilkinson
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
- Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid (UPM), Madrid, Spain
24
Chen X, Liu L, Zhang W, Yang J, Wong KC. Human host status inference from temporal microbiome changes via recurrent neural networks. Brief Bioinform 2021; 22:6307015. [PMID: 34151933 DOI: 10.1093/bib/bbab223] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/24/2021] [Revised: 04/21/2021] [Accepted: 04/21/2021] [Indexed: 01/04/2023] Open
Abstract
With the rapid increase in sequencing data, human host status inference (e.g. healthy or sick) from microbiome data has become an important issue. Existing studies are mostly based on single-point microbiome composition, while it is rare that the host status is predicted from longitudinal microbiome data. However, single-point-based methods cannot capture the dynamic patterns between the temporal changes and host status. Therefore, it remains challenging to build good predictive models as well as scaling to different microbiome contexts. On the other hand, existing methods are mainly targeted for disease prediction and seldom investigate other host statuses. To fill the gap, we propose a comprehensive deep learning-based framework that utilizes longitudinal microbiome data as input to infer the human host status. Specifically, the framework is composed of specific data preparation strategies and a recurrent neural network tailored for longitudinal microbiome data. In experiments, we evaluated the proposed method on both semi-synthetic and real datasets based on different sequencing technologies and metagenomic contexts. The results indicate that our method achieves robust performance compared to other baseline and state-of-the-art classifiers and provides a significant reduction in prediction time.
Affiliation(s)
- Xingjian Chen
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Lingjing Liu
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Weitong Zhang
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Jianyi Yang
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
25
DiMucci D, Kon M, Segrè D. BowSaw: Inferring Higher-Order Trait Interactions Associated With Complex Biological Phenotypes. Front Mol Biosci 2021; 8:663532. [PMID: 34222331 PMCID: PMC8245782 DOI: 10.3389/fmolb.2021.663532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2021] [Accepted: 05/24/2021] [Indexed: 11/15/2022] Open
Abstract
Machine learning is helping the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes based on large datasets, e.g., from genomic, transcriptomic and metagenomic analyses. A number of available algorithms can help search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While, in many instances, treating an algorithm as a black box is sufficient, it is interesting to pursue an enhanced understanding of how system variables end up contributing to a specific output, as an avenue toward new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables (“rules”) frequently used for classification. We first apply BowSaw to a simulated dataset and show that the algorithm can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn’s disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses.
Affiliation(s)
- Demetrius DiMucci
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States
- Biological Design Center, Boston University, Boston, MA, United States
- Mark Kon
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States
- Department of Mathematics and Statistics, Boston University, Boston, MA, United States
- Daniel Segrè
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States
- Biological Design Center, Boston University, Boston, MA, United States
- Department of Biology, Boston University, Boston, MA, United States
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Department of Physics, Boston University, Boston, MA, United States
26
Liu YX, Qin Y, Chen T, Lu M, Qian X, Guo X, Bai Y. A practical guide to amplicon and metagenomic analysis of microbiome data. Protein Cell 2021; 12:315-330. [PMID: 32394199 PMCID: PMC8106563 DOI: 10.1007/s13238-020-00724-8] [Citation(s) in RCA: 346] [Impact Index Per Article: 115.3] [Received: 02/04/2020] [Accepted: 04/10/2020] [Indexed: 12/22/2022] Open
Abstract
Advances in high-throughput sequencing (HTS) have fostered rapid developments in the field of microbiome research, and massive microbiome datasets are now being generated. However, the diversity of software tools and the complexity of analysis pipelines make it difficult to access this field. Here, we systematically summarize the advantages and limitations of microbiome methods. Then, we recommend specific pipelines for amplicon and metagenomic analyses, and describe commonly-used software and databases, to help researchers select the appropriate tools. Furthermore, we introduce statistical and visualization methods suitable for microbiome analysis, including alpha- and beta-diversity, taxonomic composition, difference comparisons, correlation, networks, machine learning, evolution, source tracing, and common visualization styles to help researchers make informed choices. Finally, a step-by-step reproducible analysis guide is introduced. We hope this review will allow researchers to carry out data analysis more effectively and to quickly select the appropriate tools in order to efficiently mine the biological significance behind the data.
Affiliation(s)
- Yong-Xin Liu
- State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China
- CAS Center for Excellence in Biotic Interactions, University of Chinese Academy of Sciences, Beijing, 100049, China
- CAS-JIC Centre of Excellence for Plant and Microbial Science, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China
- Yuan Qin
- State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China
- CAS Center for Excellence in Biotic Interactions, University of Chinese Academy of Sciences, Beijing, 100049, China
- CAS-JIC Centre of Excellence for Plant and Microbial Science, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China
- College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
- Tong Chen
- National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, 100700, China
- Meiping Lu
- Department of Rheumatology Immunology & Allergy, Children's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province, 310053, China
- Xubo Qian
- Department of Rheumatology Immunology & Allergy, Children's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province, 310053, China
- Xiaoxuan Guo
- State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China
- CAS Center for Excellence in Biotic Interactions, University of Chinese Academy of Sciences, Beijing, 100049, China
- CAS-JIC Centre of Excellence for Plant and Microbial Science, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China
- Yang Bai
- State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China
- CAS Center for Excellence in Biotic Interactions, University of Chinese Academy of Sciences, Beijing, 100049, China
- CAS-JIC Centre of Excellence for Plant and Microbial Science, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China
- College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
27
Wu S, Chen Y, Li Z, Li J, Zhao F, Su X. Towards multi-label classification: Next step of machine learning for microbiome research. Comput Struct Biotechnol J 2021; 19:2742-2749. [PMID: 34093989 PMCID: PMC8131981 DOI: 10.1016/j.csbj.2021.04.054] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/24/2021] [Revised: 04/21/2021] [Accepted: 04/22/2021] [Indexed: 11/22/2022] Open
Abstract
Machine learning (ML) has been widely used in microbiome research for biomarker selection and disease prediction. By training on microbial profiles of samples from patients and healthy controls, ML classifiers construct models from community features that are highly correlated with the target diseases, so as to determine the status of new samples. To clearly understand the host-microbe interaction of specific diseases, previous studies have typically focused on well-designed cohorts, in which each sample was labeled with exactly one status type. In practice, however, an individual may be associated with multiple diseases simultaneously, which introduces additional variation in microbial patterns that interferes with status detection. More importantly, comorbidities or complications can be missed by regular ML models, limiting the practical application of microbiome techniques. In this review, we summarize the typical ML approaches of single-label classification for microbiome research, and demonstrate their limitations in multi-label disease detection using a real dataset. We then look ahead to a further step of ML towards multi-label classification that could potentially solve the aforementioned problem, including a series of promising strategies and key technical issues for applying multi-label classification in microbiome-based studies.
Affiliation(s)
- Shunyao Wu
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
- Yuzhu Chen
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
- Zhiruo Li
- School of Mathematics and Statistics, Qingdao University, Qingdao, Shandong 266071, China
- Jian Li
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
- Fengyang Zhao
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
- Xiaoquan Su
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
28
Zhang W, Chen X, Wong KC. Noninvasive early diagnosis of intestinal diseases based on artificial intelligence in genomics and microbiome. J Gastroenterol Hepatol 2021; 36:823-831. [PMID: 33880763 DOI: 10.1111/jgh.15500] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 01/10/2021] [Revised: 03/15/2021] [Accepted: 03/17/2021] [Indexed: 12/15/2022]
Abstract
The maturing development in artificial intelligence (AI) and genomics has propelled the advances in intestinal diseases including intestinal cancer, inflammatory bowel disease (IBD), and irritable bowel syndrome (IBS). On the other hand, colorectal cancer is the second most deadly and the third most common type of cancer in the world according to GLOBOCAN 2020 data. The mechanisms behind IBD and IBS are still speculative. The conventional methods to identify colorectal cancer, IBD, and IBS are based on endoscopy or colonoscopy to identify lesions. However, it is invasive, demanding, and time-consuming for early-stage intestinal diseases. To address those problems, new strategies based on blood and/or human microbiome in gut, colon, or even feces were developed; those methods took advantage of high-throughput sequencing and machine learning approaches. In this review, we summarize the recent research and methods to diagnose intestinal diseases with machine learning technologies based on cell-free DNA and microbiome data generated by amplicon sequencing or whole-genome sequencing. Those methods play an important role in not only intestinal disease diagnosis but also therapy development in the near future.
Affiliation(s)
- Weitong Zhang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Xingjian Chen
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Hong Kong Institute for Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
29
Shestopaloff K, Dong M, Gao F, Xu W. DCMD: Distance-based classification using mixture distributions on microbiome data. PLoS Comput Biol 2021; 17:e1008799. [PMID: 33711013 PMCID: PMC7990174 DOI: 10.1371/journal.pcbi.1008799] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/10/2020] [Revised: 03/24/2021] [Accepted: 02/15/2021] [Indexed: 11/21/2022] Open
Abstract
Current advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human diseases, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, microbiome data structure, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. This approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on observed counts and the estimated mixture, which are then used as inputs for distance-based classification. The method is implemented into a k-means classification and k-nearest neighbours framework. We develop two distance metrics that produce optimal results. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive when compared to the other machine learning approaches, and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. The range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use. 
The uneven performance of conventional distance-based classifiers when using microbiome profiles to predict disease status has motivated us to develop a novel distance-based method that accounts for uncertainty when modeling sparse counts. We propose a classification algorithm that uses mixture distributions to measure normed distances between microbiome distributions, which better models the underlying structure by handling excess zeros and sparsity inherent in microbial abundance counts. Applications of DCMD have shown improved classification performance and robustness, making the proposed method an improved alternative for classification using microbiome data.
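For context, the kind of conventional distance-based classifier that DCMD is compared against can be sketched as a k-nearest-neighbours vote over raw counts. This is a baseline illustration only, not DCMD itself: it uses Bray–Curtis dissimilarity directly on the observed counts and does not estimate any mixture distribution. The count table, the labels and the function names are hypothetical.

```python
from collections import Counter


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two count vectors (0 = identical)."""
    total = sum(x) + sum(y)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return sum(abs(a - b) for a, b in zip(x, y)) / total


def knn_predict(train, labels, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    ranked = sorted(range(len(train)), key=lambda i: bray_curtis(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical sparse count table: rows are samples, columns are taxa.
train = [
    [30, 0, 2, 0],
    [25, 1, 0, 0],
    [28, 0, 0, 1],
    [0, 12, 0, 9],
    [1, 15, 0, 7],
    [0, 10, 1, 11],
]
labels = ["control", "control", "control", "case", "case", "case"]

prediction_a = knn_predict(train, labels, [27, 0, 1, 0])
prediction_b = knn_predict(train, labels, [0, 14, 0, 8])
```

Because this baseline treats each observed zero as an exact value, small counts and excess zeros can dominate the distances, which is the weakness the abstract says the mixture-distribution representation is meant to address.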
Affiliation(s)
- Mei Dong
- Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Fan Gao
- Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Wei Xu
- Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
30
Marcos-Zambrano LJ, Karaduzovic-Hadziabdic K, Loncar Turukalo T, Przymus P, Trajkovik V, Aasmets O, Berland M, Gruca A, Hasic J, Hron K, Klammsteiner T, Kolev M, Lahti L, Lopes MB, Moreno V, Naskinova I, Org E, Paciência I, Papoutsoglou G, Shigdel R, Stres B, Vilne B, Yousef M, Zdravevski E, Tsamardinos I, Carrillo de Santa Pau E, Claesson MJ, Moreno-Indias I, Truu J. Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment. Front Microbiol 2021; 12:634511. [PMID: 33737920 PMCID: PMC7962872 DOI: 10.3389/fmicb.2021.634511] [Citation(s) in RCA: 126] [Impact Index Per Article: 42.0] [Received: 11/27/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022] Open
Abstract
The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.
Affiliation(s)
- Laura Judith Marcos-Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Piotr Przymus
- Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Vladimir Trajkovik
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Oliver Aasmets
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Magali Berland
- Université Paris-Saclay, INRAE, MGP, Jouy-en-Josas, France
- Aleksandra Gruca
- Department of Computer Networks and Systems, Silesian University of Technology, Gliwice, Poland
- Jasminka Hasic
- University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Karel Hron
- Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
- Mikhail Kolev
- South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
- Marta B. Lopes
- NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
- Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
| | - Victor Moreno
- Oncology Data Analytics Program, Catalan Institute of Oncology (ICO)Barcelona, Spain
- Colorectal Cancer Group, Institut de Recerca Biomedica de Bellvitge (IDIBELL), Barcelona, Spain
- Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
- Department of Clinical Sciences, Faculty of Medicine, University of Barcelona, Barcelona, Spain
| | - Irina Naskinova
- South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
| | - Elin Org
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
| | - Inês Paciência
- EPIUnit – Instituto de Saúde Pública da Universidade do Porto, Porto, Portugal
| | | | - Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Blaz Stres
- Group for Microbiology and Microbial Biotechnology, Department of Animal Science, University of Ljubljana, Ljubljana, Slovenia
| | - Baiba Vilne
- Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
| | - Eftim Zdravevski
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
| | | | | | - Marcus J. Claesson
- School of Microbiology & APC Microbiome Ireland, University College Cork, Cork, Ireland
| | - Isabel Moreno-Indias
- Unidad de Gestión Clínica de Endocrinología y Nutrición, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
- Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
| | - Jaak Truu
- Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
| |
Collapse
|
31
|
Ghannam RB, Techtmann SM. Machine learning applications in microbial ecology, human microbiome studies, and environmental monitoring. Comput Struct Biotechnol J 2021; 19:1092-1107. [PMID: 33680353 PMCID: PMC7892807 DOI: 10.1016/j.csbj.2021.01.028] [Citation(s) in RCA: 89] [Impact Index Per Article: 29.7] [Received: 10/02/2020] [Revised: 01/16/2021] [Accepted: 01/18/2021] [Indexed: 01/04/2023]
Abstract
Advances in nucleic acid sequencing technology have enabled expansion of our ability to profile microbial diversity. These large datasets of taxonomic and functional diversity are key to better understanding microbial ecology. Machine learning has proven to be a useful approach for analyzing microbial community data and making predictions about outcomes including human and environmental health. Machine learning applied to microbial community profiles has been used to predict disease states in human health, environmental quality and presence of contamination in the environment, and as trace evidence in forensics. Machine learning has appeal as a powerful tool that can provide deep insights into microbial communities and identify patterns in microbial community data. However, often machine learning models can be used as black boxes to predict a specific outcome, with little understanding of how the models arrived at predictions. Complex machine learning algorithms often may value higher accuracy and performance at the sacrifice of interpretability. In order to leverage machine learning into more translational research related to the microbiome and strengthen our ability to extract meaningful biological information, it is important for models to be interpretable. Here we review current trends in machine learning applications in microbial ecology as well as some of the important challenges and opportunities for more broad application of machine learning to understanding microbial communities.
Key Words
- 16S rRNA
- ANN, Artificial Neural Networks
- ASV, Amplicon Sequence Variant
- AUC, Area Under the Curve
- Forensics
- GB, Gradient Boosting
- ML, Machine Learning
- Machine learning
- Marker genes
- Metagenomics
- PCoA, Principal Coordinate Analysis
- RF, Random Forests
- ROC, Receiver Operating Characteristic
- SML, Supervised Machine Learning
- SVM, Support Vector Machines
- USML, Unsupervised Machine Learning
- tSNE, t-distributed Stochastic Neighbor Embedding
Affiliation(s)
- Ryan B. Ghannam
  - Department of Biological Sciences, Michigan Technological University, Houghton, MI, United States
- Stephen M. Techtmann
  - Department of Biological Sciences, Michigan Technological University, Houghton, MI, United States
|
32
|
Reiman D, Farhat AM, Dai Y. Predicting Host Phenotype Based on Gut Microbiome Using a Convolutional Neural Network Approach. Methods Mol Biol 2021; 2190:249-266. [PMID: 32804370 DOI: 10.1007/978-1-0716-0826-5_12] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 06/11/2023]
Abstract
Accurate prediction of the host phenotypes from a microbial sample and identification of the associated microbial markers are important in understanding the impact of the microbiome on the pathogenesis and progression of various diseases within the host. A deep learning tool, PopPhy-CNN, has been developed for the task of predicting host phenotypes using a convolutional neural network (CNN). By representing samples as annotated taxonomic trees and further representing these trees as matrices, PopPhy-CNN utilizes the CNN's innate ability to explore locally similar microbes on the taxonomic tree. Furthermore, PopPhy-CNN can be used to evaluate the importance of each taxon in the prediction of host status. Here, we describe the underlying methodology, architecture, and core utility of PopPhy-CNN. We also demonstrate the use of PopPhy-CNN on a microbial dataset.
Affiliation(s)
- Derek Reiman
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
- Ali M Farhat
  - College of Medicine, University of Illinois at Chicago, Chicago, IL, USA
- Yang Dai
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
|
33
|
Ghosh A, Firdous S, Saha S. Bioinformatics for Human Microbiome. Adv Bioinformatics 2021. [DOI: 10.1007/978-981-33-6191-1_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
34
|
Bokulich NA, Ziemski M, Robeson MS, Kaehler BD. Measuring the microbiome: Best practices for developing and benchmarking microbiomics methods. Comput Struct Biotechnol J 2020; 18:4048-4062. [PMID: 33363701 PMCID: PMC7744638 DOI: 10.1016/j.csbj.2020.11.049] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 09/25/2020] [Revised: 11/27/2020] [Accepted: 11/28/2020] [Indexed: 12/12/2022]
Abstract
Microbiomes are integral components of diverse ecosystems, and increasingly recognized for their roles in the health of humans, animals, plants, and other hosts. Given their complexity (both in composition and function), the effective study of microbiomes (microbiomics) relies on the development, optimization, and validation of computational methods for analyzing microbial datasets, such as from marker-gene (e.g., 16S rRNA gene) and metagenome data. This review describes best practices for benchmarking and implementing computational methods (and software) for studying microbiomes, with particular focus on unique characteristics of microbiomes and microbiomics data that should be taken into account when designing and testing microbiomics methods.
Affiliation(s)
- Nicholas A. Bokulich
  - Laboratory of Food Systems Biotechnology, Institute of Food, Nutrition, and Health, ETH Zurich, Switzerland
- Michal Ziemski
  - Laboratory of Food Systems Biotechnology, Institute of Food, Nutrition, and Health, ETH Zurich, Switzerland
- Michael S. Robeson
  - University of Arkansas for Medical Sciences, Department of Biomedical Informatics, Little Rock, AR, USA
|
35
|
Abstract
This article aims to provide a thorough overview of the use of Artificial Intelligence (AI) techniques in studying the gut microbiota and its role in the diagnosis and treatment of some important diseases. The association between microbiota and diseases, together with its clinical relevance, is still difficult to interpret. The advances in AI techniques, such as Machine Learning (ML) and Deep Learning (DL), can help clinicians in processing and interpreting these massive data sets. Two research groups have been involved in this Scoping Review, working in two different areas of Europe: Florence and Sarajevo. The papers included in the review describe the use of ML or DL methods applied to the study of human gut microbiota. In total, 1109 papers were considered in this study. After elimination, a final set of 16 articles was considered in the scoping review. Different AI techniques were applied in the reviewed papers. Eleven papers evaluated only ML algorithms (ranging from one to eight algorithms applied to one dataset). The remaining five papers examined both ML and DL algorithms. The most applied ML algorithm was Random Forest, which also exhibited the best performance.
|
36
|
De Filippis F, Pasolli E, Ercolini D. The food-gut axis: lactic acid bacteria and their link to food, the gut microbiome and human health. FEMS Microbiol Rev 2020; 44:454-489. [PMID: 32556166 PMCID: PMC7391071 DOI: 10.1093/femsre/fuaa015] [Citation(s) in RCA: 115] [Impact Index Per Article: 28.8] [Received: 01/23/2020] [Accepted: 05/20/2020] [Indexed: 12/18/2022]
Abstract
Lactic acid bacteria (LAB) are present in foods, the environment and the animal gut, although fermented foods (FFs) are recognized as the primary niche of LAB activity. Several LAB strains have been studied for their health-promoting properties and are employed as probiotics. FFs are recognized for their potential beneficial effects, which we review in this article. They are also an important source of LAB, which are ingested daily upon FF consumption. In this review, we describe the diversity of LAB and their occurrence in food as well as the gut microbiome. We discuss the opportunities to study LAB diversity and functional properties by considering the availability of both genomic and metagenomic data in public repositories, as well as the different latest computational tools for data analysis. In addition, we discuss the role of LAB as potential probiotics by reporting the prevalence of key genomic features in public genomes and by surveying the outcomes of LAB use in clinical trials involving human subjects. Finally, we highlight the need for further studies aimed at improving our knowledge of the link between LAB-fermented foods and the human gut from the perspective of health promotion.
Affiliation(s)
- Francesca De Filippis
  - Department of Agricultural Sciences, University of Naples Federico II, via Università, 100, 80055, Portici (NA), Italy
  - Task Force on Microbiome Studies, Corso Umberto I, 40, 80100, Napoli, Italy
- Edoardo Pasolli
  - Department of Agricultural Sciences, University of Naples Federico II, via Università, 100, 80055, Portici (NA), Italy
  - Task Force on Microbiome Studies, Corso Umberto I, 40, 80100, Napoli, Italy
- Danilo Ercolini
  - Department of Agricultural Sciences, University of Naples Federico II, via Università, 100, 80055, Portici (NA), Italy
  - Task Force on Microbiome Studies, Corso Umberto I, 40, 80100, Napoli, Italy
|
37
|
Topçuoğlu BD, Lesniak NA, Ruffin MT, Wiens J, Schloss PD. A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems. mBio 2020; 11:e00434-20. [PMID: 32518182 PMCID: PMC7373189 DOI: 10.1128/mbio.00434-20] [Citation(s) in RCA: 75] [Impact Index Per Article: 18.8] [Received: 02/24/2020] [Accepted: 05/06/2020] [Indexed: 12/12/2022]
Abstract
Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, many researchers appear to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.
IMPORTANCE: Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely overoptimistic. Moreover, there is a trend toward using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step toward developing more-reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.
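The abstract above reports performance as an AUROC with an interquartile range. As a rough sketch of what those two quantities mean, the Python snippet below computes AUROC as the fraction of case-control pairs in which the case receives the higher score, then summarizes AUROC values from several data splits by their median and quartiles. The function names and all numbers are hypothetical illustrations; this is not the authors' pipeline and does not reproduce their results.

```python
import statistics


def auroc(labels, scores):
    """AUROC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def median_and_iqr(values):
    """Return (median, first quartile, third quartile) of a list of values."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q1, q3


# Made-up predictions for four samples (1 = case, 0 = control).
example_auc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Made-up AUROC values, as if obtained from five repeated train/test splits.
split_aucs = [0.62, 0.65, 0.68, 0.70, 0.74]
median_auc, q1_auc, q3_auc = median_and_iqr(split_aucs)
```

Reporting the spread across repeated splits, rather than a single AUROC, is one way to make an overoptimistic estimate easier to spot.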
Affiliation(s)
- Begüm D Topçuoğlu
  - Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
- Nicholas A Lesniak
  - Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
- Mack T Ruffin
  - Department of Family Medicine and Community Medicine, Penn State Hershey Medical Center, Hershey, Pennsylvania, USA
- Jenna Wiens
  - Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA
- Patrick D Schloss
  - Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
|
38
|
Vangay P, Hillmann BM, Knights D. Microbiome Learning Repo (ML Repo): A public repository of microbiome regression and classification tasks. Gigascience 2019; 8:giz042. [PMID: 31042284 PMCID: PMC6493971 DOI: 10.1093/gigascience/giz042] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 08/21/2018] [Revised: 02/24/2019] [Accepted: 03/26/2019] [Indexed: 01/05/2023]
Abstract
The use of machine learning in high-dimensional biological applications, such as the human microbiome, has grown exponentially in recent years, but algorithm developers often lack the domain expertise required for interpretation and curation of the heterogeneous microbiome datasets. We present Microbiome Learning Repo (ML Repo, available at https://knights-lab.github.io/MLRepo/), a public, web-based repository of 33 curated classification and regression tasks from 15 published human microbiome datasets. We highlight the use of ML Repo in several use cases to demonstrate its wide application, and we expect it to be an important resource for algorithm developers.
Affiliation(s)
- Pajau Vangay
  - Bioinformatics and Computational Biology, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455
- Benjamin M Hillmann
  - Department of Computer Science and Engineering, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455
- Dan Knights
  - Bioinformatics and Computational Biology, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455
  - Department of Computer Science and Engineering, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455
|