1
Ibrahimi E, Lopes MB, Dhamo X, Simeon A, Shigdel R, Hron K, Stres B, D’Elia D, Berland M, Marcos-Zambrano LJ. Overview of data preprocessing for machine learning applications in human microbiome research. Front Microbiol 2023; 14:1250909. [PMID: 37869650; PMCID: PMC10588656; DOI: 10.3389/fmicb.2023.1250909] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Accepted: 09/22/2023] [Indexed: 10/24/2023] [Open Access]
Abstract
Although metagenomic sequencing is now the preferred technique to study microbiome-host interactions, analyzing and interpreting microbiome sequencing data presents challenges primarily attributed to the statistical specificities of the data (e.g., sparse, over-dispersed, compositional, inter-variable dependency). This mini review explores preprocessing and transformation methods applied in recent human microbiome studies to address microbiome data analysis challenges. Our results indicate a limited adoption of transformation methods targeting the statistical characteristics of microbiome sequencing data. Instead, there is a prevalent usage of relative and normalization-based transformations that do not specifically account for the specific attributes of microbiome data. The information on preprocessing and transformations applied to the data before analysis was incomplete or missing in many publications, leading to reproducibility concerns, comparability issues, and questionable results. We hope this mini review will provide researchers and newcomers to the field of human microbiome research with an up-to-date point of reference for various data transformation tools and assist them in choosing the most suitable transformation method based on their research questions, objectives, and data characteristics.
Affiliation(s)
- Eliana Ibrahimi
  - Department of Biology, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Marta B. Lopes
  - Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
  - UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
- Xhilda Dhamo
  - Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Andrea Simeon
  - BioSense Institute, University of Novi Sad, Novi Sad, Serbia
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Karel Hron
  - Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacký University Olomouc, Olomouc, Czechia
- Blaž Stres
  - Department of Catalysis and Chemical Reaction Engineering, National Institute of Chemistry, Ljubljana, Slovenia
  - Faculty of Civil and Geodetic Engineering, Institute of Sanitary Engineering, Ljubljana, Slovenia
  - Department of Automation, Biocybernetics and Robotics, Jožef Stefan Institute, Ljubljana, Slovenia
  - Department of Animal Science, Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
- Domenica D’Elia
  - Department of Biomedical Sciences, National Research Council, Institute for Biomedical Technologies, Bari, Italy
- Magali Berland
  - INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
2
Marcos-Zambrano LJ, Karaduzovic-Hadziabdic K, Loncar Turukalo T, Przymus P, Trajkovik V, Aasmets O, Berland M, Gruca A, Hasic J, Hron K, Klammsteiner T, Kolev M, Lahti L, Lopes MB, Moreno V, Naskinova I, Org E, Paciência I, Papoutsoglou G, Shigdel R, Stres B, Vilne B, Yousef M, Zdravevski E, Tsamardinos I, Carrillo de Santa Pau E, Claesson MJ, Moreno-Indias I, Truu J. Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment. Front Microbiol 2021; 12:634511. [PMID: 33737920; PMCID: PMC7962872; DOI: 10.3389/fmicb.2021.634511] [Citation(s) in RCA: 126] [Impact Index Per Article: 42.0] [Received: 11/27/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022] [Open Access]
Abstract
The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.
Affiliation(s)
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Piotr Przymus
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Vladimir Trajkovik
  - Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Oliver Aasmets
  - Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
  - Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Magali Berland
  - Université Paris-Saclay, INRAE, MGP, Jouy-en-Josas, France
- Aleksandra Gruca
  - Department of Computer Networks and Systems, Silesian University of Technology, Gliwice, Poland
- Jasminka Hasic
  - University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Karel Hron
  - Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
- Mikhail Kolev
  - South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Leo Lahti
  - Department of Computing, University of Turku, Turku, Finland
- Marta B. Lopes
  - NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
  - Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
- Victor Moreno
  - Oncology Data Analytics Program, Catalan Institute of Oncology (ICO), Barcelona, Spain
  - Colorectal Cancer Group, Institut de Recerca Biomedica de Bellvitge (IDIBELL), Barcelona, Spain
  - Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
  - Department of Clinical Sciences, Faculty of Medicine, University of Barcelona, Barcelona, Spain
- Irina Naskinova
  - South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Elin Org
  - Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Inês Paciência
  - EPIUnit – Instituto de Saúde Pública da Universidade do Porto, Porto, Portugal
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Blaz Stres
  - Group for Microbiology and Microbial Biotechnology, Department of Animal Science, University of Ljubljana, Ljubljana, Slovenia
- Baiba Vilne
  - Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
- Malik Yousef
  - Department of Information Systems, Zefat Academic College, Zefat, Israel
  - Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
- Eftim Zdravevski
  - Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Marcus J. Claesson
  - School of Microbiology & APC Microbiome Ireland, University College Cork, Cork, Ireland
- Isabel Moreno-Indias
  - Unidad de Gestión Clínica de Endocrinología y Nutrición, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
  - Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
- Jaak Truu
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
3
Dobbs JT, Kim MS, Dudley NS, Klopfenstein NB, Yeh A, Hauff RD, Jones TC, Dumroese RK, Cannon PG, Stewart JE. Whole genome analysis of the koa wilt pathogen (Fusarium oxysporum f. sp. koae) and the development of molecular tools for early detection and monitoring. BMC Genomics 2020; 21:764. [PMID: 33148175; PMCID: PMC7640661; DOI: 10.1186/s12864-020-07156-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/19/2020] [Accepted: 10/15/2020] [Indexed: 12/20/2022] [Open Access]
Abstract
BACKGROUND: Development and application of DNA-based methods to distinguish highly virulent isolates of Fusarium oxysporum f. sp. koae [Fo koae; cause of koa wilt disease on Acacia koa (koa)] will help disease management through early detection, enhanced monitoring, and improved disease resistance-breeding programs.
RESULTS: This study presents whole genome analyses of one highly virulent Fo koae isolate and one non-pathogenic F. oxysporum (Fo) isolate. These analyses allowed for the identification of putative lineage-specific DNA and predicted genes necessary for disease development on koa. Using putative chromosomes and predicted gene comparisons, Fo koae-exclusive, virulence genes were identified. The putative lineage-specific DNA included identified genes encoding products secreted in xylem (e.g., SIX1 and SIX6) that may be necessary for disease development on koa. Unique genes from Fo koae were used to develop pathogen-specific PCR primers. These diagnostic primers allowed target amplification in the characterized highly virulent Fo koae isolates but did not allow product amplification in low-virulence or non-pathogenic isolates of Fo. Thus, primers developed in this study will be useful for early detection and monitoring of highly virulent strains of Fo koae. Isolate verification is also important for disease resistance-breeding programs that require a diverse set of highly virulent Fo koae isolates for their disease-screening assays to develop disease-resistant koa.
CONCLUSIONS: These results provide the framework for understanding the pathogen genes necessary for koa wilt disease and the genetic variation of Fo koae populations across the Hawaiian Islands.
Affiliation(s)
- John T. Dobbs
  - Colorado State University, Department of Agricultural Biology, 1177 Campus Delivery, Fort Collins, CO 80523 USA
- Mee-Sook Kim
  - USDA Forest Service, Pacific Northwest Research Station, 3200 SW Jefferson Way, Corvallis, OR 97331 USA
- Nicklos S. Dudley
  - Hawai‘i Agriculture Research Center, Maunawili Research Station, Oahu, HI USA
- Ned B. Klopfenstein
  - USDA Forest Service, Rocky Mountain Research Station, 1221 South Main Street, Moscow, ID 83843 USA
- Aileen Yeh
  - Hawai‘i Agriculture Research Center, Maunawili Research Station, Oahu, HI USA
- Robert D. Hauff
  - Division of Forestry and Wildlife, Department of Land and Natural Resources, 1151 Punchbowl Street, Room 325, Honolulu, HI 96813 USA
- Tyler C. Jones
  - Hawai‘i Agriculture Research Center, Maunawili Research Station, Oahu, HI USA
- R. Kasten Dumroese
  - USDA Forest Service, Rocky Mountain Research Station, 1221 South Main Street, Moscow, ID 83843 USA
- Philip G. Cannon
  - USDA Forest Service, Forest Health Protection, 1323 Club Drive, Vallejo, CA 94592 USA
- Jane E. Stewart
  - Colorado State University, Department of Agricultural Biology, 1177 Campus Delivery, Fort Collins, CO 80523 USA
4
Krüger A, Schäfers C, Busch P, Antranikian G. Digitalization in microbiology - Paving the path to sustainable circular bioeconomy. N Biotechnol 2020; 59:88-96. [PMID: 32750680; DOI: 10.1016/j.nbt.2020.06.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 01/30/2020] [Revised: 06/26/2020] [Accepted: 06/27/2020] [Indexed: 11/17/2022]
Abstract
The transition to a sustainable bio-based circular economy requires cutting edge technologies that ensure economic growth with environmentally responsible action. This transition will only be feasible when the opportunities of digitalization are also exploited. Digital methods and big data handling have already found their way into life sciences and generally offer huge potential in various research areas. While computational analyses of microbial metagenome data have become state of the art, the true potential of bioinformatics remains mostly untapped so far. In this article we present challenges and opportunities of digitalization including multi-omics approaches in discovering and exploiting the microbial diversity of the planet with the aim to identify robust biocatalysts for application in sustainable bioprocesses as part of the transition from a fossil-based to a bio-based circular economy. This will contribute to solving global challenges, including utilization of natural resources, food supply, health, energy and the environment.
Affiliation(s)
- Anna Krüger
  - Institute of Technical Microbiology, Hamburg University of Technology (TUHH), Kasernenstr. 12, D-21073 Hamburg, Germany.
- Christian Schäfers
  - Institute of Technical Microbiology, Hamburg University of Technology (TUHH), Kasernenstr. 12, D-21073 Hamburg, Germany.
- Philip Busch
  - Institute of Technical Microbiology, Hamburg University of Technology (TUHH), Kasernenstr. 12, D-21073 Hamburg, Germany.
- Garabed Antranikian
  - Institute of Technical Microbiology, Hamburg University of Technology (TUHH), Kasernenstr. 12, D-21073 Hamburg, Germany.
5
Blanco-Míguez A, Fdez-Riverola F, Sánchez B, Lourenço A. Resources and tools for the high-throughput, multi-omic study of intestinal microbiota. Brief Bioinform 2020; 20:1032-1056. [PMID: 29186315; DOI: 10.1093/bib/bbx156] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 07/26/2017] [Revised: 10/23/2017] [Indexed: 12/18/2022] [Open Access]
Abstract
The human gut microbiome impacts several aspects of human health and disease, including digestion, drug metabolism and the propensity to develop various inflammatory, autoimmune and metabolic diseases. Many of the molecular processes that play a role in the activity and dynamics of the microbiota go beyond species and genic composition and thus, their understanding requires advanced bioinformatics support. This article aims to provide an up-to-date view of the resources and software tools that are being developed and used in human gut microbiome research, in particular data integration and systems-level analysis efforts. These efforts demonstrate the power of standardized and reproducible computational workflows for integrating and analysing varied omics data and gaining deeper insights into microbe community structure and function as well as host-microbe interactions.
Affiliation(s)
- Anália Lourenço
  - Dpto. de Informática - Universidade de Vigo, ESEI - Escuela Superior de Ingeniería Informática, Edificio politécnico, Campus Universitario As Lagoas s/n, 32004 Ourense, Spain