1
Moreno-Indias I, Lahti L, Nedyalkova M, Elbere I, Roshchupkin G, Adilovic M, Aydemir O, Bakir-Gungor B, Santa Pau ECD, D’Elia D, Desai MS, Falquet L, Gundogdu A, Hron K, Klammsteiner T, Lopes MB, Marcos-Zambrano LJ, Marques C, Mason M, May P, Pašić L, Pio G, Pongor S, Promponas VJ, Przymus P, Saez-Rodriguez J, Sampri A, Shigdel R, Stres B, Suharoschi R, Truu J, Truică CO, Vilne B, Vlachakis D, Yilmaz E, Zeller G, Zomer AL, Gómez-Cabrero D, Claesson MJ. Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions. Front Microbiol 2021; 12:635781. [PMID: 33692771 PMCID: PMC7937616 DOI: 10.3389/fmicb.2021.635781] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Received: 11/30/2020] [Accepted: 01/28/2021] [Indexed: 12/23/2022]
Abstract
The human microbiome has emerged as a central research topic in human biology and biomedicine. Current microbiome studies generate high-throughput omics data across different body sites, populations, and life stages. While many of the challenges in microbiome research are similar to those of other high-throughput studies, the quantitative analyses need to address the heterogeneity of data, specific statistical properties, and the remarkable variation in microbiome composition across individuals and body sites. This has led to a broad spectrum of statistical and machine learning challenges that range from study design, data processing, and standardization to analysis, modeling, cross-study comparison, prediction, data science ecosystems, and reproducible reporting. Nevertheless, although many statistical and machine learning approaches and tools have been developed, new techniques are needed to deal with emerging applications and the vast heterogeneity of microbiome data. We review and discuss emerging applications of statistical and machine learning techniques in human microbiome studies and introduce the COST Action CA18131 "ML4Microbiome" that brings together microbiome researchers and machine learning experts to address current challenges such as standardization of analysis pipelines for reproducibility of data analysis results, benchmarking, improvement, or development of existing and new tools and ontologies.
Affiliation(s)
- Isabel Moreno-Indias
  - Instituto de Investigación Biomédica de Málaga (IBIMA), Unidad de Gestión Clínica de Endocrinología y Nutrición, Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
  - Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
- Leo Lahti
  - Department of Computing, University of Turku, Turku, Finland
- Miroslava Nedyalkova
  - Human Genetics and Disease Mechanisms, Latvian Biomedical Research and Study Centre, Riga, Latvia
- Ilze Elbere
  - Latvian Biomedical Research and Study Centre, Riga, Latvia
- Muhamed Adilovic
  - Department of Genetics and Bioengineering, International University of Sarajevo, Sarajevo, Bosnia and Herzegovina
- Onder Aydemir
  - Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Turkey
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Domenica D’Elia
  - Department for Biomedical Sciences, Institute for Biomedical Technologies, National Research Council, Bari, Italy
- Mahesh S. Desai
  - Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
  - Odense Research Center for Anaphylaxis, Department of Dermatology and Allergy Center, Odense University Hospital, University of Southern Denmark, Odense, Denmark
- Laurent Falquet
  - Department of Biology, University of Fribourg, Fribourg, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Aycan Gundogdu
  - Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Turkey
  - Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Turkey
- Karel Hron
  - Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
- Marta B. Lopes
  - NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
  - Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Cláudia Marques
  - CINTESIS, NOVA Medical School, NMS, Universidade Nova de Lisboa, Lisbon, Portugal
- Michael Mason
  - Computational Oncology, Sage Bionetworks, Seattle, WA, United States
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Lejla Pašić
  - Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Gianvito Pio
  - Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Sándor Pongor
  - Faculty of Information Technology and Bionics, Pázmány University, Budapest, Hungary
- Vasilis J. Promponas
  - Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus
- Piotr Przymus
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Julio Saez-Rodriguez
  - Institute of Computational Biomedicine, Heidelberg University, Faculty of Medicine and Heidelberg University Hospital, Heidelberg, Germany
- Alexia Sampri
  - Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Blaz Stres
  - Jozef Stefan Institute, Ljubljana, Slovenia
  - Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
  - Faculty of Civil and Geodetic Engineering, University of Ljubljana, Ljubljana, Slovenia
- Ramona Suharoschi
  - Molecular Nutrition and Proteomics Lab, Faculty of the Food Science and Technology, Institute of Life Sciences, University of Agricultural Sciences and Veterinary Medicine of Cluj-Napoca, Cluj-Napoca, Romania
- Jaak Truu
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Ciprian-Octavian Truică
  - Department of Computer Science and Engineering, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Bucharest, Romania
- Baiba Vilne
  - Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
- Dimitrios Vlachakis
  - Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
- Ercument Yilmaz
  - Department of Computer Technologies, Karadeniz Technical University, Trabzon, Turkey
- Georg Zeller
  - European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg, Germany
- Aldert L. Zomer
  - Department of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, Netherlands
- David Gómez-Cabrero
  - Navarrabiomed, Complejo Hospitalario de Navarra (CHN), IdiSNA, Universidad Pública de Navarra (UPNA), Pamplona, Spain
- Marcus J. Claesson
  - School of Microbiology and APC Microbiome Ireland, University College Cork, Cork, Ireland
2
Willis AD, Martin BD. Estimating diversity in networked ecological communities. Biostatistics 2020; 23:207-222. [PMID: 32432696 PMCID: PMC8759443 DOI: 10.1093/biostatistics/kxaa015] [Citation(s) in RCA: 59] [Impact Index Per Article: 14.8] [Received: 07/12/2019] [Revised: 03/04/2020] [Accepted: 03/08/2020] [Indexed: 01/09/2023]
Abstract
Comparing ecological communities across environmental gradients can be challenging, especially when the number of different taxonomic groups in the communities is large. In this setting, community-level summaries called diversity indices are widely used to detect changes in the community ecology. However, estimation of diversity indices has received relatively little attention from the statistical community. The most common estimates of diversity are the maximum likelihood estimates of the parameters of a multinomial model, even though the multinomial model implies strict assumptions about the sampling mechanism. In particular, the multinomial model prohibits ecological networks, where taxa positively and negatively co-occur. In this article, we leverage models from the compositional data literature that explicitly account for co-occurrence networks and use them to estimate diversity. Instead of proposing new diversity indices, we estimate popular diversity indices under these models. While the methodology is general, we illustrate the approach for the estimation of the Shannon, Simpson, Bray–Curtis, and Euclidean diversity indices. We contrast our method to multinomial, low-rank, and nonparametric methods for estimating diversity indices. Under simulation, we find that the greatest gains of the method are in strongly networked communities with many taxa. Therefore, to illustrate the method, we analyze the microbiome of seafloor basalts based on a 16S amplicon sequencing dataset with 1425 taxa and 12 communities.
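For orientation, the multinomial baseline that the abstract contrasts against can be sketched in a few lines of Python. This is only the naive plug-in estimate, where observed proportions stand in for the true relative abundances; it is not the authors' network-aware method. The function names are hypothetical, and the conventions chosen here (natural-log Shannon, Simpson as the sum of squared proportions, Bray-Curtis and Euclidean computed on relative abundances) are assumptions, since the abstract does not spell them out.

```python
import math


def relative_abundances(counts):
    """Multinomial maximum likelihood estimate: observed proportions."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def shannon(counts):
    """Plug-in Shannon index, -sum(p * ln p), skipping zero proportions."""
    p = relative_abundances(counts)
    return -sum(x * math.log(x) for x in p if x > 0)


def simpson(counts):
    """Plug-in Simpson index as sum(p^2); some texts report 1 minus this."""
    p = relative_abundances(counts)
    return sum(x * x for x in p)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two communities on proportions."""
    pa = relative_abundances(counts_a)
    pb = relative_abundances(counts_b)
    # With both vectors summing to 1, this equals 1 - sum of elementwise minima.
    return 1.0 - sum(min(a, b) for a, b in zip(pa, pb))


def euclidean(counts_a, counts_b):
    """Euclidean distance between the two proportion vectors."""
    pa = relative_abundances(counts_a)
    pb = relative_abundances(counts_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pa, pb)))


if __name__ == "__main__":
    # Made-up toy counts for two communities over the same four taxa.
    community_a = [40, 30, 20, 10]
    community_b = [10, 20, 30, 40]
    print("Shannon A:", shannon(community_a))
    print("Simpson A:", simpson(community_a))
    print("Bray-Curtis A vs B:", bray_curtis(community_a, community_b))
    print("Euclidean A vs B:", euclidean(community_a, community_b))
```

Because each estimate depends only on the observed proportions of one sample, nothing in this baseline models co-occurrence between taxa, which is the restriction of the multinomial model that the abstract points out.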
Affiliation(s)
- Amy D Willis
  - Department of Biostatistics and Department of Statistics, University of Washington, Health Sciences Building, 1959 NE Pacific St, Seattle WA 98195, USA
- Bryan D Martin
  - Department of Biostatistics and Department of Statistics, University of Washington, Health Sciences Building, 1959 NE Pacific St, Seattle WA 98195, USA