1
Sun W, Wang S, Bi J, Ning Z, Wang J, Hou H. Study on the response mechanisms and evolution prediction of groundwater microbial-toxicological indicators. WATER ENVIRONMENT RESEARCH : A RESEARCH PUBLICATION OF THE WATER ENVIRONMENT FEDERATION 2024; 96:e11131. [PMID: 39327691 DOI: 10.1002/wer.11131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2024] [Revised: 08/10/2024] [Accepted: 08/28/2024] [Indexed: 09/28/2024]
Abstract
This study aims to investigate the response mechanisms of groundwater microbial-toxicological indicators, specifically total bacteria count (TBC) and total coliform count (TCC), to water quality indicators and environmental conditions. Using data from a water source in the western plateau of China, a predictive model focusing on TBC and TCC was developed. An orthogonal experimental design was employed to manipulate environmental conditions such as temperature, pH, and porosity, facilitating laboratory experiments. These experiments measured pH, chemical oxygen demand (COD), oxidation-reduction potential (ORP), TBC, and TCC at varying depths and environmental conditions. Principal component analysis elucidated the mechanisms by which water quality indicators and environmental conditions affect groundwater microbial-toxicological indicators. A prediction model for these indicators in plateau regions was established based on a backpropagation neural network (BP-NN), using TBC and TCC as target variables and the newly extracted principal components as influencing factors. The results demonstrate that environmental conditions and water quality indicators primarily influence the evolution of groundwater microbial-toxicological indicators by altering the ionic charge quantities, redox conditions, and temperature of the groundwater. The predictive model for groundwater microbial-toxicological indicators shows trends consistent with experimental outcomes, with an average relative error of less than 15%, meeting engineering requirements. PRACTITIONER POINTS: The values of total bacteria count (TBC) and total coliform count (TCC) under different conditions were obtained by column experiments. The influence mechanism of environmental conditions and groundwater indicators on TBC and TCC was elaborated by principal component analysis. TBC and TCC prediction models were established through the investigation of water sources in a plateau area and laboratory experiments.
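The abstract above reports an average relative error below 15% for the predictive model. As a minimal sketch of how such a figure is typically computed, assuming the usual definition (mean of |predicted − observed| / |observed|), the following uses invented TBC values; the function name, the data, and the definition itself are illustrative assumptions, not taken from the paper.

```python
def average_relative_error(observed, predicted):
    """Mean of |predicted - observed| / |observed| over paired samples.

    Assumes every observed value is non-zero.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)


# Hypothetical TBC values, invented for illustration only.
observed_tbc = [120.0, 95.0, 150.0, 80.0]
predicted_tbc = [110.0, 100.0, 165.0, 76.0]

are = average_relative_error(observed_tbc, predicted_tbc)
meets_threshold = are < 0.15  # the 15% bound quoted in the abstract
print(f"average relative error = {are:.3f}, below 15%: {meets_threshold}")
```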
Affiliation(s)
- Weichao Sun
- Institute of Hydrogeology and Environmental Geology, Chinese Academy of Geological Sciences, Shijiazhuang, Hebei, China
- School of Chinese Academy of Geological Sciences, China University of Geosciences (Beijing), Beijing, China
- Key Laboratory of Groundwater Remediation of Hebei Province & China Geological Survey, Zhengding, Hebei, China
- Shuaiwei Wang
- Institute of Hydrogeology and Environmental Geology, Chinese Academy of Geological Sciences, Shijiazhuang, Hebei, China
- Key Laboratory of Groundwater Remediation of Hebei Province & China Geological Survey, Zhengding, Hebei, China
- Junbo Bi
- Institute of Hydrogeology and Environmental Geology, Chinese Academy of Geological Sciences, Shijiazhuang, Hebei, China
- School of Chinese Academy of Geological Sciences, China University of Geosciences (Beijing), Beijing, China
- Key Laboratory of Groundwater Remediation of Hebei Province & China Geological Survey, Zhengding, Hebei, China
- Zhuo Ning
- Institute of Hydrogeology and Environmental Geology, Chinese Academy of Geological Sciences, Shijiazhuang, Hebei, China
- Key Laboratory of Groundwater Remediation of Hebei Province & China Geological Survey, Zhengding, Hebei, China
- Jingjing Wang
- Institute of Hydrogeology and Environmental Geology, Chinese Academy of Geological Sciences, Shijiazhuang, Hebei, China
- Key Laboratory of Groundwater Remediation of Hebei Province & China Geological Survey, Zhengding, Hebei, China
- Haibo Hou
- China Construction Eighth Engineering Bureau No.1 Construction Co., Ltd, Jinan, Shandong, China
2
Hosseiniyan Khatibi SM, Dimaano NG, Veliz E, Sundaresan V, Ali J. Exploring and exploiting the rice phytobiome to tackle climate change challenges. PLANT COMMUNICATIONS 2024:101078. [PMID: 39233440 DOI: 10.1016/j.xplc.2024.101078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2024] [Revised: 08/07/2024] [Accepted: 09/02/2024] [Indexed: 09/06/2024]
Abstract
The future of agriculture is uncertain under the current climate change scenario. Climate change directly and indirectly affects the biotic and abiotic elements that control agroecosystems, jeopardizing the safety of the world's food supply. A new area that focuses on characterizing the phytobiome is emerging. The phytobiome comprises plants and their immediate surroundings, involving numerous interdependent microscopic and macroscopic organisms that affect the health and productivity of plants. Phytobiome studies primarily focus on the microbial communities associated with plants, which are referred to as the plant microbiome. The development of high-throughput sequencing technologies over the past 10 years has dramatically advanced our understanding of the structure, functionality, and dynamics of the phytobiome; however, comprehensive methods for using this knowledge are lacking, particularly for major crops such as rice. Considering the impact of rice production on world food security, gaining fresh perspectives on the interdependent and interrelated components of the rice phytobiome could enhance rice production and crop health, sustain rice ecosystem function, and combat the effects of climate change. Our review re-conceptualizes the complex dynamics of the microscopic and macroscopic components in the rice phytobiome as influenced by human interventions and changing environmental conditions driven by climate change. We also discuss interdisciplinary and systematic approaches to decipher and reprogram the sophisticated interactions in the rice phytobiome using novel strategies and cutting-edge technology. Merging the gigantic datasets and complex information on the rice phytobiome and their application in the context of regenerative agriculture could lead to sustainable rice farming practices that are resilient to the impacts of climate change.
Affiliation(s)
- Niña Gracel Dimaano
- International Rice Research Institute, Los Baños, Laguna, Philippines; College of Agriculture and Food Science, University of the Philippines Los Baños, Los Baños, Laguna, Philippines
- Esteban Veliz
- College of Biological Sciences, University of California, Davis, Davis, CA, USA
- Venkatesan Sundaresan
- College of Biological Sciences, University of California, Davis, Davis, CA, USA; College of Agricultural and Environmental Sciences, University of California, Davis, Davis, CA, USA
- Jauhar Ali
- International Rice Research Institute, Los Baños, Laguna, Philippines.
3
Oliver A, Kay M, Lemay DG. TaxaHFE: a machine learning approach to collapse microbiome datasets using taxonomic structure. BIOINFORMATICS ADVANCES 2023; 3:vbad165. [PMID: 38046097 PMCID: PMC10689668 DOI: 10.1093/bioadv/vbad165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2023] [Revised: 09/28/2023] [Accepted: 11/27/2023] [Indexed: 12/05/2023]
Abstract
Motivation: Biologists increasingly turn to machine learning models not just to predict, but to explain. Feature reduction is a common approach to improve both the performance and interpretability of models. However, some biological datasets, such as microbiome data, are inherently organized in a taxonomy, but these hierarchical relationships are not leveraged during feature reduction. We sought to design a feature engineering algorithm to exploit relationships in hierarchically organized biological data. Results: We designed an algorithm, called TaxaHFE, to collapse information-poor features into their higher taxonomic levels. We applied TaxaHFE to six previously published datasets and found, on average, a 90% reduction in the number of features (SD = 5.1%) compared to using the most complete taxonomy. Using machine learning to compare the most resolved taxonomic level (i.e. species) against TaxaHFE-preprocessed features, models based on TaxaHFE features achieved an average increase of 3.47% in receiver operator curve area under the curve. Compared to other hierarchical feature engineering implementations, TaxaHFE introduces the novel ability to consider both categorical and continuous response variables to inform the feature set collapse. Importantly, we find TaxaHFE's ability to reduce hierarchically organized features to a more information-rich subset increases the interpretability of models. Availability and implementation: TaxaHFE is available as a Docker image and as R code at https://github.com/aoliver44/taxaHFE.
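To make the idea of collapsing information-poor features into a higher taxonomic level concrete, here is a toy sketch. It is not the TaxaHFE algorithm: scoring each species by its Pearson correlation with the response, the 0.5 cut-off, and all names and data are assumptions made purely for illustration.

```python
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def collapse_weak_species(abundance, parent, response, min_abs_r=0.5):
    """Keep species with |r| >= min_abs_r; sum the others into their genus."""
    kept = {}
    pooled = defaultdict(lambda: [0.0] * len(response))
    for species, values in abundance.items():
        if abs(pearson(values, response)) >= min_abs_r:
            kept[species] = list(values)
        else:
            genus = parent[species]
            pooled[genus] = [a + b for a, b in zip(pooled[genus], values)]
    kept.update(pooled)
    return kept


# Hypothetical abundances for four samples and a binary response.
response = [0, 0, 1, 1]
abundance = {
    "species_a": [1, 2, 8, 9],  # tracks the response closely
    "species_b": [3, 1, 2, 3],  # weak signal
    "species_c": [2, 3, 3, 2],  # no signal
}
parent = {"species_a": "genus_y", "species_b": "genus_x", "species_c": "genus_x"}

features = collapse_weak_species(abundance, parent, response)
print(features)
```

In this toy data the two weak species share a genus, so three features reduce to two: the informative species is kept as-is and the others are merged into one genus-level feature.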
Affiliation(s)
- Andrew Oliver
- USDA-ARS Western Human Nutrition Research Center, Davis, CA 95616, United States
- Matthew Kay
- Independent Researcher, Washington, DC 20002, United States
- Danielle G Lemay
- USDA-ARS Western Human Nutrition Research Center, Davis, CA 95616, United States
- Department of Nutrition, University of California, Davis, Davis, CA 95616, United States
- Genome Center, University of California, Davis, Davis, CA 95616, United States
4
Arjmandi M, Fattahi M, Motevassel M, Rezaveisi H. Evaluating algorithms of decision tree, support vector machine and regression for anode side catalyst data in proton exchange membrane water electrolysis. Sci Rep 2023; 13:20309. [PMID: 37985795 PMCID: PMC10662483 DOI: 10.1038/s41598-023-47174-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Accepted: 11/09/2023] [Indexed: 11/22/2023] Open
Abstract
The use of chemical compounds and fossil fuels causes various problems that widely affect the environment, including acid rain and polar ice melting, so a number of studies have focused on replacing nonrenewable energy sources with renewable ones in order to produce clean fuels. Among these, hydrogen emerges as a quintessential clean fuel, garnering substantial attention for its potential to be synthesized from the electric power generated by renewable sources like nuclear and solar energies. This is achieved through the employment of a proton exchange membrane water electrolysis (PEMWE) system, widely recognized as one of the most proficient and economically viable technologies for effecting the separation of H2O into H+ and OH-. In this study, the important parameters affecting the anode-side catalyst in PEMWE were discussed and analyzed with machine-learning (ML) algorithms through a data science (DS) procedure. Various machine learning models were subjected to comparison, wherein the Decision Tree models, specifically those configured with maximum depths of 3 and 4, emerged as the optimal choices, attaining a perfect 100% accuracy across both Dataset 1 and Dataset 2. Moreover, a notable enhancement in accuracy was observed for the Support Vector Machine (SVM) model, rising from 0.79 for Dataset 1 to 0.82 for Dataset 2. In stark contrast, the remaining models experienced a decrement in their accuracy scores. This phenomenon underscores the pivotal role played by the data generation process in rendering the models more faithful to real-world scenarios.
Affiliation(s)
- Mahdi Arjmandi
- Chemical Engineering Department, Abadan Faculty of Petroleum Engineering, Petroleum University of Technology, Abadan, Iran
- Moslem Fattahi
- Chemical Engineering Department, Abadan Faculty of Petroleum Engineering, Petroleum University of Technology, Abadan, Iran.
- Department of Chemical and Materials Engineering, University of Alberta, Edmonton, AB, Canada.
- Mohsen Motevassel
- Chemical Engineering Department, Abadan Faculty of Petroleum Engineering, Petroleum University of Technology, Abadan, Iran
- Hosna Rezaveisi
- Chemical Engineering Department, Faculty of Engineering, Razi University, Kermanshah, Iran
5
Al-Aamri A, Kamarul Azman S, Daw Elbait G, Alsafar H, Henschel A. Critical assessment of on-premise approaches to scalable genome analysis. BMC Bioinformatics 2023; 24:354. [PMID: 37735350 PMCID: PMC10512525 DOI: 10.1186/s12859-023-05470-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Accepted: 09/08/2023] [Indexed: 09/23/2023] Open
Abstract
BACKGROUND Plummeting DNA sequencing cost in recent years has enabled genome sequencing projects to scale up by several orders of magnitude, which is transforming genomics into a highly data-intensive field of research. This development provides the much needed statistical power required for genotype-phenotype predictions in complex diseases. METHODS In order to efficiently leverage the wealth of information, we here assessed several genomic data science tools. The rationale to focus on on-premise installations is to cope with situations where data confidentiality and compliance regulations etc. rule out cloud based solutions. We established a comprehensive qualitative and quantitative comparison between BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability. RESULTS Tools that leverage sophisticated data structures are noted as the most suitable for large-scale projects in varying degrees of scalability in comparison to flat-file manipulation (e.g., BCFtools, and SnpSift). Remarkably, for small to mid-size projects, even lightweight relational database. CONCLUSION The assessment criteria provide insights into the typical questions posed in scalable genomics and serve as guidance for the development of scalable computational infrastructure in genomics.
Affiliation(s)
- Amira Al-Aamri
- Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Syafiq Kamarul Azman
- Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Gihan Daw Elbait
- Department of Biology, College of Arts and Sciences, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Habiba Alsafar
- Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Department of Biomedical Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Andreas Henschel
- Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
- Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
6
Xu W, Wang T, Wang N, Zhang H, Zha Y, Ji L, Chu Y, Ning K. Artificial intelligence-enabled microbiome-based diagnosis models for a broad spectrum of cancer types. Brief Bioinform 2023; 24:7152257. [PMID: 37141141 DOI: 10.1093/bib/bbad178] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/26/2022] [Revised: 04/07/2023] [Accepted: 04/18/2023] [Indexed: 05/05/2023] Open
Abstract
Microbiome-based diagnosis of cancer is an increasingly important supplement for the genomics approach in cancer diagnosis, yet current models for microbiome-based diagnosis of cancer face difficulties in generality: not only diagnosis models could not be adapted from one cancer to another, but models built based on microbes from tissues could not be adapted for diagnosis based on microbes from blood. Therefore, a microbiome-based model suitable for a broad spectrum of cancer types is urgently needed. Here we have introduced DeepMicroCancer, a diagnosis model using artificial intelligence techniques for a broad spectrum of cancer types. Built based on the random forest models it has enabled superior performances on more than twenty types of cancers' tissue samples. And by using the transfer learning techniques, improved accuracies could be obtained, especially for cancer types with only a few samples, which could satisfy the requirement in clinical scenarios. Moreover, transfer learning techniques have enabled high diagnosis accuracy that could also be achieved for blood samples. These results indicated that certain sets of microbes could, if excavated using advanced artificial techniques, reveal the intricate differences among cancers and healthy individuals. Collectively, DeepMicroCancer has provided a new venue for accurate diagnosis of cancer based on tissue and blood materials, which could potentially be used in clinics.
Affiliation(s)
- Wei Xu
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Teng Wang
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Nan Wang
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Haohong Zhang
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Yuguo Zha
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Lei Ji
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Geneis Beijing Co., Ltd., Beijing, 100102, China
- Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao 266000, China
- Yuwen Chu
- Geneis Beijing Co., Ltd., Beijing, 100102, China
- Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao 266000, China
- School of Electrical & Information Engineering, Anhui University of Technology, Anhui, 243002, China
- Kang Ning
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
7
Tapio M, Fischer D, Mäntysaari P, Tapio I. Rumen Microbiota Predicts Feed Efficiency of Primiparous Nordic Red Dairy Cows. Microorganisms 2023; 11:1116. [PMID: 37317090 DOI: 10.3390/microorganisms11051116] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/24/2023] [Revised: 04/17/2023] [Accepted: 04/23/2023] [Indexed: 06/16/2023] Open
Abstract
Efficient feed utilization in dairy cows is crucial for economic and environmental reasons. The rumen microbiota plays a significant role in feed efficiency, but studies utilizing microbial data to predict host phenotype are limited. In this study, 87 primiparous Nordic Red dairy cows were ranked for feed efficiency during their early lactation based on residual energy intake, and the rumen liquid microbial ecosystem was subsequently evaluated using 16S rRNA amplicon and metagenome sequencing. The study used amplicon data to build an extreme gradient boosting model, demonstrating that taxonomic microbial variation can predict efficiency (r_test = 0.55). Prediction interpreters and microbial network revealed that predictions were based on microbial consortia and the efficient animals had more of the highly interacting microbes and consortia. Rumen metagenome data was used to evaluate carbohydrate-active enzymes and metabolic pathway differences between efficiency phenotypes. The study showed that an efficient rumen had a higher abundance of glycoside hydrolases, while an inefficient rumen had more glycosyl transferases. Enrichment of metabolic pathways was observed in the inefficient group, while efficient animals emphasized bacterial environmental sensing and motility over microbial growth. The results suggest that inter-kingdom interactions should be further analyzed to understand their association with the feed efficiency of animals.
Affiliation(s)
- Miika Tapio
- Genomics and Breeding, Production Systems, Natural Resources Institute Finland (Luke), 31600 Jokioinen, Finland
- Daniel Fischer
- Applied Statistical Methods, Natural Resources, Natural Resources Institute Finland (Luke), 31600 Jokioinen, Finland
- Päivi Mäntysaari
- Animal Nutrition, Production Systems, Natural Resources Institute Finland (Luke), 31600 Jokioinen, Finland
- Ilma Tapio
- Genomics and Breeding, Production Systems, Natural Resources Institute Finland (Luke), 31600 Jokioinen, Finland
8
Kumar R, Yadav G, Kuddus M, Ashraf GM, Singh R. Unlocking the microbial studies through computational approaches: how far have we reached? ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH INTERNATIONAL 2023; 30:48929-48947. [PMID: 36920617 PMCID: PMC10016191 DOI: 10.1007/s11356-023-26220-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 03/23/2022] [Accepted: 02/24/2023] [Indexed: 04/16/2023]
Abstract
The metagenomics approach accelerated the study of genetic information from uncultured microbes and complex microbial communities. In silico research also facilitated an understanding of protein-DNA interactions, protein-protein interactions, docking between proteins and phyto/biochemicals for drug design, and modeling of the 3D structure of proteins. These in silico approaches provided insight into analyzing pathogenic and nonpathogenic strains that helped in the identification of probable genes for vaccines and antimicrobial agents and comparing whole-genome sequences to microbial evolution. Artificial intelligence, more precisely machine learning (ML) and deep learning (DL), has proven to be a promising approach in the field of microbiology to handle, analyze, and utilize large data that are generated through nucleic acid sequencing and proteomics. This enabled the understanding of the functional and taxonomic diversity of microorganisms. ML and DL have been used in the prediction and forecasting of diseases and applied to trace environmental contaminants and environmental quality. This review presents an in-depth analysis of the recent application of in silico approaches in microbial genomics, proteomics, functional diversity, vaccine development, and drug design.
Affiliation(s)
- Rajnish Kumar
- Amity Institute of Biotechnology, Amity University Uttar Pradesh Lucknow Campus, Lucknow, Uttar Pradesh, India
- Department of Veterinary Medicine and Surgery, College of Veterinary Medicine, University of Missouri, Columbia, MO, USA
- Garima Yadav
- Amity Institute of Biotechnology, Amity University Uttar Pradesh Lucknow Campus, Lucknow, Uttar Pradesh, India
- Mohammed Kuddus
- Department of Biochemistry, College of Medicine, University of Hail, Hail, Saudi Arabia
- Ghulam Md Ashraf
- Department of Medical Laboratory Sciences, College of Health Sciences, and Sharjah Institute for Medical Research, University of Sharjah, Sharjah, 27272, United Arab Emirates
- Rachana Singh
- Amity Institute of Biotechnology, Amity University Uttar Pradesh Lucknow Campus, Lucknow, Uttar Pradesh, India.
9
Shen Y, Zhu J, Deng Z, Lu W, Wang H. EnsDeepDP: An Ensemble Deep Learning Approach for Disease Prediction Through Metagenomics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:986-998. [PMID: 36001521 DOI: 10.1109/tcbb.2022.3201295] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
A growing number of studies show that the human microbiome plays a vital role in human health and can be a crucial factor in predicting certain human diseases. However, microbiome data are often characterized by the limited samples and high-dimensional features, which pose a great challenge for machine learning methods. Therefore, this paper proposes a novel ensemble deep learning disease prediction method that combines unsupervised and supervised learning paradigms. First, unsupervised deep learning methods are used to learn the potential representation of the sample. Afterwards, the disease scoring strategy is developed based on the deep representations as the informative features for ensemble analysis. To ensure the optimal ensemble, a score selection mechanism is constructed, and performance boosting features are engaged with the original sample. Finally, the composite features are trained with gradient boosting classifier for health status decision. For case study, the ensemble deep learning flowchart has been demonstrated on six public datasets extracted from the human microbiome profiling. The results show that compared with the existing algorithms, our framework achieves better performance on disease prediction.
10
Shtossel O, Isakov H, Turjeman S, Koren O, Louzoun Y. Ordering taxa in image convolution networks improves microbiome-based machine learning accuracy. Gut Microbes 2023; 15:2224474. [PMID: 37345233 PMCID: PMC10288916 DOI: 10.1080/19490976.2023.2224474] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 12/15/2022] [Accepted: 06/08/2023] [Indexed: 06/23/2023] Open
Abstract
The human gut microbiome is associated with a large number of disease etiologies. As such, it is a natural candidate for machine-learning-based biomarker development for multiple diseases and conditions. The microbiome is often analyzed using 16S rRNA gene sequencing or shotgun metagenomics. However, several properties of microbial sequence-based studies hinder machine learning (ML), including non-uniform representation, a small number of samples compared with the dimension of each sample, and sparsity of the data, with the majority of taxa present in a small subset of samples. We show here using a graph representation that the cladogram structure is as informative as the taxa frequency. We then suggest a novel method to combine information from different taxa and improve data representation for ML using microbial taxonomy. iMic (image microbiome) translates the microbiome to images through an iterative ordering scheme, and applies convolutional neural networks to the resulting image. We show that iMic has a higher precision in static microbiome gene sequence-based ML than state-of-the-art methods. iMic also facilitates the interpretation of the classifiers through an explainable artificial intelligence (AI) algorithm to iMic to detect taxa relevant to each condition. iMic is then extended to dynamic microbiome samples by translating them to movies.
Affiliation(s)
- Oshrit Shtossel
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
- Haim Isakov
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
- Sondra Turjeman
- The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Omry Koren
- The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Yoram Louzoun
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
11
Gavin PG, Kim KW, Craig ME, Hill MM, Hamilton-Williams EE. Multi-omic interactions in the gut of children at the onset of islet autoimmunity. MICROBIOME 2022; 10:230. [PMID: 36527134 PMCID: PMC9756488 DOI: 10.1186/s40168-022-01425-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/18/2022] [Accepted: 11/11/2022] [Indexed: 06/17/2023]
Abstract
BACKGROUND The gastrointestinal ecosystem is a highly complex environment with a profound influence on human health. Inflammation in the gut, linked to an altered gut microbiome, has been associated with the development of multiple human conditions including type 1 diabetes (T1D). Viruses infecting the gastrointestinal tract, especially enteroviruses, are also thought to play an important role in T1D pathogenesis possibly via overlapping mechanisms. However, it is not known whether the microbiome and virome act together or which risk factor may be of greater importance at the time when islet autoimmunity is initiated. RESULTS Here, we apply an integrative approach to combine comprehensive fecal virome, microbiome, and metaproteome data sampled before and at the onset of islet autoimmunity in 40 children at increased risk of T1D. We show strong age-related effects, with microbial and metaproteome diversity increasing with age while host antibody number and abundance declined with age. Mastadenovirus, which has been associated with a reduced risk of T1D, was associated with profound changes in the metaproteome indicating a functional shift in the microbiota. Multi-omic factor analysis modeling revealed a cluster of proteins associated with carbohydrate transport from the genus Faecalibacterium were associated with islet autoimmunity. CONCLUSIONS These findings demonstrate the interrelatedness of the gut microbiota, metaproteome and virome in young children. We show a functional remodeling of the gut microbiota accompanies both islet autoimmunity and viral infection with a switch in function in Faecalibacterium occurring at the onset of islet autoimmunity.
Affiliation(s)
- Patrick G Gavin
- Frazer Institute, The University of Queensland, Woolloongabba, QLD, Australia
- Present Address: Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Present Address: Harvard Medical School, Boston, MA, USA
- Ki Wook Kim
- Virology Research Laboratory, Prince of Wales Hospital Randwick, Sydney, Australia
- School of Clinical Medicine, Discipline of Paediatrics and Child Health, Faculty of Medicine and Health, University of New South Wales, Sydney, Australia
- Maria E Craig
- Virology Research Laboratory, Prince of Wales Hospital Randwick, Sydney, Australia
- School of Clinical Medicine, Discipline of Paediatrics and Child Health, Faculty of Medicine and Health, University of New South Wales, Sydney, Australia
- Institute of Endocrinology and Diabetes, Children's Hospital at Westmead, Sydney, Australia
- Discipline of Child and Adolescent Health, University of Sydney, Sydney, Australia
- Michelle M Hill
- QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
|
12
|
Shen WX, Liang SR, Jiang YY, Chen YZ. Enhanced metagenomic deep learning for disease prediction and consistent signature recognition by restructured microbiome 2D representations. PATTERNS (NEW YORK, N.Y.) 2022; 4:100658. [PMID: 36699735 PMCID: PMC9868677 DOI: 10.1016/j.patter.2022.100658] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/01/2022] [Revised: 07/15/2022] [Accepted: 11/15/2022] [Indexed: 12/23/2022]
Abstract
Metagenomic analysis has been explored for disease diagnosis and biomarker discovery. Low sample sizes, high dimensionality, and sparsity of metagenomic data challenge metagenomic investigations. Here, an unsupervised microbial embedding, grouping, and mapping algorithm (MEGMA) was developed to transform metagenomic data into individualized multichannel microbiome 2D representation by manifold learning and clustering of microbial profiles (e.g., composition, abundance, hierarchy, and taxonomy). These 2D representations enable enhanced disease prediction by established ConvNet-based AggMapNet models, outperforming the commonly used machine learning and deep learning models in metagenomic benchmark datasets. These 2D representations combined with AggMapNet explainable module robustly identified more reliable and replicable disease-prediction microbes (biomarkers). Employing the MEGMA-AggMapNet pipeline for biomarker identification from 5 disease datasets, 84% of the identified biomarkers have been described in over 74 distinct works as important for these diseases. Moreover, the method also discovered highly consistent sets of biomarkers in cross-cohort colorectal cancer (CRC) patients and microbial shifts in different CRC stages.
Affiliation(s)
- Wan Xiang Shen
- The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China,Bioinformatics and Drug Design Group, Department of Pharmacy, and Center for Computational Science and Engineering, National University of Singapore, Singapore 117543, Singapore
- Shu Ran Liang
- The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
- Yu Yang Jiang
- The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China,Corresponding author
- Yu Zong Chen
- The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China,Shenzhen Bay Laboratory, Shenzhen 518000, China,Corresponding author
|
13
|
Interpreting tree ensemble machine learning models with endoR. PLoS Comput Biol 2022; 18:e1010714. [PMID: 36516158 PMCID: PMC9797088 DOI: 10.1371/journal.pcbi.1010714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Revised: 12/28/2022] [Accepted: 11/07/2022] [Indexed: 12/15/2022]
Abstract
Tree ensemble machine learning models are increasingly used in microbiome science as they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. While such models are often good at predicting phenotypes based on microbiome data, they only yield limited insights into how microbial taxa may be associated. We developed endoR, a method to interpret tree ensemble models. First, endoR simplifies the fitted model into a decision ensemble. Then, it extracts information on the importance of individual features and their pairwise interactions, displaying them as an interpretable network. Both the endoR network and importance scores provide insights into how features, and interactions between them, contribute to the predictive performance of the fitted model. Adjustable regularization and bootstrapping help reduce the complexity and ensure that only essential parts of the model are retained. We assessed endoR on both simulated and real metagenomic data. We found endoR to have comparable accuracy to other common approaches while easing and enhancing model interpretation. Using endoR, we also confirmed published results on gut microbiome differences between cirrhotic and healthy individuals. Finally, we utilized endoR to explore associations between human gut methanogens and microbiome components. Indeed, these hydrogen consumers are expected to interact with fermenting bacteria in a complex syntrophic network. Specifically, we analyzed a global metagenome dataset of 2203 individuals and confirmed the previously reported association between Methanobacteriaceae and Christensenellales. Additionally, we observed that Methanobacteriaceae are associated with a network of hydrogen-producing bacteria. Our method accurately captures how tree ensembles use features and interactions between them to predict a response. 
As demonstrated by our applications, the resultant visualizations and summary outputs facilitate model interpretation and enable the generation of novel hypotheses about complex systems.
|
14
|
Loganathan T, Priya Doss C G. The influence of machine learning technologies in gut microbiome research and cancer studies - A review. Life Sci 2022; 311:121118. [DOI: 10.1016/j.lfs.2022.121118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 10/19/2022] [Accepted: 10/19/2022] [Indexed: 11/18/2022]
|
15
|
Zeng W, Gautam A, Huson DH. DeepToA: An Ensemble Deep-Learning Approach to Predicting the Theater of Activity of a Microbiome. Bioinformatics 2022; 38:4670-4676. [PMID: 36029249 DOI: 10.1093/bioinformatics/btac584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Revised: 07/19/2022] [Accepted: 08/26/2022] [Indexed: 11/14/2022]
Abstract
MOTIVATION Metagenomics is the study of microbiomes using DNA sequencing. A microbiome consists of an assemblage of microbes that is associated with a "theater of activity" (ToA). An important question is, to what degree does the taxonomic and functional content of the former depend on the (details of the) latter? Here we investigate a related technical question: Given a taxonomic and/or functional profile estimated from metagenomic sequencing data, how to predict the associated ToA? We present a deep-learning approach to this question. We use both taxonomic and functional profiles as input. We apply node2vec to embed hierarchical taxonomic profiles into numerical vectors. We then perform dimension reduction using clustering, to address the sparseness of the taxonomic data and thus make the problem more amenable to deep-learning algorithms. Functional features are combined with textual descriptions of protein families or domains. We present an ensemble deep-learning framework DeepToA for predicting the "theater of activity" of a microbial community, based on taxonomic and functional profiles. We use SHAP (SHapley Additive exPlanations) values to determine which taxonomic and functional features are important for the prediction. RESULTS Based on 7,560 metagenomic profiles downloaded from MGnify, classified into ten different theaters of activity, we demonstrate that DeepToA has an accuracy of 98.30%. We show that adding textual information to functional features increases the accuracy. AVAILABILITY Our approach is available at http://ab.inf.uni-tuebingen.de/software/deeptoa. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenhuan Zeng
- Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, 72076, Germany
- Anupam Gautam
- Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, 72076, Germany.,International Max Planck Research School "From Molecules to Organisms", Max Planck Institute for Biology Tübingen, Max-Planck-Ring 5, Tübingen, 72076, Germany
- Daniel H Huson
- Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, 72076, Germany.,International Max Planck Research School "From Molecules to Organisms", Max Planck Institute for Biology Tübingen, Max-Planck-Ring 5, Tübingen, 72076, Germany.,Cluster of Excellence: Controlling Microbes to Fight Infection, Tübingen, Germany
|
16
|
Zhou YH, Sun G. Improve the Colorectal Cancer Diagnosis Using Gut Microbiome Data. Front Mol Biosci 2022; 9:921945. [PMID: 36032686 PMCID: PMC9415616 DOI: 10.3389/fmolb.2022.921945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Accepted: 06/16/2022] [Indexed: 11/17/2022]
Abstract
In the United States, colorectal cancer is the second largest cause of cancer death, and accurate early detection and identification of high-risk patients is a high priority. Although fecal screening tests are available, the close relationship between colorectal cancer and the gut microbiome has generated considerable interest. We describe a machine learning method for gut microbiome data to assist in diagnosing colorectal cancer. Our methodology integrates feature engineering, mediation analysis, statistical modeling, and network analysis into a novel unified pipeline. Simulation results illustrate the value of the method in comparison to existing methods. For predicting colorectal cancer in two real datasets, this pipeline showed an 8.7% higher prediction accuracy and 13% higher area under the receiver operator characteristic curve than other published work. Additionally, the approach highlights important colorectal cancer-related taxa for prioritization, such as high levels of Bacteroides fragilis, which can help elucidate disease pathology. Our algorithms and approach can be widely applied for colorectal cancer prediction using either 16S rRNA or shotgun metagenomics data.
Affiliation(s)
- Yi-Hui Zhou
- Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
- *Correspondence: Yi-Hui Zhou,
- George Sun
- Alston Ridge Middle School, Cary, NC, United States
|
17
|
McElhinney JMWR, Catacutan MK, Mawart A, Hasan A, Dias J. Interfacing Machine Learning and Microbial Omics: A Promising Means to Address Environmental Challenges. Front Microbiol 2022; 13:851450. [PMID: 35547145 PMCID: PMC9083327 DOI: 10.3389/fmicb.2022.851450] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/09/2022] [Accepted: 03/14/2022] [Indexed: 11/13/2022]
Abstract
Microbial communities are ubiquitous and carry an exceptionally broad metabolic capability. Upon environmental perturbation, microbes are also amongst the first natural responsive elements with perturbation-specific cues and markers. These communities are thereby uniquely positioned to inform on the status of environmental conditions. The advent of microbial omics has led to an unprecedented volume of complex microbiological data sets. Importantly, these data sets are rich in biological information with potential for predictive environmental classification and forecasting. However, the patterns in this information are often hidden amongst the inherent complexity of the data. There has been a continued rise in the development and adoption of machine learning (ML) and deep learning architectures for solving research challenges of this sort. Indeed, the interface between molecular microbial ecology and artificial intelligence (AI) appears to show considerable potential for significantly advancing environmental monitoring and management practices through their application. Here, we provide a primer for ML, highlight the notion of retaining biological sample information for supervised ML, discuss workflow considerations, and review the state of the art of the exciting, yet nascent, interdisciplinary field of ML-driven microbial ecology. Current limitations in this sphere of research are also addressed to frame a forward-looking perspective toward the realization of what we anticipate will become a pivotal toolkit for addressing environmental monitoring and management challenges in the years ahead.
Affiliation(s)
- James M. W. R. McElhinney
- Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
- Aurelie Mawart
- Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
- Ayesha Hasan
- Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
- Department of Biomedical Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
- Jorge Dias
- EECS, Center for Autonomous Robotic Systems, Khalifa University, Abu Dhabi, United Arab Emirates
|
18
|
Chen X, Zhu Z, Zhang W, Wang Y, Wang F, Yang J, Wong KC. Human disease prediction from microbiome data by multiple feature fusion and deep learning. iScience 2022; 25:104081. [PMID: 35372808 PMCID: PMC8971930 DOI: 10.1016/j.isci.2022.104081] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/11/2021] [Revised: 09/16/2021] [Accepted: 03/13/2022] [Indexed: 10/29/2022]
Abstract
Human disease prediction from microbiome data has broad implications in metagenomics. Existing methods rarely consider abundance profiles from both known and unknown microbial organisms, or capture the taxonomic relationships among microbial taxa, leading to significant information loss. On the other hand, deep learning has shown unprecedented advantages in classification tasks for its feature-learning ability. However, it encounters the opposite situation in metagenome-based disease prediction, since high-dimensional low-sample-size metagenomic datasets can lead to severe overfitting, and black-box models fail to provide biological explanations. To circumvent the related problems, we developed MetaDR, a comprehensive machine learning-based framework that integrates various information and deep learning to predict human diseases. Experimental results indicate that MetaDR achieves competitive prediction performance with a reduction in running time, and effectively discovers the informative features with biological insights.
Affiliation(s)
- Xingjian Chen
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Zifan Zhu
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
- Weitong Zhang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Yuchen Wang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Fuzhou Wang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Jianyi Yang
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR.,Hong Kong Institute for Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
|
19
|
Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa. PLoS Comput Biol 2022; 18:e1010066. [PMID: 35446845 PMCID: PMC9064115 DOI: 10.1371/journal.pcbi.1010066] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/13/2021] [Revised: 05/03/2022] [Accepted: 03/29/2022] [Indexed: 12/14/2022]
Abstract
Machine learning-based classification approaches are widely used to predict host phenotypes from microbiome data. Classifiers are typically employed by considering operational taxonomic units or relative abundance profiles as input features. Such types of data are intrinsically sparse, which opens the opportunity to make predictions from the presence/absence rather than the relative abundance of microbial taxa. This also poses the question whether it is the presence rather than the abundance of particular taxa to be relevant for discrimination purposes, an aspect that has been so far overlooked in the literature. In this paper, we aim at filling this gap by performing a meta-analysis on 4,128 publicly available metagenomes associated with multiple case-control studies. At species-level taxonomic resolution, we show that it is the presence rather than the relative abundance of specific microbial taxa to be important when building classification models. Such findings are robust to the choice of the classifier and confirmed by statistical tests applied to identifying differentially abundant/present taxa. Results are further confirmed at coarser taxonomic resolutions and validated on 4,026 additional 16S rRNA samples coming from 30 public case-control studies. The composition of the human microbiome has been linked to a large number of different diseases. In this context, classification methodologies based on machine learning approaches have represented a promising tool for diagnostic purposes from metagenomics data. The link between microbial population composition and host phenotypes has been usually performed by considering taxonomic profiles represented by relative abundances of microbial species. In this study, we show that it is more the presence rather than the relative abundance of microbial taxa to be relevant to maximize classification accuracy. 
This is accomplished by conducting a meta-analysis on more than 4,000 shotgun metagenomes coming from 25 case-control studies and in which original relative abundance data are degraded to presence/absence profiles. Findings are also extended to 16S rRNA data and advance the research field in building prediction models directly from human microbiome data.
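The degradation of relative-abundance profiles to presence/absence profiles mentioned in this abstract can be illustrated with a minimal sketch. The taxon names, the sample values, and the zero detection threshold below are hypothetical and are not taken from the study.

```python
def to_presence_absence(profile, threshold=0.0):
    """Map each taxon's relative abundance to 1 (present) or 0 (absent).

    `threshold` is an assumed detection cutoff; abundances strictly above
    it count as present.
    """
    return {taxon: int(abundance > threshold) for taxon, abundance in profile.items()}


# Hypothetical relative-abundance profile for one sample (fractions sum to 1).
sample = {"Bacteroides": 0.62, "Faecalibacterium": 0.38, "Fusobacterium": 0.0}
binary = to_presence_absence(sample)
print(binary)
```

A classifier trained on `binary`-style features sees only which taxa occur, which is the kind of input the abstract compares against relative abundances.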
|
20
|
Lin YC, Salleb-Aouissi A, Hooven TA. Interpretable prediction of necrotizing enterocolitis from machine learning analysis of premature infant stool microbiota. BMC Bioinformatics 2022; 23:104. [PMID: 35337258 PMCID: PMC8953333 DOI: 10.1186/s12859-022-04618-w] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 10/26/2021] [Accepted: 02/23/2022] [Indexed: 12/18/2022]
Abstract
Background Necrotizing enterocolitis (NEC) is a common, potentially catastrophic intestinal disease among very low birthweight premature infants. Affecting up to 15% of neonates born weighing less than 1500 g, NEC causes sudden-onset, progressive intestinal inflammation and necrosis, which can lead to significant bowel loss, multi-organ injury, or death. No unifying cause of NEC has been identified, nor is there any reliable biomarker that indicates an individual patient’s risk of the disease. Without a way to predict NEC in advance, the current medical strategy involves close clinical monitoring in an effort to treat babies with NEC as quickly as possible before irrecoverable intestinal damage occurs. In this report, we describe a novel machine learning application for generating dynamic, individualized NEC risk scores based on intestinal microbiota data, which can be determined from sequencing bacterial DNA from otherwise discarded infant stool. A central insight that differentiates our work from past efforts was the recognition that disease prediction from stool microbiota represents a specific subtype of machine learning problem known as multiple instance learning (MIL). Results We used a neural network-based MIL architecture, which we tested on independent datasets from two cohorts encompassing 3595 stool samples from 261 at-risk infants. Our report also introduces a new concept called the “growing bag” analysis, which applies MIL over time, allowing incorporation of past data into each new risk calculation. This approach allowed early, accurate NEC prediction, with a mean sensitivity of 86% and specificity of 90%. True-positive NEC predictions occurred an average of 8 days before disease onset. 
We also demonstrate that an attention-gated mechanism incorporated into our MIL algorithm permits interpretation of NEC risk, identifying several bacterial taxa that past work has associated with NEC, and potentially pointing the way toward new hypotheses about NEC pathogenesis. Our system is flexible, accepting microbiota data generated from targeted 16S or “shotgun” whole-genome DNA sequencing. It performs well in the setting of common, potentially confounding preterm neonatal clinical events such as perinatal cardiopulmonary depression, antibiotic administration, feeding disruptions, or transitions between breast feeding and formula. Conclusions We have developed and validated a robust MIL-based system for NEC prediction from harmlessly collected premature infant stool. While this system was developed for NEC prediction, our MIL approach may also be applicable to other diseases characterized by changes in the human microbiota. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04618-w.
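One plausible reading of the "growing bag" idea is that the bag scored at each time point contains every sample collected from that infant so far. The sketch below shows only that bookkeeping, under that assumption; the function name and the sample identifiers are hypothetical, and the neural MIL model that would score each bag is not reproduced here.

```python
def growing_bags(samples):
    """Yield, for each time point, the bag of all samples collected so far.

    `samples` is assumed to be in collection order for a single infant.
    """
    for end in range(1, len(samples) + 1):
        yield samples[:end]


# Hypothetical stool-sample identifiers for one infant, in collection order.
stool_samples = ["day3", "day7", "day12"]
bags = list(growing_bags(stool_samples))
for bag in bags:
    print(bag)
```

Each successive bag would be passed to the MIL model to produce an updated risk score, so that earlier samples keep contributing to later predictions.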
Affiliation(s)
- Yun Chao Lin
- Department of Computer Science, Columbia University, 1214 Amsterdam Ave., Mailcode 0401, New York, 10027, USA
- Ansaf Salleb-Aouissi
- Department of Computer Science, Columbia University, 1214 Amsterdam Ave., Mailcode 0401, New York, 10027, USA.
- Thomas A Hooven
- Department of Pediatrics, University of Pittsburgh School of Medicine, Pittsburgh, USA.,Richard King Mellon Institute for Pediatric Research, UPMC Children's Hospital of Pittsburgh, Pittsburgh, USA
|
21
|
Joishy TK, Jha A, Oudah M, Das S, Adak A, Deb D, Khan MR. Human Gut Microbes Associated with Systolic Blood Pressure. Int J Hypertens 2022; 2022:2923941. [PMID: 35154822 PMCID: PMC8831042 DOI: 10.1155/2022/2923941] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/28/2021] [Accepted: 12/31/2021] [Indexed: 11/17/2022]
Abstract
Emerging studies have revealed a strong link between the gut microbiome and several human diseases. Since the human gut microbiome mirrors variations in lifestyle and environment, whether associations between disease conditions and the gut microbiome are consistent across populations, particularly in communities practicing traditional subsistence strategies whose microbiomes differ markedly from those of industrialized populations, remains unknown. Cardiovascular diseases are the leading cause of mortality in India, affecting 55 million people, and high blood pressure is one of the primary risk factors for cardiovascular diseases. We examined associations between gut microbiome and blood pressure along with 14 other variables associated with lifestyle, dietary habits, disease conditions, and clinical blood markers in three Assamese populations. Our analysis reveals a robust link between the gut microbiome diversity and composition and systolic blood pressure. Moreover, several genera previously associated with hypertension in non-Indian populations were also associated with systolic blood pressure in this cohort, and these genera were predictors of elevated blood pressure in these populations. These findings confer opportunities to design personalized, preventative, and targeted interventions harnessing the gut microbiome to tackle the burden of cardiovascular diseases in India.
Affiliation(s)
- Tulsi Kumari Joishy
- Molecular Biology and Microbial Biotechnology Laboratory, Life Sciences Division, Institute of Advanced Study in Science and Technology (IASST), Guwahati, Assam, India
- Department of Molecular Biology and Biotechnology, Cotton University, Guwahati, Assam, India
- Aashish Jha
- Genome Heritage Group, Program in Biology, New York University Abu Dhabi, Abu Dhabi, UAE
- Mai Oudah
- Program of Computer Science, New York University Abu Dhabi, Abu Dhabi, UAE
- Santanu Das
- Molecular Biology and Microbial Biotechnology Laboratory, Life Sciences Division, Institute of Advanced Study in Science and Technology (IASST), Guwahati, Assam, India
- Department of Molecular Biology and Biotechnology, Cotton University, Guwahati, Assam, India
- Atanu Adak
- Molecular Biology and Microbial Biotechnology Laboratory, Life Sciences Division, Institute of Advanced Study in Science and Technology (IASST), Guwahati, Assam, India
- Dibyayan Deb
- Molecular Biology and Microbial Biotechnology Laboratory, Life Sciences Division, Institute of Advanced Study in Science and Technology (IASST), Guwahati, Assam, India
- Department of Molecular Biology and Biotechnology, Cotton University, Guwahati, Assam, India
- Mojibur Rohman Khan
- Molecular Biology and Microbial Biotechnology Laboratory, Life Sciences Division, Institute of Advanced Study in Science and Technology (IASST), Guwahati, Assam, India
|
22
|
Xiang L, Jin X, Liu Y, Ma Y, Jian Z, Wei Z, Li H, Li Y, Wang K. Prediction of the occurrence of calcium oxalate kidney stones based on clinical and gut microbiota characteristics. World J Urol 2021; 40:221-227. [PMID: 34427737 PMCID: PMC8813786 DOI: 10.1007/s00345-021-03801-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/03/2021] [Accepted: 08/02/2021] [Indexed: 02/08/2023]
Abstract
Purpose To predict the occurrence of calcium oxalate kidney stones based on clinical and gut microbiota characteristics. Methods Gut microbiota and clinical data from 180 subjects (120 for training set and 60 for validation) attending the West China Hospital (WCH) were collected between June 2018 and January 2021. Based on the gut microbiota and clinical data from 120 subjects (66 non-kidney stone individuals and 54 kidney stone patients), we evaluated eight machine learning methods to predict the occurrence of calcium oxalate kidney stones. Results With fivefold cross-validation, the random forest method produced the best area under the curve (AUC) of 0.94. We further applied random forest to an independent validation dataset with 60 samples (34 non-kidney stone individuals and 26 kidney stone patients), which yielded an AUC of 0.88. Conclusion Our results demonstrated that clinical data combined with gut microbiota characteristics may help predict the occurrence of kidney stones.
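For readers unfamiliar with the AUC values quoted in this abstract, the metric can be computed as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The sketch below uses made-up labels and scores, not data from the study, and does not reproduce the authors' random forest or cross-validation setup.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example scores
    higher than a randomly chosen negative one (ties count as 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = kidney stone, 0 = no stone) and predicted risk scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
value = auc(labels, scores)
print(round(value, 3))
```

In this toy example 7 of the 9 positive-negative pairs are ranked correctly, so the AUC is 7/9; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.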
Affiliation(s)
- Liyuan Xiang
- Department of Urology, Institute of Urology (Laboratory of Reconstructive Urology), West China Hospital, No. 37, Guoxue Alley, Chengdu, Sichuan Province, China.,Department of Clinical Research Management, West China Hospital, No. 37, Guoxue Alley, Chengdu, Sichuan Province, China
- Xi Jin
- Department of Urology, Institute of Urology (Laboratory of Reconstructive Urology), West China Hospital, No. 37, Guoxue Alley, Chengdu, Sichuan Province, China
- Yu Liu
- Department of Urology, Institute of Urology (Laboratory of Reconstructive Urology), West China Hospital, No. 37, Guoxue Alley, Chengdu, Sichuan Province, China
- Yucheng Ma
- Department of Urology, Institute of Urology (Laboratory of Reconstructive Urology), West China Hospital, No. 37, Guoxue Alley, Chengdu, Sichuan Province, China
- Zhongyu Jian
- Department of Urology, Institute of Urology (Laboratory of Reconstructive Urology), West China Hospital, No. 37, Guoxue Alley, Chengdu, Sichuan Province, China
- Zhitao Wei
- Department of Urology, Institute of Urology (Laboratory of Reconstructive Urology), West China Hospital, No. 37, Guoxue Alley, Chengdu, Sichuan Province, China
- Hong Li
- Department of Urology, Institute of Urology (Laboratory of Reconstructive Urology), West China Hospital, No. 37, Guoxue Alley, Chengdu, Sichuan Province, China
- Yi Li
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI, USA.
- Kunjie Wang
- Department of Urology, Institute of Urology (Laboratory of Reconstructive Urology), West China Hospital, No. 37, Guoxue Alley, Chengdu, Sichuan Province, China.
|
23
|
Yang F, Zou Q. mAML: an automated machine learning pipeline with a microbiome repository for human disease classification. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2021; 2020:5862399. [PMID: 32588040 PMCID: PMC7316531 DOI: 10.1093/database/baaa050] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 02/11/2020] [Revised: 05/27/2020] [Accepted: 06/03/2020] [Indexed: 12/20/2022]
Abstract
Due to the concerted efforts to utilize microbial features to improve disease prediction capabilities, automated machine learning (AutoML) systems that remove the tedium of manually performing ML tasks are in great demand. Here we developed mAML, an ML model-building pipeline, which can automatically and rapidly generate optimized and interpretable models for personalized microbiome-based classification tasks in a reproducible way. The pipeline is deployed on a web-based platform, while the server is user-friendly and flexible and has been designed to be scalable according to the specific requirements. This pipeline exhibits high performance for 13 benchmark datasets including both binary and multi-class classification tasks. In addition, to facilitate the application of mAML and expand the human disease-related microbiome learning repository, we developed the GMrepo ML repository (GMrepo Microbiome Learning repository) from the GMrepo database. The repository involves 120 microbiome-based classification tasks for 85 human-disease phenotypes referring to 12,429 metagenomic samples and 38,643 amplicon samples. The mAML pipeline and the GMrepo ML repository are expected to be important resources for research in microbiology and algorithm development. Database URL: http://lab.malab.cn/soft/mAML
Affiliation(s)
- Fenglong Yang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 4, Section 2, North Jianshe Road, Chengdu 610054, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 4, Section 2, North Jianshe Road, Chengdu 610054, China
24
Li H, Ni J, Qing H. Gut Microbiota: Critical Controller and Intervention Target in Brain Aging and Cognitive Impairment. Front Aging Neurosci 2021; 13:671142. [PMID: 34248602 PMCID: PMC8267942 DOI: 10.3389/fnagi.2021.671142] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 02/23/2021] [Accepted: 05/07/2021] [Indexed: 12/12/2022]
Abstract
The current trend for the rapid growth of the global aging population poses substantial challenges for society. The human aging process has been demonstrated to be closely associated with changes in gut microbiota composition, diversity, and functional features. During the first 2 years of life, the gut microbiota undergoes dramatic changes in composition and metabolic functions as it colonizes and develops in the body. Although the gut microbiota is nearly established by the age of three, it continues to mature until adulthood, when it comprises more stable and diverse microbial species. Meanwhile, as the physiological functions of the human body deteriorate with age, possibly as a result of immunosenescence and "inflammaging," the guts of elderly people are generally characterized by an enrichment of pro-inflammatory microbes and a reduced abundance of beneficial species. The gut microbiota affects the development of the brain through a bidirectional communication system, called the brain-gut-microbiota (BGM) axis, and dysregulation of this communication is pivotal in aging-related cognitive impairment. Microbiota-targeted dietary interventions and the intake of probiotics/prebiotics can increase the abundance of beneficial species, boost host immunity, and prevent gut-related diseases. This review summarizes the age-related changes in the human gut microbiota based on recent research developments. Understanding these changes will likely facilitate the design of novel therapeutic strategies to achieve healthy aging.
Affiliation(s)
- Junjun Ni
- Key Laboratory of Molecular Medicine and Biotherapy, School of Life Sciences, Beijing Institute of Technology, Beijing, China
- Hong Qing
- Key Laboratory of Molecular Medicine and Biotherapy, School of Life Sciences, Beijing Institute of Technology, Beijing, China
25
Chen X, Liu L, Zhang W, Yang J, Wong KC. Human host status inference from temporal microbiome changes via recurrent neural networks. Brief Bioinform 2021; 22:6307015. [PMID: 34151933 DOI: 10.1093/bib/bbab223] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/24/2021] [Revised: 04/21/2021] [Accepted: 04/21/2021] [Indexed: 01/04/2023]
Abstract
With the rapid increase in sequencing data, human host status inference (e.g. healthy or sick) from microbiome data has become an important issue. Existing studies are mostly based on single-point microbiome composition, and the host status is rarely predicted from longitudinal microbiome data. However, single-point-based methods cannot capture the dynamic patterns between the temporal changes and host status. Therefore, it remains challenging to build good predictive models and to scale them to different microbiome contexts. Moreover, existing methods mainly target disease prediction and seldom investigate other host statuses. To fill the gap, we propose a comprehensive deep learning-based framework that utilizes longitudinal microbiome data as input to infer the human host status. Specifically, the framework is composed of specific data preparation strategies and a recurrent neural network tailored for longitudinal microbiome data. In experiments, we evaluated the proposed method on both semi-synthetic and real datasets based on different sequencing technologies and metagenomic contexts. The results indicate that our method achieves robust performance compared to other baseline and state-of-the-art classifiers and provides a significant reduction in prediction time.
Affiliation(s)
- Xingjian Chen
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Lingjing Liu
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Weitong Zhang
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Jianyi Yang
- School of Mathematical Sciences, Nankai University, Kowloon, Hong Kong SAR
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
26
Jasner Y, Belogolovski A, Ben-Itzhak M, Koren O, Louzoun Y. Microbiome Preprocessing Machine Learning Pipeline. Front Immunol 2021; 12:677870. [PMID: 34220823 PMCID: PMC8250139 DOI: 10.3389/fimmu.2021.677870] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/08/2021] [Accepted: 05/07/2021] [Indexed: 11/13/2022]
Abstract
Background: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with taxonomic representation. Raw feature counts may not be the optimal representation for ML. Methods: We checked multiple preprocessing steps and tested the optimal combination for 16S sequencing-based classification tasks. We computed the contribution of each step to the accuracy as measured by the Area Under Curve (AUC) of the classification. Results: We show that the log of the feature counts is much more informative than the relative counts. We further show that merging features associated with the same taxonomy at a given level, through a dimension reduction step for each group of bacteria, improves the AUC. Finally, we show that z-scoring has a very limited effect on the results. Conclusions: The preprocessing of microbiome 16S data is crucial for optimal microbiome-based Machine Learning. These preprocessing steps are integrated into the MIPMLP - Microbiome Preprocessing Machine Learning Pipeline, which is available as a stand-alone version at https://github.com/louzounlab/microbiome/tree/master/Preprocess or as a service at http://mip-mlp.math.biu.ac.il/Home. Both contain the code and standard test sets.
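The two preprocessing steps highlighted in this abstract, merging features at a taxonomic level and log-transforming the counts, can be illustrated with a small sketch. This is not the MIPMLP code: the function names, the toy sample, and the pseudocount of 1 are assumptions made for illustration, and the merge below simply sums counts, whereas the abstract describes a dimension reduction step for each group.

```python
import math
from collections import defaultdict


def merge_by_taxonomy(counts, taxonomy, level):
    """Sum the counts of features whose lineage matches up to `level` ranks.

    Summation is a simplification chosen for this sketch; it stands in for
    the per-group dimension reduction step mentioned in the abstract.
    """
    merged = defaultdict(float)
    for feature, count in counts.items():
        ranks = taxonomy[feature].split(";")[:level]
        merged[";".join(ranks)] += count
    return dict(merged)


def log_transform(counts, pseudocount=1.0):
    """Apply log10(count + pseudocount) so that zero counts stay finite.

    The pseudocount value is an assumption, not taken from the paper.
    """
    return {name: math.log10(value + pseudocount) for name, value in counts.items()}


# Hypothetical toy sample: feature IDs, counts, and lineages are made up.
counts = {"otu1": 4, "otu2": 5, "otu3": 99}
taxonomy = {
    "otu1": "Bacteria;Firmicutes;Bacilli",
    "otu2": "Bacteria;Firmicutes;Clostridia",
    "otu3": "Bacteria;Proteobacteria;Gammaproteobacteria",
}

merged = merge_by_taxonomy(counts, taxonomy, level=2)  # merge at phylum level
features = log_transform(merged)
```

In this toy sample the two Firmicutes features collapse into a single phylum-level feature before the log transform is applied.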
Affiliation(s)
- Yoel Jasner
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
- Omry Koren
- Azrieli Faculty of Medicine, Bar-Ilan University, Ramat Gan, Israel
- Yoram Louzoun
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
27
Anyaso-Samuel S, Sachdeva A, Guha S, Datta S. Metagenomic Geolocation Prediction Using an Adaptive Ensemble Classifier. Front Genet 2021; 12:642282. [PMID: 33959149 PMCID: PMC8093763 DOI: 10.3389/fgene.2021.642282] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/15/2020] [Accepted: 03/18/2021] [Indexed: 11/13/2022]
Abstract
Microbiome samples harvested from urban environments can be informative in predicting the geographic location of unknown samples. The idea that different cities may have geographically disparate microbial signatures can be utilized to predict the geographical location based on city-specific microbiome samples. We implemented this idea first by utilizing standard bioinformatics procedures to pre-process the raw metagenomics samples provided by the CAMDA organizers. We trained several component classifiers and a robust ensemble classifier with data generated from taxonomy-dependent and taxonomy-free approaches. Also, we implemented class weighting and an optimal oversampling technique to overcome the class imbalance in the primary data. In each instance, we observed that the component classifiers performed differently, whereas the ensemble classifier consistently yielded optimal performance. Finally, we predicted the source cities of mystery samples provided by the organizers. Our results highlight the unreliability of restricting the classification of metagenomic samples by source origin to a single classification algorithm. By combining several component classifiers via the ensemble approach, we obtained classification results that were as good as the best-performing component classifier.
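Two of the ideas mentioned in this abstract, class weighting for imbalanced data and combining component classifiers, can be sketched in a few lines. The sketch uses the common "balanced" weighting heuristic and a plain majority vote; these are stand-ins chosen for illustration and are not the authors' adaptive ensemble or their oversampling technique. The city labels and predictions are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def majority_vote(predictions):
    """Return the label predicted by most component classifiers.

    Ties go to the label that appears first in `predictions`.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical training labels with a 3:1 class imbalance.
labels = ["NYC", "NYC", "NYC", "Tokyo"]
weights = balanced_class_weights(labels)

# Hypothetical predictions from three component classifiers for one sample.
ensemble_label = majority_vote(["Tokyo", "NYC", "Tokyo"])
```

The rarer class receives the larger weight, so errors on it cost more during training; the vote then returns whichever label most component classifiers agree on.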
Affiliation(s)
- Samuel Anyaso-Samuel
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
- Archie Sachdeva
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
- Subharup Guha
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
- Somnath Datta
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
28
Marcos-Zambrano LJ, Karaduzovic-Hadziabdic K, Loncar Turukalo T, Przymus P, Trajkovik V, Aasmets O, Berland M, Gruca A, Hasic J, Hron K, Klammsteiner T, Kolev M, Lahti L, Lopes MB, Moreno V, Naskinova I, Org E, Paciência I, Papoutsoglou G, Shigdel R, Stres B, Vilne B, Yousef M, Zdravevski E, Tsamardinos I, Carrillo de Santa Pau E, Claesson MJ, Moreno-Indias I, Truu J. Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment. Front Microbiol 2021; 12:634511. [PMID: 33737920 PMCID: PMC7962872 DOI: 10.3389/fmicb.2021.634511] [Citation(s) in RCA: 126] [Impact Index Per Article: 42.0] [Received: 11/27/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022]
Abstract
The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.
Affiliation(s)
- Laura Judith Marcos-Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Piotr Przymus
- Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Vladimir Trajkovik
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Oliver Aasmets
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Magali Berland
- Université Paris-Saclay, INRAE, MGP, Jouy-en-Josas, France
- Aleksandra Gruca
- Department of Computer Networks and Systems, Silesian University of Technology, Gliwice, Poland
- Jasminka Hasic
- University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Karel Hron
- Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
- Mikhail Kolev
- South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
- Marta B. Lopes
- NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
- Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
- Victor Moreno
- Oncology Data Analytics Program, Catalan Institute of Oncology (ICO), Barcelona, Spain
- Colorectal Cancer Group, Institut de Recerca Biomedica de Bellvitge (IDIBELL), Barcelona, Spain
- Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
- Department of Clinical Sciences, Faculty of Medicine, University of Barcelona, Barcelona, Spain
- Irina Naskinova
- South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Elin Org
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Inês Paciência
- EPIUnit – Instituto de Saúde Pública da Universidade do Porto, Porto, Portugal
- Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway
- Blaz Stres
- Group for Microbiology and Microbial Biotechnology, Department of Animal Science, University of Ljubljana, Ljubljana, Slovenia
- Baiba Vilne
- Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
- Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
- Eftim Zdravevski
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Marcus J. Claesson
- School of Microbiology & APC Microbiome Ireland, University College Cork, Cork, Ireland
- Isabel Moreno-Indias
- Unidad de Gestión Clínica de Endocrinología y Nutrición, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
- Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
- Jaak Truu
- Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
29
Ghannam RB, Techtmann SM. Machine learning applications in microbial ecology, human microbiome studies, and environmental monitoring. Comput Struct Biotechnol J 2021; 19:1092-1107. [PMID: 33680353 PMCID: PMC7892807 DOI: 10.1016/j.csbj.2021.01.028] [Citation(s) in RCA: 89] [Impact Index Per Article: 29.7] [Received: 10/02/2020] [Revised: 01/16/2021] [Accepted: 01/18/2021] [Indexed: 01/04/2023]
Abstract
Advances in nucleic acid sequencing technology have enabled expansion of our ability to profile microbial diversity. These large datasets of taxonomic and functional diversity are key to better understanding microbial ecology. Machine learning has proven to be a useful approach for analyzing microbial community data and making predictions about outcomes including human and environmental health. Machine learning applied to microbial community profiles has been used to predict disease states in human health, environmental quality and presence of contamination in the environment, and as trace evidence in forensics. Machine learning has appeal as a powerful tool that can provide deep insights into microbial communities and identify patterns in microbial community data. However, often machine learning models can be used as black boxes to predict a specific outcome, with little understanding of how the models arrived at predictions. Complex machine learning algorithms often may value higher accuracy and performance at the sacrifice of interpretability. In order to leverage machine learning into more translational research related to the microbiome and strengthen our ability to extract meaningful biological information, it is important for models to be interpretable. Here we review current trends in machine learning applications in microbial ecology as well as some of the important challenges and opportunities for more broad application of machine learning to understanding microbial communities.
Key Words
- 16S rRNA
- ANN, Artificial Neural Networks
- ASV, Amplicon Sequence Variant
- AUC, Area Under the Curve
- Forensics
- GB, Gradient Boosting
- ML, Machine Learning
- Machine learning
- Marker genes
- Metagenomics
- PCoA, Principal Coordinate Analysis
- RF, Random Forests
- ROC, Receiver Operating Characteristic
- SML, Supervised Machine Learning
- SVM, Support Vector Machines
- USML, Unsupervised Machine Learning
- tSNE, t-distributed Stochastic Neighbor Embedding
Affiliation(s)
- Ryan B. Ghannam
- Department of Biological Sciences, Michigan Technological University, Houghton MI, United States
- Stephen M. Techtmann
- Department of Biological Sciences, Michigan Technological University, Houghton MI, United States
30
Reiman D, Farhat AM, Dai Y. Predicting Host Phenotype Based on Gut Microbiome Using a Convolutional Neural Network Approach. Methods Mol Biol 2021; 2190:249-266. [PMID: 32804370 DOI: 10.1007/978-1-0716-0826-5_12] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 06/11/2023]
Abstract
Accurate prediction of the host phenotypes from a microbial sample and identification of the associated microbial markers are important in understanding the impact of the microbiome on the pathogenesis and progression of various diseases within the host. A deep learning tool, PopPhy-CNN, has been developed for the task of predicting host phenotypes using a convolutional neural network (CNN). By representing samples as annotated taxonomic trees and further representing these trees as matrices, PopPhy-CNN utilizes the CNN's innate ability to explore locally similar microbes on the taxonomic tree. Furthermore, PopPhy-CNN can be used to evaluate the importance of each taxon in the prediction of host status. Here, we describe the underlying methodology, architecture, and core utility of PopPhy-CNN. We also demonstrate the use of PopPhy-CNN on a microbial dataset.
Affiliation(s)
- Derek Reiman
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
- Ali M Farhat
- College of Medicine, University of Illinois at Chicago, Chicago, IL, USA
- Yang Dai
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA.
31
Abstract
This article aims to provide a thorough overview of the use of Artificial Intelligence (AI) techniques in studying the gut microbiota and its role in the diagnosis and treatment of some important diseases. The association between microbiota and diseases, together with its clinical relevance, is still difficult to interpret. The advances in AI techniques, such as Machine Learning (ML) and Deep Learning (DL), can help clinicians in processing and interpreting these massive data sets. Two research groups have been involved in this Scoping Review, working in two different areas of Europe: Florence and Sarajevo. The papers included in the review describe the use of ML or DL methods applied to the study of human gut microbiota. In total, 1109 papers were considered in this study. After elimination, a final set of 16 articles was considered in the scoping review. Different AI techniques were applied in the reviewed papers. Some papers applied ML, while others applied DL techniques. Eleven papers evaluated only ML algorithms (ranging from one to eight algorithms applied to one dataset). The remaining five papers examined both ML and DL algorithms. The most frequently applied ML algorithm was Random Forest, which also exhibited the best performance.
32
Pérez-Cobas AE, Gomez-Valero L, Buchrieser C. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses. Microb Genom 2020; 6:mgen000409. [PMID: 32706331 PMCID: PMC7641418 DOI: 10.1099/mgen.0.000409] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 12/18/2019] [Accepted: 06/30/2020] [Indexed: 12/23/2022]
Abstract
Metagenomics and marker gene approaches, coupled with high-throughput sequencing technologies, have revolutionized the field of microbial ecology. Metagenomics is a culture-independent method that allows the identification and characterization of organisms from all kinds of samples. Whole-genome shotgun sequencing analyses the total DNA of a chosen sample to determine the presence of micro-organisms from all domains of life and their genomic content. Importantly, the whole-genome shotgun sequencing approach reveals the genomic diversity present, but can also give insights into the functional potential of the micro-organisms identified. The marker gene approach is based on the sequencing of a specific gene region. It allows one to describe the microbial composition based on the taxonomic groups present in the sample. It is frequently used to analyse the biodiversity of microbial ecosystems. Despite its importance, the analysis of metagenomic sequencing and marker gene data is quite a challenge. Here we review the primary workflows and software used for both approaches and discuss the current challenges in the field.
Affiliation(s)
- Ana Elena Pérez-Cobas
- Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
- Laura Gomez-Valero
- Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
- Carmen Buchrieser
- Institut Pasteur, Biologie des Bactéries Intracellulaires, Paris, France and CNRS UMR 3525, 675724, Paris, France
33
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach.
Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
- Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States.
34
Reiman D, Metwally AA, Sun J, Dai Y. PopPhy-CNN: A Phylogenetic Tree Embedded Architecture for Convolutional Neural Networks to Predict Host Phenotype From Metagenomic Data. IEEE J Biomed Health Inform 2020; 24:2993-3001. [PMID: 32396115 DOI: 10.1109/jbhi.2020.2993761] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Indexed: 11/05/2022]
Abstract
Accurate prediction of the host phenotype from a metagenomic sample and identification of the associated microbial markers are important in understanding potential host-microbiome interactions related to disease initiation and progression. We introduce PopPhy-CNN, a novel convolutional neural network (CNN) learning framework that effectively exploits phylogenetic structure in microbial taxa for host phenotype prediction. Our approach takes an input format of a 2D matrix representing the phylogenetic tree populated with the relative abundance of microbial taxa in a metagenomic sample. This conversion empowers CNNs to explore the spatial relationship of the taxonomic annotations on the tree and their quantitative characteristics in metagenomic data. We show the competitiveness of our model compared to other available methods using nine metagenomic datasets of moderate size for binary classification. With synthetic and biological datasets, we show the superior and robust performance of our model for multi-class classification. Furthermore, we design a novel scheme for feature extraction from the learned CNN models and demonstrate improved performance when using the extracted features. PopPhy-CNN is a practical deep learning framework for the prediction of host phenotype that can facilitate the retrieval of predictive microbial taxa.
35
Hooven TA, Lin AYC, Salleb-Aouissi A. Multiple Instance Learning for Predicting Necrotizing Enterocolitis in Premature Infants Using Microbiome Data. Proceedings of the ACM Conference on Health, Inference, and Learning 2020; 2020:99-109. [PMID: 34318306 PMCID: PMC8313028 DOI: 10.1145/3368555.3384466] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/05/2023]
Abstract
Necrotizing enterocolitis (NEC) is a life-threatening intestinal disease that primarily affects preterm infants during their first weeks after birth. Mortality rates associated with NEC are 15-30%, and surviving infants are susceptible to multiple serious, long-term complications. The disease is sporadic and, with currently available tools, unpredictable. We are creating an early warning system that uses stool microbiome features, combined with clinical and demographic information, to identify infants at high risk of developing NEC. Our approach uses a multiple instance learning, neural network-based system that could be used to generate daily or weekly NEC predictions for premature infants. The approach was selected to effectively utilize sparse and weakly annotated datasets characteristic of stool microbiome analysis. Here we describe initial validation of our system, using clinical and microbiome data from a nested case-control study of 161 preterm infants. We show receiver-operator curve areas above 0.9, with 75% of dominant predictive samples for NEC-affected infants identified at least 24 hours prior to disease onset. Our results pave the way for development of a real-time early warning system for NEC using a limited set of basic clinical and demographic details combined with stool microbiome data.
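The multiple instance learning setup described in this abstract treats each infant as a "bag" of stool samples, where only the bag carries a label. A minimal sketch of the standard max-pooling assumption follows: a bag is scored by its highest-scoring instance, which also identifies a dominant sample. The per-sample scores, the 0.5 threshold, and the function name are hypothetical, and the sketch does not reproduce the authors' neural network.

```python
def bag_prediction(instance_scores, threshold=0.5):
    """Score a bag by its highest-scoring instance (max pooling).

    Returns the bag score, the index of the dominant instance, and whether
    the bag score reaches the (hypothetical) decision threshold.
    """
    dominant_index = max(range(len(instance_scores)), key=instance_scores.__getitem__)
    bag_score = instance_scores[dominant_index]
    return bag_score, dominant_index, bag_score >= threshold


# Hypothetical per-sample risk scores for one infant, in collection order.
scores = [0.10, 0.25, 0.80, 0.40]
bag_score, dominant_index, flagged = bag_prediction(scores)
```

Under this assumption a single high-risk sample is enough to flag the whole bag, which suits weakly annotated data where individual samples are unlabeled.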
36
A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science. Unsupervised and Semi-Supervised Learning 2020. [DOI: 10.1007/978-3-030-22475-2_1] [Citation(s) in RCA: 88] [Impact Index Per Article: 22.0] [Indexed: 02/06/2023]
37
Taxonomy dimension reduction for colorectal cancer prediction. Comput Biol Chem 2019; 83:107160. [DOI: 10.1016/j.compbiolchem.2019.107160] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 06/24/2019] [Revised: 11/02/2019] [Accepted: 11/04/2019] [Indexed: 02/01/2023]
38
LaPierre N, Ju CJT, Zhou G, Wang W. MetaPheno: A critical evaluation of deep learning and machine learning in metagenome-based disease prediction. Methods 2019; 166:74-82. [PMID: 30885720 PMCID: PMC6708502 DOI: 10.1016/j.ymeth.2019.03.003] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 12/01/2018] [Revised: 02/14/2019] [Accepted: 03/04/2019] [Indexed: 01/21/2023]
Abstract
The human microbiome plays a number of critical roles, impacting almost every aspect of human health and well-being. Conditions in the microbiome have been linked to a number of significant diseases. Additionally, revolutions in sequencing technology have led to a rapid increase in publicly-available sequencing data. Consequently, there have been growing efforts to predict disease status from metagenomic sequencing data, with a proliferation of new approaches in the last few years. Some of these efforts have explored utilizing a powerful form of machine learning called deep learning, which has been applied successfully in several biological domains. Here, we review some of these methods and the algorithms that they are based on, with a particular focus on deep learning methods. We also perform a deeper analysis of Type 2 Diabetes and obesity datasets that have eluded improved results, using a variety of machine learning and feature extraction methods. We conclude by offering perspectives on study design considerations that may impact results and future directions the field can take to improve results and offer more valuable conclusions. The scripts and extracted features for the analyses conducted in this paper are available via GitHub: https://github.com/nlapier2/metapheno.
Affiliation(s)
- Nathan LaPierre
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
| | - Chelsea J-T Ju
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
| | - Guangyu Zhou
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
| | - Wei Wang
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA.
| |
|
39
|
Cruz AF, Barka GD, Blum LEB, Tanaka T, Ono N, Kanaya S, Reineke A. Evaluation of microbial communities in peels of Brazilian tropical fruits by amplicon sequence analysis. Braz J Microbiol 2019; 50:739-748. [PMID: 31073985 PMCID: PMC6863208 DOI: 10.1007/s42770-019-00088-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 09/26/2018] [Accepted: 03/20/2019] [Indexed: 10/26/2022] Open
Abstract
Elucidation of the distinctive microbial taxonomic profiles of tropical fruit peels is an indispensable component of investigations aimed at detecting the microorganisms responsible for post-harvest loss. The objective of the present work was to dissect the bacterial and fungal communities of five tropical fruit peels (banana, guava, mango, papaya, and passion fruit) in wild (non-cultivated) and conventionally produced samples from Brazil. To that end, 16S rRNA-encoding gene and ITS rDNA amplicon analyses of the five tropical fruit peels were performed to discriminate the bacterial and fungal communities, respectively. The results showed that the bacterial communities of the five types of fruit peels were far more diverse than the fungal communities, independent of the type of production system involved. Among the investigated fruits, non-cultivated papaya peels hosted the most diverse bacterial community, while the lowest bacterial community diversity was found in conventionally produced papaya peels. The gene amplicon analysis clearly discriminated the bacterial communities into their respective classes, while the fungal communities were better classified at the phylum level, yet showed clearer discrimination based on the type of cultivation system practiced. Conventionally produced banana and non-cultivated passion fruit peels were characteristically dominated by fungal and bacterial groups, respectively. Overall, in conventionally produced fruit peels, the bacterial community was mainly composed of Proteobacteria, Actinobacteria, and Bacilli. The results provided a broad microbial diversity profile that could be used as an important input for seeking alternative fruit spoilage control and post-harvest treatments.
Affiliation(s)
- André Freire Cruz
- Graduate School of Life and Environmental Sciences, Kyoto Prefectural University, Kyoto, Japan
| | - Geleta Dugassa Barka
- Applied Biology Department, Adama Science and Technology University, Adama, Oromia, Ethiopia.
| | - Tetsushi Tanaka
- Division of Information Science, Nara Institute of Science and Technology, Nara, Japan
| | - Naoaki Ono
- Division of Information Science, Nara Institute of Science and Technology, Nara, Japan
| | - Shigehiko Kanaya
- Division of Information Science, Nara Institute of Science and Technology, Nara, Japan
| | - Annette Reineke
- Department of Crop Protection, Geisenheim University, Geisenheim, Germany
| |
|
40
|
Zhou YH, Gallins P. A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction. Front Genet 2019; 10:579. [PMID: 31293616 PMCID: PMC6603228 DOI: 10.3389/fgene.2019.00579] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 01/13/2019] [Accepted: 06/04/2019] [Indexed: 12/19/2022] Open
Abstract
With the growing importance of microbiome research, there is increasing evidence that host variation in microbial communities is associated with overall host health. Advancement in genetic sequencing methods for microbiomes has coincided with improvements in machine learning, with important implications for disease risk prediction in humans. One aspect specific to microbiome prediction is the use of taxonomy-informed feature selection. In this review for non-experts, we explore the most commonly used machine learning methods, and evaluate their prediction accuracy as applied to microbiome host trait prediction. Methods are described at an introductory level, and R/Python code for the analyses is provided.
Affiliation(s)
- Yi-Hui Zhou
- Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
| | - Paul Gallins
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
| |
|
41
|
Qu K, Guo F, Liu X, Lin Y, Zou Q. Application of Machine Learning in Microbiology. Front Microbiol 2019; 10:827. [PMID: 31057526 PMCID: PMC6482238 DOI: 10.3389/fmicb.2019.00827] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 01/31/2019] [Accepted: 04/01/2019] [Indexed: 02/01/2023] Open
Abstract
Microorganisms are ubiquitous and closely related to people's daily lives. Since they were first discovered in the 19th century, researchers have shown great interest in microorganisms. Microorganisms have traditionally been studied through cultivation, but this method is expensive and time-consuming. Moreover, cultivation cannot keep pace with the development of high-throughput sequencing technology. To deal with this problem, machine learning (ML) methods have been widely applied to the field of microbiology. Literature reviews have shown that ML can be used in many aspects of microbiology research, especially classification problems, and for exploring the interaction between microorganisms and the surrounding environment. In this study, we summarize the application of ML in microbiology.
Affiliation(s)
- Kaiyang Qu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Xiangrong Liu
- School of Information Science and Technology, Xiamen University, Xiamen, China
| | - Yuan Lin
- School of Information Science and Technology, Xiamen University, Xiamen, China
- Department of System Integration, Sparebanken Vest, Bergen, Norway
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|