1
Syama K, Jothi JAA, Khanna N. Automatic disease prediction from human gut metagenomic data using boosting GraphSAGE. BMC Bioinformatics 2023; 24:126. [PMID: 37003965 PMCID: PMC10067187 DOI: 10.1186/s12859-023-05251-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2023] [Accepted: 03/23/2023] [Indexed: 04/03/2023]
Abstract
BACKGROUND: The human microbiome plays a critical role in maintaining human health. Owing to recent advances in high-throughput sequencing technologies, human microbiome profiles have become publicly available, and many studies have analyzed them. These studies have found that microbiome profiles differ between healthy and sick individuals across different diseases. Recently, several computational methods have used microbiome profiles to automatically diagnose and classify the host phenotype. RESULTS: In this work, a novel deep learning framework based on boosting GraphSAGE is proposed for automatic prediction of diseases from metagenomic data. The proposed framework has two main components: (a) a Metagenomic Disease graph (MD-graph) construction module and (b) a Disease Prediction Network (DP-Net) module. The graph construction module builds a graph in which each metagenomic sample is a node, and captures the relationships between samples using a proximity measure. The DP-Net consists of a boosting GraphSAGE model that predicts the status of a sample as sick or healthy. The effectiveness of the proposed method is verified using real and synthetic datasets corresponding to diseases such as inflammatory bowel disease and colorectal cancer. The proposed model achieved its highest AUC of 93%, accuracy of 95%, F1-score of 95% and AUPRC of 95% on the real inflammatory bowel disease dataset, and its best AUC of 90%, accuracy of 91%, F1-score of 87% and AUPRC of 93% on the real colorectal cancer dataset. CONCLUSION: The proposed framework outperforms other machine learning and deep learning models in terms of classification accuracy, AUC, F1-score and AUPRC for both synthetic and real metagenomic data.
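The graph construction step described in this abstract (one node per metagenomic sample, edges derived from a proximity measure) might be sketched as follows. The abstract does not say which proximity measure or edge rule the authors use, so Euclidean distance and a k-nearest-neighbour rule are assumptions made purely for illustration, and `knn_graph` is a hypothetical name rather than part of the published method.

```python
import math


def knn_graph(samples, k=2):
    """Build an undirected k-nearest-neighbour graph over samples.

    samples: list of equal-length abundance vectors, one per sample.
    Assumption: Euclidean distance stands in for the unspecified
    proximity measure mentioned in the abstract.
    Returns a sorted list of undirected edges (i, j) with i < j.
    """
    n = len(samples)
    edges = set()
    for i in range(n):
        # Distance from sample i to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(samples[i], samples[j]), j) for j in range(n) if j != i
        )
        # Connect sample i to its k closest samples.
        for _, j in neighbours[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Toy abundance profiles: two tight pairs of samples.
toy_samples = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
toy_edges = knn_graph(toy_samples, k=1)
```

A graph neural network such as GraphSAGE would then aggregate features over these edges; that part is not sketched here.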
Affiliation(s)
- K Syama
- Department of Computer Science, Birla Institute of Technology and Science Pilani Dubai Campus, Dubai International Academic City, Dubai, UAE
- J Angel Arul Jothi
- Department of Computer Science, Birla Institute of Technology and Science Pilani Dubai Campus, Dubai International Academic City, Dubai, UAE
2
Wang C, Zhang J, Veldsman WP, Zhou X, Zhang L. A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants. Brief Bioinform 2023; 24:6965909. [PMID: 36585786 DOI: 10.1093/bib/bbac552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2022] [Revised: 11/04/2022] [Accepted: 11/14/2022] [Indexed: 01/01/2023]
Abstract
Quantifying an individual's risk for common diseases is an important goal of precision health. The polygenic risk score (PRS), which aggregates multiple risk alleles of candidate diseases, has emerged as a standard approach for identifying high-risk individuals. Although several studies have benchmarked PRS calculation tools and assessed their potential to guide future clinical applications, some issues remain to be investigated, such as the lack of (i) diverse simulated data with different genetic effects, (ii) evaluation of machine learning models and (iii) evaluation in multi-ancestry studies. In this study, we systematically validated and compared 13 statistical methods, 5 machine learning models and 2 ensemble models using simulated data with additive and genetic interaction models, 22 common diseases with internal training sets, 4 common diseases with external summary statistics and 3 common diseases for trans-ancestry studies in UK Biobank. The statistical methods performed better on simulated data from additive models, whereas machine learning models had an edge on data that include genetic interactions. Ensemble models, which integrate various statistical methods, were generally the best choice. LDpred2 outperformed the other standalone tools, whereas PRS-CS, lassosum and DBSLMM showed comparable performance. We also found that disease heritability strongly affected the predictive performance of all methods. Both the number and the effect sizes of risk SNPs are important, and sample size strongly influences the performance of all methods. For the trans-ancestry studies, we found that the performance of most methods worsened when training and testing sets came from different populations.
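As a minimal illustration of what "aggregating multiple risk alleles" can mean, the sketch below computes a PRS in its common additive form, a weighted sum of risk-allele dosages. The abstract does not give a formula, so the additive form is an assumption here; the benchmarked tools (LDpred2, PRS-CS, lassosum and others) differ mainly in how the per-SNP weights are estimated, which this sketch does not attempt. The function name and the toy numbers are hypothetical.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Additive PRS: sum over SNPs of effect size times allele dosage.

    dosages: risk-allele counts per SNP for one individual (0, 1 or 2).
    effect_sizes: per-SNP weights, e.g. log odds ratios from a GWAS.
    Assumption: a purely additive model with no interaction terms.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have equal length")
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))


# Toy individual with three SNPs (illustrative numbers only).
toy_score = polygenic_risk_score([0, 1, 2], [0.5, -0.2, 0.1])
```

Because the score is linear in the dosages, it cannot represent the genetic interactions on which, according to the abstract, machine learning models had an edge.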
Affiliation(s)
- Chonghao Wang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
- Jing Zhang
- Eye Institute and Department of Ophthalmology, NHC Key Laboratory of Myopia (Fudan University), Eye & ENT Hospital, Fudan University, Shanghai, China
- Xin Zhou
- Department of Biomedical Engineering, Vanderbilt University, Vanderbilt Place Nashville, 37235, TN, USA
- Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
- Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
3
Multimodal deep learning applied to classify healthy and disease states of human microbiome. Sci Rep 2022; 12:824. [PMID: 35039534 PMCID: PMC8763943 DOI: 10.1038/s41598-022-04773-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/11/2021] [Accepted: 12/30/2021] [Indexed: 12/22/2022]
Abstract
Metagenomic sequencing methods provide considerable genomic information regarding human microbiomes, enabling us to discover and understand microbial diseases. Compositional differences have been reported between patients and healthy people, which could be used in the diagnosis of patients. Despite significant progress in this regard, the accuracy of these tools needs to be improved for applications in diagnostics and therapeutics. MDL4Microbiome, the method developed herein, demonstrated high accuracy in predicting disease status by using various features from metagenome sequences and a multimodal deep learning model. We propose combining three different features, i.e., conventional taxonomic profiles, genome-level relative abundance, and metabolic functional characteristics, to enhance classification accuracy. This deep learning model enabled the construction of a classifier that combines these various modalities encoded in the human microbiome. We achieved accuracies of 0.98, 0.76, 0.84, and 0.97 for predicting patients with inflammatory bowel disease, type 2 diabetes, liver cirrhosis, and colorectal cancer, respectively; these are comparable or higher than classical machine learning methods. A deeper analysis was also performed on the resulting sets of selected features to understand the contribution of their different characteristics. MDL4Microbiome is a classifier with higher or comparable accuracy compared with other machine learning methods, which offers perspectives on feature generation with metagenome sequences in deep learning models and their advantages in the classification of host disease status.
4
Seneviratne CJ, Balan P, Suriyanarayanan T, Lakshmanan M, Lee DY, Rho M, Jakubovics N, Brandt B, Crielaard W, Zaura E. Oral microbiome-systemic link studies: perspectives on current limitations and future artificial intelligence-based approaches. Crit Rev Microbiol 2020; 46:288-299. [PMID: 32434436 DOI: 10.1080/1040841x.2020.1766414] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/16/2022]
Abstract
In the past decade, there has been a tremendous increase in studies on the link between oral microbiome and systemic diseases. However, variations in study design and confounding variables across studies often lead to inconsistent observations. In this narrative review, we have discussed the potential influence of study design and confounding variables on the current sequencing-based oral microbiome-systemic disease link studies. The current limitations of oral microbiome-systemic link studies on type 2 diabetes mellitus, rheumatoid arthritis, pregnancy, atherosclerosis, and pancreatic cancer are discussed in this review, followed by our perspective on how artificial intelligence (AI), particularly machine learning and deep learning approaches, can be employed for predicting systemic disease and host metadata from the oral microbiome. The application of AI for predicting systemic disease as well as host metadata requires the establishment of a global database repository with microbiome sequences and annotated host metadata. However, this task requires collective efforts from researchers working in the field of oral microbiome to establish more comprehensive datasets with appropriate host metadata. Development of AI-based models by incorporating consistent host metadata will allow prediction of systemic diseases with higher accuracies, bringing considerable clinical benefits.
Affiliation(s)
- Chaminda Jayampath Seneviratne
- Singapore Oral Microbiomics Initiative (SOMI), National Dental Research Institute Singapore, National Dental Centre Singapore, Duke NUS Medical School, Singapore, Singapore
- Preethi Balan
- Singapore Oral Microbiomics Initiative (SOMI), National Dental Research Institute Singapore, National Dental Centre Singapore, Duke NUS Medical School, Singapore, Singapore
- Tanujaa Suriyanarayanan
- Singapore Oral Microbiomics Initiative (SOMI), National Dental Research Institute Singapore, National Dental Centre Singapore, Duke NUS Medical School, Singapore, Singapore
- Meiyappan Lakshmanan
- Bioprocessing Technology Institute (BTI), A*STAR - Agency for Science, Technology and Research, Singapore, Singapore
- Dong-Yup Lee
- Bioprocessing Technology Institute (BTI), A*STAR - Agency for Science, Technology and Research, Singapore, Singapore; School of Chemical Engineering, Sungkyunkwan University, Jongno-gu, Republic of Korea
- Mina Rho
- Departments of Computer Science and Engineering & Biomedical Informatics, Hanyang University, Seoul, Korea
- Nicholas Jakubovics
- Oral Biology, School of Dental Sciences, Newcastle University, Newcastle upon Tyne, UK
- Bernd Brandt
- Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, University of Amsterdam and Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Wim Crielaard
- Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, University of Amsterdam and Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Egija Zaura
- Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, University of Amsterdam and Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
5
Masoodi I, Alshanqeeti AS, Alyamani EJ, AlLehibi AA, Alqutub AN, Alsayari KN, Alomair AO. Microbial dysbiosis in irritable bowel syndrome: A single-center metagenomic study in Saudi Arabia. JGH Open 2020; 4:649-655. [PMID: 32782952 PMCID: PMC7411548 DOI: 10.1002/jgh3.12313] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/05/2019] [Revised: 12/18/2019] [Accepted: 01/23/2020] [Indexed: 12/13/2022]
Abstract
Background: The focus of this study was to explore potential differences in colonic mucosal microbiota in irritable bowel syndrome (IBS) patients compared to a control group utilizing a metagenomic study. Methods: Mucosal microbiota samples were collected from each IBS patient during colonoscopy by jet‐flushing the colonic mucosa in unified segments of the colon with distilled water, followed by aspiration. All the purified dsDNA was extracted and quantified before metagenomic sequencing using an Illumina platform. An equal number of healthy age‐matched controls were also examined for colonic mucosal microbiota, which were obtained during screening colonoscopies. Results: The microbiota data on 50 IBS patients (31 females), with a mean age of 43.94 ± 14.50 (range 19–65), were analyzed in comparison to 50 controls. Satisfactory DNA samples were subjected to metagenomics study, followed by comprehensive comparative phylogenetic analysis. Metagenomics analysis was carried out, and 3.58G reads were sequenced. Community richness (Chao) and microbial structure in IBS patients were shown to be significantly different from those in the control group. Significant enrichment of Oxalobacter formigenes, Sutterella wadsworthensis, and Bacteroides pectinophilus was observed in controls, whereas significant enrichment of Collinsella aerofaciens, Gemella morbillorum, and Veillonella parvula Actinobacteria was observed in the IBS cohort. Conclusion: The current study has demonstrated significant differences in the microbiota of IBS patients compared to controls.
Affiliation(s)
- Ali S Alshanqeeti
- National Blood & Cancer Center, Riyadh, Saudi Arabia
- Essam J Alyamani
- National Center for Biotechnology, King Abdulaziz City for Science and Technology (KACST), Riyadh, Saudi Arabia
- Abed A AlLehibi
- Gastroenterology and Hepatology Department, King Fahad Medical City, Riyadh, Saudi Arabia
- Adel N Alqutub
- Gastroenterology and Hepatology Department, King Fahad Medical City, Riyadh, Saudi Arabia
- Khalid N Alsayari
- Gastroenterology and Hepatology Department, King Fahad Medical City, Riyadh, Saudi Arabia
- Ahmed O Alomair
- Gastroenterology and Hepatology Department, King Fahad Medical City, Riyadh, Saudi Arabia
6
Genome-wide analysis of H3K36me3 and its regulations to cancer-related genes expression in human cell lines. Biosystems 2018; 171:59-65. [DOI: 10.1016/j.biosystems.2018.07.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 03/06/2018] [Revised: 07/01/2018] [Accepted: 07/09/2018] [Indexed: 01/11/2023]
7
Asgari E, Garakani K, McHardy AC, Mofrad MRK. MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples. Bioinformatics 2018; 34:i32-i42. [PMID: 29950008 PMCID: PMC6022683 DOI: 10.1093/bioinformatics/bty296] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Indexed: 02/07/2023]
Abstract
Motivation: Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis with proven applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes. Results: A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, k-mer features in shallow sub-samples (i) allow skipping the computationally costly sequence alignments required in OTU-picking and (ii) provide a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87, respectively. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine. Availability and implementation: The software and datasets are available at https://llp.berkeley.edu/micropheno. Supplementary information: Supplementary data are available at Bioinformatics online.
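The idea of a k-mer representation of a shallow sub-sample can be illustrated with a short sketch: draw a small random subset of reads, count every overlapping k-mer, and normalize the counts into a fixed-length frequency vector that needs no alignment or reference. This is not the MicroPheno implementation; the function name, the default `k`, the sampling scheme and the handling of ambiguous bases are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product


def kmer_profile(reads, k=3, sample_size=None, seed=0):
    """Normalized k-mer frequency vector of a (sub-sampled) set of reads.

    reads: list of DNA strings.
    sample_size: number of reads to draw at random; None uses all reads.
    Assumption: k-mers containing characters other than A, C, G, T
    are skipped.
    Returns a list of 4**k frequencies in lexicographic k-mer order.
    """
    rng = random.Random(seed)
    if sample_size is not None and sample_size < len(reads):
        # Shallow sub-sample: keep only a random subset of the reads.
        reads = rng.sample(reads, sample_size)

    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of width k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1

    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Two toy reads, 2-mers: AC, CG, GT and AC, CG, GA.
toy_profile = kmer_profile(["ACGT", "ACGA"], k=2)
```

The resulting vector has the same length for every sample, so it can be passed directly to a classifier such as a random forest or a neural network.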
Affiliation(s)
- Ehsaneddin Asgari
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, USA
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Kiavash Garakani
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, USA
- Alice C McHardy
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Mohammad R K Mofrad
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, USA
- Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Lab, Berkeley, CA, USA