1
Han W, Tang H, Ye Y. Locality-Sensitive Hashing-Based k-Mer Clustering for Identification of Differential Microbial Markers Related to Host Phenotype. J Comput Biol 2022; 29:738-751. [PMID: 35584271 PMCID: PMC9464365 DOI: 10.1089/cmb.2021.0640] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/01/2022]
Abstract
Microbial organisms play important roles in many aspects of human health and diseases. Encouraged by the numerous studies that show the association between microbiomes and human diseases, computational and machine learning methods have been recently developed to generate and utilize microbiome features for prediction of host phenotypes such as disease versus healthy, and cancer immunotherapy responder versus nonresponder. We have previously developed a subtractive assembly approach, which focuses on extraction and assembly of differential reads from metagenomic data sets that are likely sampled from differential genomes or genes between two groups of microbiome data sets (e.g., healthy vs. disease). In this article, we further improved our subtractive assembly approach by utilizing groups of k-mers with similar abundance profiles across multiple samples. We implemented a locality-sensitive hashing (LSH)-enabled approach (called kmerLSHSA) to group billions of k-mers into k-mer coabundance groups (kCAGs), which were subsequently used for the retrieval of differential kCAGs for subtractive assembly. Testing of the kmerLSHSA approach on simulated data sets and real microbiome data sets showed that, compared with the conventional approach that utilizes all genes, our approach can quickly identify differential genes that can be used for building promising predictive models for microbiome-based host phenotype prediction. We also discussed other potential applications of LSH-enabled clustering of k-mers according to their abundance profiles across multiple microbiome samples.
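The idea of hashing k-mers by abundance profile can be illustrated with a minimal sketch. The abstract does not say which hash family kmerLSHSA uses, so the code below assumes random-hyperplane (sign) hashing on mean-centered profiles, which is one common LSH scheme and not necessarily the authors' method. The function names `simhash_signature` and `group_kmers`, the parameter `n_bits`, and the toy k-mers are all hypothetical.

```python
import random
from collections import defaultdict


def simhash_signature(profile, hyperplanes):
    """Return one sign bit per random hyperplane for a mean-centered profile."""
    mean = sum(profile) / len(profile)
    centered = [x - mean for x in profile]
    return tuple(
        1 if sum(c * h for c, h in zip(centered, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def group_kmers(abundance, n_bits=8, seed=0):
    """Bucket k-mers whose abundance profiles hash to the same signature.

    abundance maps each k-mer to its counts across samples (same sample
    order for every k-mer). Assumption: random-hyperplane hashing stands
    in for whatever LSH family the published tool actually uses.
    """
    rng = random.Random(seed)
    n_samples = len(next(iter(abundance.values())))
    hyperplanes = [
        [rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_bits)
    ]
    buckets = defaultdict(list)
    for kmer, profile in abundance.items():
        buckets[simhash_signature(profile, hyperplanes)].append(kmer)
    return list(buckets.values())


# Toy input: the first two k-mers rise together across samples,
# the third follows the opposite trend.
groups = group_kmers({
    "AAA": [1, 2, 3, 4],
    "AAC": [2, 4, 6, 8],
    "GGG": [4, 3, 2, 1],
})
```

Each k-mer is hashed once, so the grouping cost grows linearly with the number of k-mers rather than with the number of k-mer pairs, which is presumably what makes this family of methods attractive at the scale of billions of k-mers.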
Affiliation(s)
- Wontack Han
- Computer Science Department, Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana, USA
- Haixu Tang
- Computer Science Department, Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana, USA
- Yuzhen Ye
- Computer Science Department, Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana, USA
2
Werbin ZR, Hackos B, Lopez-Nava J, Dietze MC, Bhatnagar JM. The National Ecological Observatory Network's soil metagenomes: assembly and basic analysis. F1000Res 2021; 10:299. [PMID: 35707452 PMCID: PMC9178279 DOI: 10.12688/f1000research.51494.1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 03/16/2021] [Indexed: 09/17/2023]
Abstract
The National Ecological Observatory Network (NEON) annually performs shotgun metagenomic sequencing to sample genes within soils at 47 sites across the United States. NEON serves as a valuable educational resource, thanks to its open data policies and programming tutorials, but there is currently no introductory tutorial for performing analyses with the soil shotgun metagenomic dataset. Here, we describe a workflow for processing raw soil metagenome sequencing reads using the Sunbeam bioinformatics pipeline. The workflow includes cleaning and processing raw reads, taxonomic classification, assembly into contigs, annotation of predicted genes using custom protein databases, and exporting assemblies to the KBase platform for downstream analysis. This workflow is designed to be robust to annual data releases from NEON, and the underlying Snakemake framework can manage complex software dependencies. The workflow presented here aims to increase the accessibility of NEON's shotgun metagenome data, which can provide important clues about soil microbial communities and their ecological roles.
Affiliation(s)
- Zoey R. Werbin
- Department of Biology, Boston University, Boston, MA, 02215, USA
- Briana Hackos
- Department of Mathematics, University of Colorado, Boulder, Boulder, CO, 80309, USA
- Jorge Lopez-Nava
- Department of Mathematics, Swarthmore College, Swarthmore, PA 19081, USA
- Michael C. Dietze
- Department of Earth & Environment, Boston University, Boston, MA, 02215, USA
3
Information Theoretic Metagenome Assembly Allows the Discovery of Disease Biomarkers in Human Microbiome. Entropy 2021; 23:e23020187. [PMID: 33540903 PMCID: PMC7913240 DOI: 10.3390/e23020187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2020] [Revised: 01/11/2021] [Accepted: 01/13/2021] [Indexed: 11/26/2022]
Abstract
Quantitative metagenomics is an important field that has delivered successful microbiome biomarkers associated with host phenotypes. The current convention mainly depends on unsupervised assembly of metagenomic contigs with a possibility of leaving interesting genetic material unassembled. Additionally, biomarkers are commonly defined on the differential relative abundance of compositional or functional units. Accumulating evidence supports that microbial genetic variations are as important as the differential abundance content, implying the need for novel methods accounting for the genetic variations in metagenomics studies. We propose an information theoretic metagenome assembly algorithm, discovering genomic fragments with maximal self-information, defined by the empirical distributions of nucleotides across the phenotypes and quantified with the help of statistical tests. Our algorithm infers fragments populating the most informative genetic variants in a single contig, named supervariant fragments. Experiments on simulated metagenomes, as well as on a colorectal cancer and an atherosclerotic cardiovascular disease dataset consistently discovered sequences strongly associated with the disease phenotypes. Moreover, the discriminatory power of these putative biomarkers was mainly attributed to the genetic variations rather than relative abundance. Our results support that a focus on metagenomics methods considering microbiome population genetics might be useful in discovering disease biomarkers with a great potential of translating to molecular diagnostics and biotherapeutics applications.
4
LaPierre N, Ju CJT, Zhou G, Wang W. MetaPheno: A critical evaluation of deep learning and machine learning in metagenome-based disease prediction. Methods 2019; 166:74-82. [PMID: 30885720 PMCID: PMC6708502 DOI: 10.1016/j.ymeth.2019.03.003] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 12/01/2018] [Revised: 02/14/2019] [Accepted: 03/04/2019] [Indexed: 01/21/2023]
Abstract
The human microbiome plays a number of critical roles, impacting almost every aspect of human health and well-being. Conditions in the microbiome have been linked to a number of significant diseases. Additionally, revolutions in sequencing technology have led to a rapid increase in publicly-available sequencing data. Consequently, there have been growing efforts to predict disease status from metagenomic sequencing data, with a proliferation of new approaches in the last few years. Some of these efforts have explored utilizing a powerful form of machine learning called deep learning, which has been applied successfully in several biological domains. Here, we review some of these methods and the algorithms that they are based on, with a particular focus on deep learning methods. We also perform a deeper analysis of Type 2 Diabetes and obesity datasets that have eluded improved results, using a variety of machine learning and feature extraction methods. We conclude by offering perspectives on study design considerations that may impact results and future directions the field can take to improve results and offer more valuable conclusions. The scripts and extracted features for the analyses conducted in this paper are available via GitHub: https://github.com/nlapier2/metapheno.
Affiliation(s)
- Nathan LaPierre
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
- Chelsea J-T Ju
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
- Guangyu Zhou
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
- Wei Wang
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
5
Han W, Ye Y. A repository of microbial marker genes related to human health and diseases for host phenotype prediction using microbiome data. Pac Symp Biocomput 2019; 24:236-247. [PMID: 30864326 PMCID: PMC6417824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Abstract
The microbiome research is going through an evolutionary transition from focusing on the characterization of reference microbiomes associated with different environments/hosts to the translational applications, including using microbiome for disease diagnosis, improving the efficacy of cancer treatments, and prevention of diseases (e.g., using probiotics). Microbial markers have been identified from microbiome data derived from cohorts of patients with different diseases, treatment responsiveness, etc., and often predictors based on these markers were built for predicting host phenotype given a microbiome dataset (e.g., to predict if a person has type 2 diabetes given his or her microbiome data). Unfortunately, these microbial markers and predictors are often not published so are not reusable by others. In this paper, we report the curation of a repository of microbial marker genes and predictors built from these markers for microbiome-based prediction of host phenotype, and a computational pipeline called Mi2P (from Microbiome to Phenotype) for using the repository. As an initial effort, we focus on microbial marker genes related to two diseases, type 2 diabetes and liver cirrhosis, and immunotherapy efficacy for two types of cancer, non-small-cell lung cancer (NSCLC) and renal cell carcinoma (RCC). We characterized the marker genes from metagenomic data using our recently developed subtractive assembly approach. We showed that predictors built from these microbial marker genes can provide fast and reasonably accurate prediction of host phenotype given microbiome data. As understanding and making use of microbiome data (our second genome) is becoming vital as we move forward in this age of precision health and precision medicine, we believe that such a repository will be useful for enabling translational applications of microbiome data.
MESH Headings
- Carcinoma, Non-Small-Cell Lung/genetics
- Carcinoma, Non-Small-Cell Lung/microbiology
- Carcinoma, Non-Small-Cell Lung/therapy
- Carcinoma, Renal Cell/genetics
- Carcinoma, Renal Cell/microbiology
- Carcinoma, Renal Cell/therapy
- Computational Biology/methods
- Databases, Genetic
- Diabetes Mellitus, Type 2/genetics
- Diabetes Mellitus, Type 2/microbiology
- Genes, Microbial
- Genetic Markers
- Host Microbial Interactions/genetics
- Humans
- Immunotherapy
- Kidney Neoplasms/genetics
- Kidney Neoplasms/microbiology
- Kidney Neoplasms/therapy
- Liver Cirrhosis/genetics
- Liver Cirrhosis/microbiology
- Lung Neoplasms/genetics
- Lung Neoplasms/microbiology
- Lung Neoplasms/therapy
- Machine Learning
- Metagenomics/methods
- Metagenomics/statistics & numerical data
- Microbiota/genetics
- Phenotype
- Translational Research, Biomedical
Affiliation(s)
- Wontack Han
- Computer Science Department, Indiana University, Bloomington, IN 47408, USA
6
Graft-Derived Reconstitution of Mucosal-Associated Invariant T Cells after Allogeneic Hematopoietic Cell Transplantation. Biol Blood Marrow Transplant 2017; 24:242-251. [PMID: 29024803 DOI: 10.1016/j.bbmt.2017.10.003] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Received: 06/12/2017] [Accepted: 10/02/2017] [Indexed: 02/05/2023]
Abstract
Mucosal-associated invariant T (MAIT) cells express a semi-invariant Vα7.2+ T cell receptor (TCR) that recognizes ligands from distinct bacterial and fungal species. In neonates, MAIT cells proliferate coincident with gastrointestinal (GI) bacterial colonization. In contrast, under noninflammatory conditions adult MAIT cells remain quiescent because of acquired regulation of TCR signaling. Effects of inflammation and the altered GI microbiota after allogeneic hematopoietic cell transplantation (HCT) on MAIT cell reconstitution have not been described. We conducted an observational study of MAIT cell reconstitution in myeloablative (n = 41) and nonmyeloablative (n = 66) allogeneic HCT recipients and found that despite a rapid and early increase to a plateau at day 30 after HCT, MAIT cell numbers failed to normalize for at least 1 year. Cord blood transplant recipients and those who received post-HCT cyclophosphamide for graft versus host disease (GVHD) prophylaxis had profoundly impaired MAIT cell reconstitution. Sharing of TCRβ gene sequences between MAIT cells isolated from HCT grafts and blood of recipients after HCT showed early MAIT cell reconstitution was due at least in part to proliferation of MAIT cells transferred in the HCT graft. Inflammatory cytokines were required for TCR-dependent MAIT cell proliferation, suggesting that bacterial Vα7.2+ TCR ligands might promote MAIT cell reconstitution after HCT. Robust MAIT cell reconstitution was associated with an increased GI abundance of Blautia spp. MAIT cells suppressed proliferation of conventional T cells consistent with a possible regulatory role. Our data identify modifiable factors impacting MAIT cell reconstitution that could influence the risk of GVHD after HCT.
7
Han W, Wang M, Ye Y. A concurrent subtractive assembly approach for identification of disease associated sub-metagenomes. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY : ... ANNUAL INTERNATIONAL CONFERENCE, RECOMB ... : PROCEEDINGS. RECOMB (CONFERENCE : 2005- ) 2017; 2017:18-33. [PMID: 29177251 DOI: 10.1007/978-3-319-56970-3_2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/04/2023]
Abstract
Comparative analysis of metagenomes can be used to detect sub-metagenomes (species or gene sets) that are associated with specific phenotypes (e.g., host status). The typical workflow is to assemble and annotate metagenomic datasets individually or as a whole, followed by statistical tests to identify differentially abundant species/genes. We previously developed subtractive assembly (SA), a de novo assembly approach for comparative metagenomics that first detects differential reads that distinguish between two groups of metagenomes and then only assembles these reads. Application of SA to type 2 diabetes (T2D) microbiomes revealed new microbial genes associated with T2D. Here we further developed a Concurrent Subtractive Assembly (CoSA) approach, which uses a Wilcoxon rank-sum (WRS) test to detect k-mers that are differentially abundant between two groups of microbiomes (by contrast, SA only checks ratios of k-mer counts in one pooled sample versus the other). It then uses identified differential k-mers to extract reads that are likely sequenced from the sub-metagenome with consistent abundance differences between the groups of microbiomes. Further, CoSA attempts to reduce the redundancy of reads (from abundant common species) by excluding reads containing abundant k-mers. Using simulated microbiome datasets and T2D datasets, we show that CoSA achieves strikingly better performance in detecting consistent changes than SA does, and it enables the detection and assembly of genomes and genes with minor abundance difference. An SVM classifier built upon the microbial genes detected by CoSA from the T2D datasets can accurately discriminate patients from healthy controls, with an AUC of 0.94 (10-fold cross-validation), and therefore these differential genes (207 genes) may serve as potential microbial marker genes for T2D.
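The k-mer screening step described above can be sketched as a per-k-mer Wilcoxon rank-sum test across the two groups of samples. This is an illustration only and not the CoSA implementation: it uses the normal approximation without tie or continuity correction, applies no multiple-testing correction, and the names `average_ranks`, `rank_sum_test`, `differential_kmers`, the threshold `alpha`, and the toy counts are all hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided rank-sum test via the normal approximation.

    Returns (z, p). Assumption: no tie or continuity correction.
    """
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n_a])  # rank sum of the first group
    mean_w = n_a * (n_a + n_b + 1) / 2
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def differential_kmers(counts_a, counts_b, alpha=0.05):
    """Return k-mers whose per-sample counts differ between the two groups."""
    hits = []
    for kmer in counts_a:
        _, p = rank_sum_test(counts_a[kmer], counts_b[kmer])
        if p < alpha:
            hits.append(kmer)
    return hits


# Toy counts: five samples per group, one shifted k-mer and one flat k-mer.
patients = {"ACGT": [10, 12, 11, 13, 14], "TTTT": [5, 5, 5, 5, 5]}
controls = {"ACGT": [1, 2, 3, 4, 5], "TTTT": [5, 5, 5, 5, 5]}
hits = differential_kmers(patients, controls)
```

Testing each sample's count separately, rather than comparing pooled counts, is what lets a rank-based test favor k-mers that differ consistently across samples over k-mers driven by a single outlier sample.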
Affiliation(s)
- Wontack Han
- Indiana University, Bloomington, Indiana, USA
- Yuzhen Ye
- Indiana University, Bloomington, Indiana, USA