1
Valiei A, Dickson A, Aminian-Dehkordi J, Mofrad MRK. Metabolic interactions shape emergent biofilm structures in a conceptual model of gut mucosal bacterial communities. NPJ Biofilms Microbiomes 2024; 10:99. [PMID: 39358363; PMCID: PMC11447261; DOI: 10.1038/s41522-024-00572-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 09/16/2024] [Indexed: 10/04/2024] Open
Abstract
The gut microbiome plays a major role in human health; however, little is known about the structural arrangement of microbes and factors governing their distribution. In this work, we present an in silico agent-based model (ABM) to conceptually simulate the dynamics of gut mucosal bacterial communities. We explored how various types of metabolic interactions, including competition, neutralism, commensalism, and mutualism, affect community structure, through nutrient consumption and metabolite exchange. Results showed that, across scenarios with different initial species abundances, cross-feeding promotes species coexistence. Morphologically, competition and neutralism resulted in segregation, while mutualism and commensalism fostered high intermixing. In addition, cooperative relations resulted in community properties with little sensitivity to the selective uptake of metabolites produced by the host. Moreover, metabolic interactions strongly influenced colonization success following the invasion of newcomer species. These results provide important insights into the utility of ABM in deciphering complex microbiome patterns.
Affiliation(s)
- Amin Valiei
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, 94720, USA
- Andrew Dickson
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, 94720, USA
- Javad Aminian-Dehkordi
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, 94720, USA
- Mohammad R K Mofrad
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, 94720, USA
  - Molecular Biophysics and Integrative Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
2
Roy G, Prifti E, Belda E, Zucker JD. Deep learning methods in metagenomics: a review. Microb Genom 2024; 10:001231. [PMID: 38630611; PMCID: PMC11092122; DOI: 10.1099/mgen.0.001231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Accepted: 03/27/2024] [Indexed: 04/19/2024] Open
Abstract
The ever-decreasing cost of sequencing and the growing potential applications of metagenomics have led to an unprecedented surge in data generation. One of the most prevalent applications of metagenomics is the study of microbial environments, such as the human gut. The gut microbiome plays a crucial role in human health, providing vital information for patient diagnosis and prognosis. However, analysing metagenomic data remains challenging due to several factors, including reference catalogues, sparsity and compositionality. Deep learning (DL) enables novel and promising approaches that complement state-of-the-art microbiome pipelines. DL-based methods can address almost all aspects of microbiome analysis, including novel pathogen detection, sequence classification, patient stratification and disease prediction. Beyond generating predictive models, a key aspect of these methods is also their interpretability. This article reviews DL approaches in metagenomics, including convolutional networks, autoencoders and attention-based models. These methods aggregate contextualized data and pave the way for improved patient care and a better understanding of the microbiome's key role in our health.
Affiliation(s)
- Gaspar Roy
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
- Edi Prifti
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
- Eugeni Belda
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
- Jean-Daniel Zucker
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l’hopital, 75013 Paris, France
3
Regueira-Iglesias A, Balsa-Castro C, Blanco-Pintos T, Tomás I. Critical review of 16S rRNA gene sequencing workflow in microbiome studies: From primer selection to advanced data analysis. Mol Oral Microbiol 2023; 38:347-399. [PMID: 37804481; DOI: 10.1111/omi.12434] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/26/2023] [Revised: 09/01/2023] [Accepted: 09/14/2023] [Indexed: 10/09/2023]
Abstract
The multi-batch reanalysis approach of jointly reevaluating gene/genome sequences from different works has gained particular relevance in the literature in recent years. The large amount of 16S ribosomal ribonucleic acid (rRNA) gene sequence data stored in public repositories and information in taxonomic databases of the same gene far exceeds that related to complete genomes. This review is intended to guide researchers new to studying microbiota, particularly the oral microbiota, using 16S rRNA gene sequencing and those who want to expand and update their knowledge to optimise their decision-making and improve their research results. First, we describe the advantages and disadvantages of using the 16S rRNA gene as a phylogenetic marker and the latest findings on the impact of primer pair selection on diversity and taxonomic assignment outcomes in oral microbiome studies. Strategies for primer selection based on these results are introduced. Second, we identified the key factors to consider in selecting the sequencing technology and platform. The process and particularities of the main steps for processing 16S rRNA gene-derived data are described in detail to enable researchers to choose the most appropriate bioinformatics pipeline and analysis methods based on the available evidence. We then produce an overview of the different types of advanced analyses, both the most widely used in the literature and the most recent approaches. Several indices, metrics and software for studying microbial communities are included, highlighting their advantages and disadvantages. Considering the principles of clinical metagenomics, we conclude that future research should focus on rigorous analytical approaches, such as developing predictive models to identify microbiome-based biomarkers to classify health and disease states. Finally, we address the batch effect concept and the microbiome-specific methods for accounting for or correcting them.
Affiliation(s)
- Alba Regueira-Iglesias
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
- Carlos Balsa-Castro
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
- Triana Blanco-Pintos
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
- Inmaculada Tomás
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
4
Wassan JT, Wang H, Zheng H. Developing a New Phylogeny-Driven Random Forest Model for Functional Metagenomics. IEEE Trans Nanobioscience 2023; 22:763-770. [PMID: 37279136; DOI: 10.1109/tnb.2023.3283462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
Metagenomics is an unobtrusive science linking microbial genes to biological functions or environmental states. Classifying microbial genes into their functional repertoire is an important task in the downstream analysis of Metagenomic studies. The task involves Machine Learning (ML) based supervised methods to achieve good classification performance. Random Forest (RF) has been applied rigorously to microbial gene abundance profiles, mapping them to functional phenotypes. The current research targets tuning RF by the evolutionary ancestry of microbial phylogeny, developing a Phylogeny-RF model for functional classification of metagenomes. This method facilitates capturing the effects of phylogenetic relatedness in an ML classifier itself rather than just applying a supervised classifier over the raw abundances of microbial genes. The idea is rooted in the fact that closely related microbes by phylogeny are highly correlated and tend to have similar genetic and phenotypic traits. Such microbes behave similarly; and hence tend to be selected together, or one of these could be dropped from the analysis, to improve the ML process. The proposed Phylogeny-RF algorithm has been compared with state-of-the-art classification methods including RF and the phylogeny-aware methods of MetaPhyl and PhILR, using three real-world 16S rRNA metagenomic datasets. It has been observed that the proposed method not only achieved significantly better performance than the traditional RF model but also performed better than the other phylogeny-driven benchmarks (p < 0.05). For example, Phylogeny-RF attained a highest AUC of 0.949 and Kappa of 0.891 over soil microbiomes in comparison to other benchmarks.
5
Babaiha NS, Aghdam R, Ghiam S, Eslahchi C. NN-RNALoc: Neural network-based model for prediction of mRNA sub-cellular localization using distance-based sub-sequence profiles. PLoS One 2023; 18:e0258793. [PMID: 37708177; PMCID: PMC10501558; DOI: 10.1371/journal.pone.0258793] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2021] [Accepted: 05/12/2023] [Indexed: 09/16/2023] Open
Abstract
The localization of messenger RNAs (mRNAs) is a frequently observed phenomenon and a crucial aspect of gene expression regulation. It is also a mechanism for targeting proteins to a specific cellular region. Moreover, prior research and studies have shown the significance of intracellular RNA positioning during embryonic and neural dendrite formation. Incorrect RNA localization, which can be caused by a variety of factors, such as mutations in trans-regulatory elements, has been linked to the development of certain neuromuscular diseases and cancer. In this study, we introduced NN-RNALoc, a neural network-based method for predicting the cellular location of mRNA using novel features extracted from mRNA sequence data and protein interaction patterns. In fact, we developed a distance-based subsequence profile for RNA sequence representation that is more memory and time-efficient than well-known k-mer sequence representation. Combining protein-protein interaction data, which is essential for numerous biological processes, with our novel distance-based subsequence profiles of mRNA sequences produces more accurate features. On two benchmark datasets, CeFra-Seq and RNALocate, the performance of NN-RNALoc is compared to powerful predictive models proposed in previous works (mRNALoc, RNATracker, mLoc-mRNA, DM3Loc, iLoc-mRNA, and EL-RMLocNet), and a ground neural (DNN5-mer) network. Compared to the previous methods, NN-RNALoc significantly reduces computation time and also outperforms them in terms of accuracy. This study's source code and datasets are freely accessible at https://github.com/NeginBabaiha/NN-RNALoc.
Affiliation(s)
- Negin Sadat Babaiha
  - Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
  - Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, Germany
- Rosa Aghdam
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
  - Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, United States of America
- Shokoofeh Ghiam
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Changiz Eslahchi
  - Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
6
Unal M, Bostanci E, Ozkul C, Acici K, Asuroglu T, Guzel MS. Crohn's Disease Prediction Using Sequence Based Machine Learning Analysis of Human Microbiome. Diagnostics (Basel) 2023; 13:2835. [PMID: 37685376; PMCID: PMC10486516; DOI: 10.3390/diagnostics13172835] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2023] [Revised: 08/24/2023] [Accepted: 08/31/2023] [Indexed: 09/10/2023] Open
Abstract
Human microbiota refers to the trillions of microorganisms that inhabit our bodies and have been discovered to have a substantial impact on human health and disease. By sampling the microbiota, it is possible to generate massive quantities of data for analysis using Machine Learning algorithms. In this study, we employed several modern Machine Learning techniques to predict Inflammatory Bowel Disease using raw sequence data. The dataset was obtained from NCBI preprocessed graph representations and converted into a structured form. Seven well-known Machine Learning frameworks, including Random Forest, Support Vector Machines, Extreme Gradient Boosting, Light Gradient Boosting Machine, Gaussian Naïve Bayes, Logistic Regression, and k-Nearest Neighbor, were used. Grid Search was employed for hyperparameter optimization. The performance of the Machine Learning models was evaluated using various metrics such as accuracy, precision, F-score, kappa, and area under the receiver operating characteristic curve. Additionally, McNemar's test was conducted to assess the statistical significance of the experiment. The data was constructed using k-mer lengths of 3, 4 and 5. The Light Gradient Boosting Machine model outperformed the other models with 67.24%, 74.63% and 76.47% accuracy for k-mer lengths of 3, 4 and 5, respectively. The LightGBM model also demonstrated the best performance in each metric. The study showed promising results predicting disease from raw sequence data. Finally, McNemar's test results found statistically significant differences between different Machine Learning approaches.
Affiliation(s)
- Metehan Unal
  - Department of Computer Engineering, Ankara University, 06830 Ankara, Turkey
- Erkan Bostanci
  - Department of Computer Engineering, Ankara University, 06830 Ankara, Turkey
- Ceren Ozkul
  - Department of Pharmaceutical Microbiology, Faculty of Pharmacy, Hacettepe University, 06230 Ankara, Turkey
- Koray Acici
  - Department of Artificial Intelligence and Data Engineering, Ankara University, 06830 Ankara, Turkey
- Tunc Asuroglu
  - Faculty of Medicine and Health Technology, Tampere University, 33720 Tampere, Finland
- Mehmet Serdar Guzel
  - Department of Computer Engineering, Ankara University, 06830 Ankara, Turkey
7
Cui Z, Wu Y, Zhang QH, Wang SG, He Y, Huang DS. MV-CVIB: a microbiome-based multi-view convolutional variational information bottleneck for predicting metastatic colorectal cancer. Front Microbiol 2023; 14:1238199. [PMID: 37675425; PMCID: PMC10477591; DOI: 10.3389/fmicb.2023.1238199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2023] [Accepted: 08/02/2023] [Indexed: 09/08/2023] Open
Abstract
Introduction: Imbalances in gut microbes have been implicated in many human diseases, including colorectal cancer (CRC), inflammatory bowel disease, type 2 diabetes, obesity, autism, and Alzheimer's disease. Compared with other human diseases, CRC is a gastrointestinal malignancy with high mortality and a high probability of metastasis. However, current studies mainly focus on the prediction of colorectal cancer while neglecting the more serious malignancy of metastatic colorectal cancer (mCRC). In addition, high dimensionality and small sample sizes add to the complexity of gut microbial data, which makes the task harder for traditional machine learning models.
Methods: To address these challenges, we collected and processed 16S rRNA data and calculated abundance data from patients with non-metastatic colorectal cancer (non-mCRC) and mCRC. Different from the traditional health-disease classification strategy, we adopted a novel disease-disease classification strategy and proposed a microbiome-based multi-view convolutional variational information bottleneck (MV-CVIB).
Results: The experimental results show that MV-CVIB can effectively predict mCRC. The model achieves AUC values above 0.9 in comparisons with other state-of-the-art models. MV-CVIB also achieved satisfactory predictive performance on multiple published CRC gut microbiome datasets.
Discussion: Finally, multiple gut microbiota analyses were used to elucidate communities and differences between mCRC and non-mCRC, and the metastatic properties of CRC were assessed by patient age and microbiota expression.
Affiliation(s)
- Zhen Cui
  - Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, China
- Yan Wu
  - College of Electronics and Information Engineering, Tongji University, Shanghai, China
- Qin-Hu Zhang
  - EIT Institute for Advanced Study, Ningbo, Zhejiang, China
- Si-Guo Wang
  - Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, China
- Ying He
  - Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, China
8
Syama K, Jothi JAA, Khanna N. Automatic disease prediction from human gut metagenomic data using boosting GraphSAGE. BMC Bioinformatics 2023; 24:126. [PMID: 37003965; PMCID: PMC10067187; DOI: 10.1186/s12859-023-05251-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2023] [Accepted: 03/23/2023] [Indexed: 04/03/2023] Open
Abstract
Background: The human microbiome plays a critical role in maintaining human health. Due to the recent advances in high-throughput sequencing technologies, the microbiome profiles present in the human body have become publicly available. Hence, many studies have analyzed human microbiome profiles. These studies have found that healthy and sick individuals carry different microbiome profiles across a range of diseases. Recently, several computational methods have utilized the microbiome profiles to automatically diagnose and classify the host phenotype.
Results: In this work, a novel deep learning framework based on boosting GraphSAGE is proposed for automatic prediction of diseases from metagenomic data. The proposed framework has two main components: (a) a Metagenomic Disease graph (MD-graph) construction module and (b) a Disease Prediction Network (DP-Net) module. The graph construction module constructs a graph by considering each metagenomic sample as a node in the graph. The graph captures the relationship between the samples using a proximity measure. The DP-Net consists of a boosting GraphSAGE model which predicts the status of a sample as sick or healthy. The effectiveness of the proposed method is verified using real and synthetic datasets corresponding to diseases like inflammatory bowel disease and colorectal cancer. The proposed model achieved a highest AUC of 93%, accuracy of 95%, F1-score of 95% and AUPRC of 95% for the real inflammatory bowel disease dataset, and a best AUC of 90%, accuracy of 91%, F1-score of 87% and AUPRC of 93% for the real colorectal cancer dataset.
Conclusion: The proposed framework outperforms other machine learning and deep learning models in terms of classification accuracy, AUC, F1-score and AUPRC for both synthetic and real metagenomic data.
Affiliation(s)
- K Syama
  - Department of Computer Science, Birla Institute of Technology and Science Pilani Dubai Campus, Dubai International Academic City, Dubai, UAE
- J Angel Arul Jothi
  - Department of Computer Science, Birla Institute of Technology and Science Pilani Dubai Campus, Dubai International Academic City, Dubai, UAE
9
Zhai H, Fukuyama J. A convenient correspondence between k-mer-based metagenomic distances and phylogenetically-informed β-diversity measures. PLoS Comput Biol 2023; 19:e1010821. [PMID: 36608056; PMCID: PMC9879504; DOI: 10.1371/journal.pcbi.1010821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2022] [Revised: 01/26/2023] [Accepted: 12/16/2022] [Indexed: 01/07/2023] Open
Abstract
k-mer-based distances are often used to describe the differences between communities in metagenome sequencing studies because of their computational convenience and history of effectiveness. Although k-mer-based distances do not use information about taxon abundances, we show that one class of k-mer distances between metagenomes (the Euclidean distance between k-mer spectra, or EKS distances) are very closely related to a class of phylogenetically-informed β-diversity measures that do explicitly use both the taxon abundances and information about the phylogenetic relationships among the taxa. Furthermore, we show that both of these distances can be interpreted as using certain features of the taxon abundances that are related to the phylogenetic tree. Our results allow practitioners to perform phylogenetically-informed analyses when they only have k-mer data available and provide a theoretical basis for using k-mer spectra with relatively small values of k (on the order of 4-5). They are also useful for analysts who wish to know more of the properties of any method based on k-mer spectra and provide insight into one class of phylogenetically-informed β-diversity measures.
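As a rough illustration of the EKS distance named in this abstract, the sketch below computes the Euclidean distance between the k-mer spectra of two samples. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `kmer_spectrum` and `eks_distance` are hypothetical, spectra are normalized to relative frequencies (the abstract does not say whether counts are normalized), and k-mers containing characters other than A, C, G, T are skipped.

```python
import math
from collections import Counter
from itertools import product


def kmer_spectrum(sequences, k):
    """Relative frequency of every k-mer over the alphabet ACGT.

    The vector is ordered lexicographically (AA..A first, TT..T last).
    Normalizing by the total k-mer count is an assumption of this sketch.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows with ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[kmer] / total for kmer in all_kmers]


def eks_distance(sample_a, sample_b, k=4):
    """Euclidean distance between the k-mer spectra of two samples.

    Each sample is a list of read or contig strings (hypothetical input format).
    """
    spec_a = kmer_spectrum(sample_a, k)
    spec_b = kmer_spectrum(sample_b, k)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(spec_a, spec_b)))
```

The default `k=4` follows the abstract's remark that small values of k, on the order of 4-5, are of interest; the spectrum then has 4^4 = 256 entries, so plain Python lists are sufficient here.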
Affiliation(s)
- Hongxuan Zhai
  - Department of Statistics, Indiana University Bloomington, Bloomington, Indiana, United States of America
- Julia Fukuyama
  - Department of Statistics, Indiana University Bloomington, Bloomington, Indiana, United States of America
10
Loganathan T, Priya Doss C G. The influence of machine learning technologies in gut microbiome research and cancer studies - A review. Life Sci 2022; 311:121118. [DOI: 10.1016/j.lfs.2022.121118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 10/19/2022] [Accepted: 10/19/2022] [Indexed: 11/18/2022]
11
Wen LY, Zhang XM, Li QF, Min F. KGA: integrating KPCA and GAN for microbial data augmentation. Int J Mach Learn Cyb 2022. [DOI: 10.1007/s13042-022-01707-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
12
Ross EM, Hayes BJ. Metagenomic Predictions: A Review 10 years on. Front Genet 2022; 13:865765. [PMID: 35938022; PMCID: PMC9348756; DOI: 10.3389/fgene.2022.865765] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/30/2022] [Accepted: 06/01/2022] [Indexed: 11/13/2022] Open
Abstract
Metagenomic predictions use variation in the metagenome (microbiome profile) to predict the unknown phenotype of the associated host. Metagenomic predictions were first developed 10 years ago, when they were used to predict which cattle would produce high or low levels of enteric methane. Since then, the approach has been applied to several traits and species including residual feed intake in cattle, and carcass traits, body mass index and disease state in pigs. Additionally, the method has been extended to include predictions based on other multi-dimensional data such as the metabolome, as well as to combine genomic and metagenomic information. While there is still substantial optimisation required, the use of metagenomic predictions is expanding as DNA sequencing costs continue to fall, and it shows great promise, particularly for traits heavily influenced by the microbiome such as feed efficiency and methane emissions.
13
Borgman J, Stark K, Carson J, Hauser L. Deep Learning Encoding for Rapid Sequence Identification on Microbiome Data. Frontiers in Bioinformatics 2022; 2:871256. [PMID: 36304316; PMCID: PMC9580936; DOI: 10.3389/fbinf.2022.871256] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/08/2022] [Accepted: 05/30/2022] [Indexed: 11/18/2022] Open
Abstract
We present a novel approach for rapidly identifying sequences that leverages the representational power of Deep Learning techniques and is applied to the analysis of microbiome data. The method involves the creation of a latent sequence space, training a convolutional neural network to rapidly identify sequences by mapping them into that space, and we leverage the novel encoded latent space for denoising to correct sequencing errors. Using mock bacterial communities of known composition, we show that this approach achieves single nucleotide resolution, generating results for sequence identification and abundance estimation that match the best available microbiome algorithms in terms of accuracy while vastly increasing the speed of accurate processing. We further show the ability of this approach to support phenotypic prediction at the sample level on an experimental data set for which the ground truth for sequence identities and abundances is unknown, but the expected phenotypes of the samples are definitive. Moreover, this approach offers a potential solution for the analysis of data from other types of experiments that currently rely on computationally intensive sequence identification.
14
McElhinney JMWR, Catacutan MK, Mawart A, Hasan A, Dias J. Interfacing Machine Learning and Microbial Omics: A Promising Means to Address Environmental Challenges. Front Microbiol 2022; 13:851450. [PMID: 35547145; PMCID: PMC9083327; DOI: 10.3389/fmicb.2022.851450] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/09/2022] [Accepted: 03/14/2022] [Indexed: 11/13/2022] Open
Abstract
Microbial communities are ubiquitous and carry an exceptionally broad metabolic capability. Upon environmental perturbation, microbes are also amongst the first natural responsive elements with perturbation-specific cues and markers. These communities are thereby uniquely positioned to inform on the status of environmental conditions. The advent of microbial omics has led to an unprecedented volume of complex microbiological data sets. Importantly, these data sets are rich in biological information with potential for predictive environmental classification and forecasting. However, the patterns in this information are often hidden amongst the inherent complexity of the data. There has been a continued rise in the development and adoption of machine learning (ML) and deep learning architectures for solving research challenges of this sort. Indeed, the interface between molecular microbial ecology and artificial intelligence (AI) appears to show considerable potential for significantly advancing environmental monitoring and management practices through their application. Here, we provide a primer for ML, highlight the notion of retaining biological sample information for supervised ML, discuss workflow considerations, and review the state of the art of the exciting, yet nascent, interdisciplinary field of ML-driven microbial ecology. Current limitations in this sphere of research are also addressed to frame a forward-looking perspective toward the realization of what we anticipate will become a pivotal toolkit for addressing environmental monitoring and management challenges in the years ahead.
Affiliation(s)
- James M. W. R. McElhinney
  - Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
- Aurelie Mawart
  - Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
- Ayesha Hasan
  - Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
  - Department of Biomedical Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
- Jorge Dias
  - EECS, Center for Autonomous Robotic Systems, Khalifa University, Abu Dhabi, United Arab Emirates
15
Grazioli F, Siarheyeu R, Alqassem I, Henschel A, Pileggi G, Meiser A. Microbiome-based disease prediction with multimodal variational information bottlenecks. PLoS Comput Biol 2022; 18:e1010050. [PMID: 35404958; PMCID: PMC9022840; DOI: 10.1371/journal.pcbi.1010050] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/02/2021] [Revised: 04/21/2022] [Accepted: 03/22/2022] [Indexed: 01/12/2023] Open
Abstract
Scientific research is shedding light on the interaction of the gut microbiome with the human host and on its role in human health. Existing machine learning methods have shown great potential in discriminating healthy from diseased microbiome states. Most of them leverage shotgun metagenomic sequencing to extract gut microbial species-relative abundances or strain-level markers. Each of these gut microbial profiling modalities showed diagnostic potential when tested separately; however, no existing approach combines them in a single predictive framework. Here, we propose the Multimodal Variational Information Bottleneck (MVIB), a novel deep learning model capable of learning a joint representation of multiple heterogeneous data modalities. MVIB achieves competitive classification performance while being faster than existing methods. Additionally, MVIB offers interpretable results. Our model adopts an information theoretic interpretation of deep neural networks and computes a joint stochastic encoding of different input data modalities. We use MVIB to predict whether human hosts are affected by a certain disease by jointly analysing gut microbial species-relative abundances and strain-level markers. MVIB is evaluated on human gut metagenomic samples from 11 publicly available disease cohorts covering 6 different diseases. We achieve high performance (0.80 < ROC AUC < 0.95) on 5 cohorts and at least medium performance on the remaining ones. We adopt a saliency technique to interpret the output of MVIB and identify the most relevant microbial species and strain-level markers to the model’s predictions. We also perform cross-study generalisation experiments, where we train and test MVIB on different cohorts of the same disease, and overall we achieve comparable results to the baseline approach, i.e. the Random Forest. Further, we evaluate our model by adding metabolomic data derived from mass spectrometry as a third input modality. 
Our method is scalable with respect to input data modalities and has an average training time of < 1.4 seconds. The source code and the datasets used in this work are publicly available.
The gut microbiome can be an indicator of various diseases due to its interaction with the human system. Our main objective is to improve on the current state of the art in microbiome classification for diagnostic purposes. A rich body of literature evidences the clinical value of microbiome predictive models. Here, we propose the Multimodal Variational Information Bottleneck (MVIB), a novel deep learning model for microbiome-based disease prediction. MVIB learns a joint stochastic encoding of different input data modalities to predict the output class. We use MVIB to predict whether human hosts are affected by a certain disease by jointly analysing gut microbial species-relative abundance and strain-level marker profiles. Both of these gut microbial features showed diagnostic potential when tested separately in previous studies; however, no research has combined them in a single predictive tool. We evaluate MVIB on various human gut metagenomic samples from 11 publicly available disease cohorts. MVIB achieves competitive performance compared to state-of-the-art methods. Additionally, we evaluate our model by adding metabolomic data as a third input modality and we show that MVIB is scalable with respect to input feature modalities. Further, we adopt a saliency technique to interpret the output of MVIB and identify the most relevant microbial species and strain-level markers to our model predictions.
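The abstract above reports classification performance as ROC AUC. As a reminder of what that number measures, here is a minimal Python sketch that computes ROC AUC from binary labels and predicted scores using the pairwise-ranking (Mann-Whitney) formulation. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the MVIB code base or its evaluation pipeline.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Illustrative helper (hypothetical name); ties between a positive and a
    negative score count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): 1 = diseased, 0 = healthy.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The quadratic pair loop is chosen for clarity; for large cohorts a sort-based implementation would be the usual choice.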
Affiliation(s)
- Andreas Henschel
- Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, UAE
- Research and Data Intelligence Support Center, Khalifa University, Abu Dhabi, UAE
| |
|
16
|
Michel‐Mata S, Wang X, Liu Y, Angulo MT. Predicting microbiome compositions from species assemblages through deep learning. IMETA 2022; 1:e3. [PMID: 35757098 PMCID: PMC9221840 DOI: 10.1002/imt2.3] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 10/20/2021] [Revised: 12/29/2021] [Accepted: 01/04/2022] [Indexed: 05/13/2023]
Abstract
Microbes can form complex communities that perform critical functions in maintaining the integrity of their environment or their hosts' well-being. Rationally managing these microbial communities requires improving our ability to predict how different species assemblages affect the final species composition of the community. However, making such a prediction remains challenging because of our limited knowledge of the diverse physical, biochemical, and ecological processes governing microbial dynamics. To overcome this challenge, we present a deep learning framework that automatically learns the map between species assemblages and community compositions from training data only, without knowing any of the above processes. First, we systematically validate our framework using synthetic data generated by classical population dynamics models. Then, we apply our framework to data from in vitro and in vivo microbial communities, including ocean and soil microbiota, Drosophila melanogaster gut microbiota, and human gut and oral microbiota. We find that our framework learns to perform accurate out-of-sample predictions of complex community compositions from a small number of training samples. Our results demonstrate how deep learning can enable us to understand better and potentially manage complex microbial communities.
Affiliation(s)
- Sebastian Michel‐Mata
- Center for Applied Physics and Advanced Technology, Universidad Nacional Autónoma de México, Juriquilla, Mexico
- Department of Ecology and Evolutionary Biology, Princeton University, Princeton, New Jersey, USA
| | - Xu‐Wen Wang
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA
| | - Yang‐Yu Liu
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA
| | - Marco Tulio Angulo
- CONACyT - Institute of Mathematics, Universidad Nacional Autónoma de México, Juriquilla, Mexico
| |
|
17
|
Multimodal deep learning applied to classify healthy and disease states of human microbiome. Sci Rep 2022; 12:824. [PMID: 35039534 PMCID: PMC8763943 DOI: 10.1038/s41598-022-04773-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/11/2021] [Accepted: 12/30/2021] [Indexed: 12/22/2022] Open
Abstract
Metagenomic sequencing methods provide considerable genomic information regarding human microbiomes, enabling us to discover and understand microbial diseases. Compositional differences have been reported between patients and healthy people, which could be used in the diagnosis of patients. Despite significant progress in this regard, the accuracy of these tools needs to be improved for applications in diagnostics and therapeutics. MDL4Microbiome, the method developed herein, demonstrated high accuracy in predicting disease status by using various features from metagenome sequences and a multimodal deep learning model. We propose combining three different features, i.e., conventional taxonomic profiles, genome-level relative abundance, and metabolic functional characteristics, to enhance classification accuracy. This deep learning model enabled the construction of a classifier that combines these various modalities encoded in the human microbiome. We achieved accuracies of 0.98, 0.76, 0.84, and 0.97 for predicting patients with inflammatory bowel disease, type 2 diabetes, liver cirrhosis, and colorectal cancer, respectively; these are comparable or higher than classical machine learning methods. A deeper analysis was also performed on the resulting sets of selected features to understand the contribution of their different characteristics. MDL4Microbiome is a classifier with higher or comparable accuracy compared with other machine learning methods, which offers perspectives on feature generation with metagenome sequences in deep learning models and their advantages in the classification of host disease status.
|
18
|
Curry KD, Nute MG, Treangen TJ. It takes guts to learn: machine learning techniques for disease detection from the gut microbiome. Emerg Top Life Sci 2021; 5:815-827. [PMID: 34779841 PMCID: PMC8786294 DOI: 10.1042/etls20210213] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 08/17/2021] [Revised: 09/29/2021] [Accepted: 10/06/2021] [Indexed: 02/01/2023]
Abstract
Associations between the human gut microbiome and expression of host illness have been noted in a variety of conditions ranging from gastrointestinal dysfunctions to neurological deficits. Machine learning (ML) methods have generated promising results for disease prediction from gut metagenomic information for diseases including liver cirrhosis and irritable bowel disease, but have lacked efficacy when predicting other illnesses. Here, we review current ML methods designed for disease classification from microbiome data. We highlight the computational challenges these methods have effectively overcome and discuss the biological components that have been overlooked to offer perspectives on future work in this area.
Affiliation(s)
- Kristen D. Curry
- Department of Computer Science, Rice University, Houston, TX 77005, USA
| | - Michael G. Nute
- Department of Computer Science, Rice University, Houston, TX 77005, USA
| | - Todd J. Treangen
- Department of Computer Science, Rice University, Houston, TX 77005, USA
| |
|
19
|
Decoding gut microbiota by imaging analysis of fecal samples. iScience 2021; 24:103481. [PMID: 34927025 PMCID: PMC8652011 DOI: 10.1016/j.isci.2021.103481] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/23/2019] [Revised: 09/21/2021] [Accepted: 11/19/2021] [Indexed: 01/09/2023] Open
Abstract
The gut microbiota plays a crucial role in maintaining health. Monitoring the complex dynamics of its microbial population is, therefore, important. Here, we present a deep convolution network that can characterize the dynamic changes in the gut microbiota using low-resolution images of fecal samples. Further, we demonstrate that the microbial relative abundances, quantified via 16S rRNA amplicon sequencing, can be quantitatively predicted by the neural network. Our approach provides a simple and inexpensive method of gut microbiota analysis.
- A deep convolution network classifies gut microbiota based on fecal sample images.
- Image-based quantitative prediction of gut microbiota composition is demonstrated.
- This result provides a simple and inexpensive method of gut microbiota analysis.
|
20
|
Narayana JK, Mac Aogáin M, Goh WWB, Xia K, Tsaneva-Atanasova K, Chotirmall SH. Mathematical-based microbiome analytics for clinical translation. Comput Struct Biotechnol J 2021; 19:6272-6281. [PMID: 34900137 PMCID: PMC8637001 DOI: 10.1016/j.csbj.2021.11.029] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/25/2021] [Revised: 11/17/2021] [Accepted: 11/17/2021] [Indexed: 12/20/2022] Open
Abstract
Traditionally, human microbiology has been strongly built on the laboratory-focused culture of microbes isolated from human specimens in patients with acute or chronic infection. These approaches primarily view human disease through the lens of a single species and its relevant clinical setting; however, such approaches fail to account for the surrounding environment and the wide microbial diversity that exists in vivo. Given the emergence of next-generation sequencing technologies and advancing bioinformatic pipelines, researchers now have unprecedented capabilities to characterise the human microbiome in terms of its taxonomy, function, antibiotic resistance and even bacteriophages. Despite this, the analysis of microbial communities has largely been restricted to ordination, ecological measures, and discriminant taxa analysis. This is predominantly due to a lack of suitable computational tools to facilitate microbiome analytics. In this review, we first evaluate the key concerns related to the inherent structure of microbiome datasets, which include compositionality and batch effects. We describe the available and emerging analytical techniques, including integrative analysis, machine learning, microbial association networks, topological data analysis (TDA) and mathematical modelling. We also present how these methods may translate to clinical settings, including tools for implementation. Mathematical-based analytics for microbiome analysis represents a promising avenue for clinical translation across a range of acute and chronic disease states.
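The abstract above names compositionality as a key structural concern in microbiome datasets. One widely used way to address it is the centered log-ratio (CLR) transform; the Python sketch below is a minimal illustration of that transform and is not taken from the review itself. The function name `clr`, the pseudocount value, and the toy counts are assumptions made for the example.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    Illustrative helper (hypothetical name). A pseudocount is added so that
    zero counts do not produce log(0); its value is an arbitrary assumption.
    """
    shifted = [c + pseudocount for c in counts]
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log centers the sample, so the values sum to zero.
    return [value - mean_log for value in logs]


# Toy sample (made up): read counts for four taxa, one of them unobserved.
sample = [120, 30, 0, 850]
transformed = clr(sample)
print([round(v, 3) for v in transformed])
```

Because each value is expressed relative to the sample's geometric mean, multiplying all counts by a constant (for example, a deeper sequencing run) leaves the transformed values unchanged when no pseudocount is used.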
Affiliation(s)
- Jayanth Kumar Narayana
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
| | - Micheál Mac Aogáin
- Biochemical Genetics Laboratory, Department of Biochemistry, St. James’s Hospital, Dublin, Ireland
- Clinical Biochemistry Unit, School of Medicine, Trinity College Dublin, Dublin, Ireland
| | - Wilson Wen Bin Goh
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| | - Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
| | - Krasimira Tsaneva-Atanasova
- Department of Mathematics & Living Systems Institute, College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter EX4 4QF, UK
| | - Sanjay H. Chotirmall
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- Department of Respiratory and Critical Care Medicine, Tan Tock Seng Hospital, Singapore
| |
|
21
|
Deng Z, Zhang J, Li J, Zhang X. Application of Deep Learning in Plant-Microbiota Association Analysis. Front Genet 2021; 12:697090. [PMID: 34691142 PMCID: PMC8531731 DOI: 10.3389/fgene.2021.697090] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 04/19/2021] [Accepted: 08/31/2021] [Indexed: 01/04/2023] Open
Abstract
Unraveling the association between the microbiome and plant phenotype can illustrate the effect of the microbiome on its host and guide agricultural management. Adequate identification of species and appropriate choice of models are two challenges in microbiome data analysis. Computational models of microbiome data could help in association analysis between the microbiome and the plant host. Deep learning methods have been widely used to model microbiome data because of their strength in handling complex, sparse, noisy, and high-dimensional data. Here, we review the analytic strategies in microbiome data analysis and describe the applications of deep learning models for plant–microbiome correlation studies. We also introduce application cases of different models in plant–microbiome correlation analysis and discuss how to adapt the models to the critical steps in data processing. In terms of data processing, model structure, and operating principle, most deep learning models are suitable for plant microbiome data analysis. The capacity for feature representation and pattern recognition is the advantage of deep learning methods in modeling and interpretation for association analysis. Based on published computational experiments, convolutional neural networks and graph neural networks could be recommended for plant microbiome analysis.
Affiliation(s)
- Zhiyu Deng
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China; Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China; University of Chinese Academy of Sciences, Beijing, China
| | - Jinming Zhang
- Department of Infectious Diseases, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China
| | - Junya Li
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China; Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China; University of Chinese Academy of Sciences, Beijing, China
| | - Xiujun Zhang
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China; Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China
| |
|
22
|
Chrisman BS, Paskov KM, Stockham N, Jung JY, Varma M, Washington PY, Tataru C, Iwai S, DeSantis TZ, David M, Wall DP. Improved detection of disease-associated gut microbes using 16S sequence-based biomarkers. BMC Bioinformatics 2021; 22:509. [PMID: 34666677 PMCID: PMC8527694 DOI: 10.1186/s12859-021-04427-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/20/2020] [Accepted: 10/06/2021] [Indexed: 12/31/2022] Open
Abstract
Background: Sequencing partial 16S rRNA genes is a cost-effective method for quantifying the microbial composition of an environment, such as the human gut. However, downstream analysis relies on binning reads into microbial groups by either considering each unique sequence as a different microbe, querying a database to get taxonomic labels from sequences, or clustering similar sequences together. These approaches do not fully capture evolutionary relationships between microbes, limiting the ability to identify differentially abundant groups of microbes between a diseased and control cohort. We present sequence-based biomarkers (SBBs), an aggregation method that groups and aggregates microbes using single variants and combinations of variants within their 16S sequences. We compare SBBs against other existing aggregation methods (OTU clustering and MicroPheno or DiTaxa features) in several benchmarking tasks: biomarker discovery via permutation test, biomarker discovery via linear discriminant analysis, and phenotype prediction power. We demonstrate that SBBs perform on par with or better than the state-of-the-art methods in biomarker discovery and phenotype prediction.
Results: On two independent datasets, SBBs identify differentially abundant groups of microbes with similar or higher statistical significance than existing methods, both in a permutation-test-based analysis and using linear discriminant analysis effect size. By grouping microbes by SBB, we can identify several differentially abundant microbial groups (FDR < 0.1) between children with autism and neurotypical controls in a set of 115 discordant siblings. Porphyromonadaceae, Ruminococcaceae, and an unnamed species of Blastocystis were significantly enriched in autism, while Veillonellaceae was significantly depleted. Likewise, aggregating microbes by SBB on a dataset of obese and lean twins, we find several significantly differentially abundant microbial groups (FDR < 0.1). We observed Megasphaera and Sutterellaceae highly enriched in obesity, and Phocaeicola significantly depleted. SBBs also perform on par with or better than existing aggregation methods as features in a phenotype prediction model, predicting the autism phenotype with an ROC-AUC score of 0.64 and the obesity phenotype with an ROC-AUC score of 0.84.
Conclusions: SBBs provide a powerful method for aggregating microbes to perform differential abundance analysis as well as phenotype prediction. Our source code can be freely downloaded from http://github.com/briannachrisman/16s_biomarkers.
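The abstract above mentions biomarker discovery via a permutation test. The following Python sketch shows the general shape of a two-sided permutation test on the difference in mean relative abundance of one microbial group between cases and controls. It is a generic illustration under stated assumptions, not the authors' implementation: the function name `permutation_p_value`, the number of permutations, the seed, and the toy abundances are all hypothetical.

```python
import random


def permutation_p_value(case, control, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Illustrative helper (hypothetical name). Labels are shuffled n_perm times
    and the p-value is the share of shuffles whose absolute mean difference
    is at least as large as the observed one (with the +1 correction).
    """
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(mean(case) - mean(control))
    pooled = list(case) + list(control)
    n_case = len(case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_case]) - mean(pooled[n_case:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy relative abundances of one aggregated group (made-up numbers).
case_abundance = [0.30, 0.32, 0.28, 0.35, 0.31]
control_abundance = [0.10, 0.12, 0.09, 0.11, 0.13]
p_value = permutation_p_value(case_abundance, control_abundance)
print(p_value)
```

When many groups are tested at once, as in the study, the resulting p-values would additionally need a multiple-testing correction such as the false discovery rate (FDR) threshold the abstract reports.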
Affiliation(s)
- Brianna S Chrisman
- Department of Bioengineering, Stanford University, Serra Mall, Stanford, USA
| | - Kelley M Paskov
- Department of Biomedical Data Science, Stanford University, Serra Mall, Stanford, USA
| | - Nate Stockham
- Department of Neuroscience, Stanford University, Serra Mall, Stanford, USA
| | - Jae-Yoon Jung
- Department of Biomedical Data Science, Stanford University, Serra Mall, Stanford, USA
| | - Maya Varma
- Department of Computer Science, Stanford University, Serra Mall, Stanford, USA
| | - Peter Y Washington
- Department of Bioengineering, Stanford University, Serra Mall, Stanford, USA
| | - Christine Tataru
- Department of Computer Science, Oregon State University, SW Campus Way, Corvallis, USA
| | - Shoko Iwai
- Second Genome Inc, Allerton Ave, Brisbane, USA
| | | | - Maude David
- Department of Microbiology, Oregon State University, SW Campus Way, Corvallis, USA
| | - Dennis P Wall
- Department of Biomedical Data Science, Stanford University, Serra Mall, Stanford, USA; Department of Pediatrics (Systems Medicine), Stanford University, 1265 Welch Road, Stanford, USA
| |
|
23
|
Zhao Z, Woloszynek S, Agbavor F, Mell JC, Sokhansanj BA, Rosen GL. Learning, visualizing and exploring 16S rRNA structure using an attention-based deep neural network. PLoS Comput Biol 2021; 17:e1009345. [PMID: 34550967 PMCID: PMC8496832 DOI: 10.1371/journal.pcbi.1009345] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/14/2020] [Revised: 10/07/2021] [Accepted: 08/12/2021] [Indexed: 01/04/2023] Open
Abstract
Recurrent neural networks with memory and attention mechanisms are widely used in natural language processing because they can capture short and long term sequential information for diverse tasks. We propose an integrated deep learning model for microbial DNA sequence data, which exploits convolutional neural networks, recurrent neural networks, and attention mechanisms to predict taxonomic classifications and sample-associated attributes, such as the relationship between the microbiome and host phenotype, on the read/sequence level. In this paper, we develop this novel deep learning approach and evaluate its application to amplicon sequences. We apply our approach to short DNA reads and full sequences of 16S ribosomal RNA (rRNA) marker genes, which identify the heterogeneity of a microbial community sample. We demonstrate that our implementation of a novel attention-based deep network architecture, Read2Pheno, achieves read-level phenotypic prediction. Training Read2Pheno models will encode sequences (reads) into dense, meaningful representations: learned embedded vectors output from the intermediate layer of the network model, which can provide biological insight when visualized. The attention layer of Read2Pheno models can also automatically identify nucleotide regions in reads/sequences which are particularly informative for classification. As such, this novel approach can avoid pre/post-processing and manual interpretation required with conventional approaches to microbiome sequence classification. We further show, as proof-of-concept, that aggregating read-level information can robustly predict microbial community properties, host phenotype, and taxonomic classification, with performance at least comparable to conventional approaches. An implementation of the attention-based deep learning network is available at https://github.com/EESI/sequence_attention (a python package) and https://github.com/EESI/seq2att (a command line tool).
Affiliation(s)
- Zhengqiao Zhao
- Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Stephen Woloszynek
- Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America
- Harvard Medical School, Boston, Massachusetts, United States of America
| | - Felix Agbavor
- School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Joshua Chang Mell
- College of Medicine, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Bahrad A. Sokhansanj
- Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Gail L. Rosen
- Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
| |
|
24
|
García-Jiménez B, Muñoz J, Cabello S, Medina J, Wilkinson MD. Predicting microbiomes through a deep latent space. Bioinformatics 2021; 37:1444-1451. [PMID: 33289510 PMCID: PMC8208755 DOI: 10.1093/bioinformatics/btaa971] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/17/2020] [Revised: 10/21/2020] [Accepted: 11/06/2020] [Indexed: 12/28/2022] Open
Abstract
Motivation: Microbial communities influence their environment by modifying the availability of compounds, such as nutrients or chemical elicitors. Knowing the microbial composition of a site is therefore relevant to improve productivity or health. However, sequencing facilities are not always available, or may be prohibitively expensive in some cases. Thus, it would be desirable to computationally predict the microbial composition from more accessible, easily-measured features.
Results: Integrating deep learning techniques with microbiome data, we propose an artificial neural network architecture based on heterogeneous autoencoders to condense the long vector of microbial abundance values into a deep latent space representation. Then, we design a model to predict the deep latent space and, consequently, to predict the complete microbial composition using environmental features as input. The performance of our system is examined using the rhizosphere microbiome of maize. We reconstruct the microbial composition (717 taxa) from the deep latent space (10 values) with high fidelity (>0.9 Pearson correlation). We then successfully predict microbial composition from environmental variables, such as plant age, temperature or precipitation (0.73 Pearson correlation, 0.42 Bray–Curtis). We extend this to predict microbiome composition under hypothetical scenarios, such as future climate change conditions. Finally, via transfer learning, we predict microbial composition in a distinct scenario with only 100 sequences, and distinct environmental features. We propose that our deep latent space may assist microbiome-engineering strategies when technical or financial resources are limited, through predicting current or future microbiome compositions.
Availability and implementation: Software, results and data are available at https://github.com/jorgemf/DeepLatentMicrobiome
Supplementary information: Supplementary data are available at Bioinformatics online.
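The abstract above scores predicted microbial compositions against observed ones using Pearson correlation and Bray–Curtis. For readers unfamiliar with the two metrics, here is a minimal Python sketch of both, assuming that the Bray–Curtis figure refers to the usual dissimilarity (0 for identical profiles, 1 for profiles with no shared taxa). The function names and the toy abundance vectors are illustrative assumptions and do not come from the DeepLatentMicrobiome repository.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


# Toy relative abundances for five taxa (made-up numbers).
observed = [0.40, 0.25, 0.20, 0.10, 0.05]
predicted = [0.35, 0.30, 0.15, 0.15, 0.05]
print(round(pearson(observed, predicted), 3))
print(round(bray_curtis(observed, predicted), 3))
```

The two metrics answer different questions: Pearson correlation rewards matching the overall shape of the profile, while Bray–Curtis penalizes the absolute amount of abundance assigned to the wrong taxa.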
Affiliation(s)
- Beatriz García-Jiménez
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
| | - Jorge Muñoz
- Serendeepia Research, 28905 Getafe (Madrid), Spain
| | - Sara Cabello
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
| | - Joaquín Medina
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
| | - Mark D Wilkinson
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain; Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid (UPM), Madrid, Spain
| |
|
25
|
Bahai A, Asgari E, Mofrad MRK, Kloetgen A, McHardy AC. EpitopeVec: Linear Epitope Prediction Using Deep Protein Sequence Embeddings. Bioinformatics 2021; 37:4517-4525. [PMID: 34180989 PMCID: PMC8652027 DOI: 10.1093/bioinformatics/btab467] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 12/12/2020] [Revised: 05/28/2021] [Accepted: 06/25/2021] [Indexed: 11/19/2022] Open
Abstract
Motivation: B-cell epitopes (BCEs) play a pivotal role in the development of peptide vaccines, immuno-diagnostic reagents and antibody production, and thus in infectious disease prevention and diagnostics in general. Experimental methods used to determine BCEs are costly and time-consuming. Therefore, it is essential to develop computational methods for the rapid identification of BCEs. Although several computational methods have been developed for this task, generalizability is still a major concern, where cross-testing of the classifiers trained and tested on different datasets has revealed accuracies of 51–53%.
Results: We describe a new method called EpitopeVec, which uses a combination of residue properties, modified antigenicity scales, and protein language model-based representations (protein vectors) as features of peptides for linear BCE predictions. Extensive benchmarking of EpitopeVec and other state-of-the-art methods for linear BCE prediction on several large and small datasets, as well as cross-testing, demonstrated an improvement in the performance of EpitopeVec over other methods in terms of accuracy and area under the curve. As the predictive performance depended on the species origin of the respective antigens (viral, bacterial and eukaryotic), we also trained our method on a large viral dataset to create a dedicated linear viral BCE predictor with improved cross-testing performance.
Availability and implementation: The software is available at https://github.com/hzi-bifo/epitope-prediction.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Akash Bahai
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124 Braunschweig, Germany; Braunschweig Integrated Center of Systems Biology (BRICS), Technische Universität Braunschweig, Rebenring 56, 38106 Braunschweig
| | - Ehsaneddin Asgari
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124 Braunschweig, Germany; Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, 94720, USA
| | - Mohammad R K Mofrad
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, 94720, USA; Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Lab, Berkeley, CA 94720, USA
| | - Andreas Kloetgen
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124 Braunschweig, Germany
| | - Alice C McHardy
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124 Braunschweig, Germany; Braunschweig Integrated Center of Systems Biology (BRICS), Technische Universität Braunschweig, Rebenring 56, 38106 Braunschweig
| |
|
26
|
Chen X, Liu L, Zhang W, Yang J, Wong KC. Human host status inference from temporal microbiome changes via recurrent neural networks. Brief Bioinform 2021; 22:6307015. [PMID: 34151933 DOI: 10.1093/bib/bbab223] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/24/2021] [Revised: 04/21/2021] [Accepted: 04/21/2021] [Indexed: 01/04/2023] Open
Abstract
With the rapid increase in sequencing data, human host status inference (e.g. healthy or sick) from microbiome data has become an important issue. Existing studies are mostly based on single-point microbiome composition, while it is rare that the host status is predicted from longitudinal microbiome data. However, single-point-based methods cannot capture the dynamic patterns between the temporal changes and host status. Therefore, it remains challenging to build good predictive models as well as scaling to different microbiome contexts. On the other hand, existing methods are mainly targeted for disease prediction and seldom investigate other host statuses. To fill the gap, we propose a comprehensive deep learning-based framework that utilizes longitudinal microbiome data as input to infer the human host status. Specifically, the framework is composed of specific data preparation strategies and a recurrent neural network tailored for longitudinal microbiome data. In experiments, we evaluated the proposed method on both semi-synthetic and real datasets based on different sequencing technologies and metagenomic contexts. The results indicate that our method achieves robust performance compared to other baseline and state-of-the-art classifiers and provides a significant reduction in prediction time.
Affiliation(s)
- Xingjian Chen
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Lingjing Liu
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Weitong Zhang
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Jianyi Yang
  - School of Mathematical Sciences, Nankai University, Tianjin, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
27
Carrieri AP, Haiminen N, Maudsley-Barton S, Gardiner LJ, Murphy B, Mayes AE, Paterson S, Grimshaw S, Winn M, Shand C, Hadjidoukas P, Rowe WPM, Hawkins S, MacGuire-Flanagan A, Tazzioli J, Kenny JG, Parida L, Hoptroff M, Pyzer-Knapp EO. Explainable AI reveals changes in skin microbiome composition linked to phenotypic differences. Sci Rep 2021; 11:4565. [PMID: 33633172 PMCID: PMC7907326 DOI: 10.1038/s41598-021-83922-6] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 10/14/2020] [Accepted: 02/08/2021] [Indexed: 02/06/2023] Open
Abstract
Alterations in the human microbiome have been observed in a variety of conditions such as asthma, gingivitis, dermatitis and cancer, and much remains to be learned about the links between the microbiome and human health. The fusion of artificial intelligence with rich microbiome datasets can offer an improved understanding of the microbiome's role in human health. To gain actionable insights it is essential to consider both the predictive power and the transparency of the models by providing explanations for the predictions. We combine the collection of leg skin microbiome samples from two healthy cohorts of women with the application of an explainable artificial intelligence (EAI) approach that provides accurate predictions of phenotypes with explanations. The explanations are expressed in terms of variations in the relative abundance of key microbes that drive the predictions. We predict skin hydration, subject's age, pre/post-menopausal status and smoking status from the leg skin microbiome. The changes in microbial composition linked to skin hydration can accelerate the development of personalized treatments for healthy skin, while those associated with age may offer insights into the skin aging process. The leg microbiome signatures associated with smoking and menopausal status are consistent with previous findings from oral/respiratory tract microbiomes and vaginal/gut microbiomes respectively. This suggests that easily accessible microbiome samples could be used to investigate health-related phenotypes, offering potential for non-invasive diagnosis and condition monitoring. Our EAI approach sets the stage for new work focused on understanding the complex relationships between microbial communities and phenotypes. Our approach can be applied to predict any condition from microbiome samples and has the potential to accelerate the development of microbiome-based personalized therapeutics and non-invasive diagnostics.
Affiliation(s)
- Anna Paola Carrieri
  - The Hartree Centre, Sci-Tech Daresbury, IBM Research, Daresbury, WA4 4AD, UK.
- Niina Haiminen
  - T.J. Watson Research Center, IBM Research, Yorktown Heights, NY, 10598, USA
- Sean Maudsley-Barton
  - The Hartree Centre, Sci-Tech Daresbury, IBM Research, Daresbury, WA4 4AD, UK
  - Department of Computing and Mathematics, Manchester Metropolitan University (MMU), Manchester, M15 6BH, UK
- Barry Murphy
  - Unilever Research & Development, Port Sunlight, CH63 3JW, UK
- Andrew E Mayes
  - Unilever Research and Development, Sharnbrook, MK44 1LQ, UK
- Sarah Paterson
  - Unilever Research & Development, Port Sunlight, CH63 3JW, UK
- Sally Grimshaw
  - Unilever Research & Development, Port Sunlight, CH63 3JW, UK
- Martyn Winn
  - Scientific Computing Department, STFC Daresbury Lab, Daresbury, WA4 4AD, UK
- Cameron Shand
  - The Hartree Centre, Sci-Tech Daresbury, IBM Research, Daresbury, WA4 4AD, UK
  - Department of Computer Science, University of Manchester (UoM), Manchester, M13 9LP, UK
- Stacy Hawkins
  - Unilever Research & Development, Trumbull, CT, 06611, USA
- Jane Tazzioli
  - Unilever Research & Development, Trumbull, CT, 06611, USA
- John G Kenny
  - Institute of Integrative Biology, The University of Liverpool, The Bioscience Building, Liverpool, L69 7ZB, UK
- Laxmi Parida
  - T.J. Watson Research Center, IBM Research, Yorktown Heights, NY, 10598, USA
28
Moreno-Indias I, Lahti L, Nedyalkova M, Elbere I, Roshchupkin G, Adilovic M, Aydemir O, Bakir-Gungor B, Santa Pau ECD, D’Elia D, Desai MS, Falquet L, Gundogdu A, Hron K, Klammsteiner T, Lopes MB, Marcos-Zambrano LJ, Marques C, Mason M, May P, Pašić L, Pio G, Pongor S, Promponas VJ, Przymus P, Saez-Rodriguez J, Sampri A, Shigdel R, Stres B, Suharoschi R, Truu J, Truică CO, Vilne B, Vlachakis D, Yilmaz E, Zeller G, Zomer AL, Gómez-Cabrero D, Claesson MJ. Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions. Front Microbiol 2021; 12:635781. [PMID: 33692771 PMCID: PMC7937616 DOI: 10.3389/fmicb.2021.635781] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Received: 11/30/2020] [Accepted: 01/28/2021] [Indexed: 12/23/2022] Open
Abstract
The human microbiome has emerged as a central research topic in human biology and biomedicine. Current microbiome studies generate high-throughput omics data across different body sites, populations, and life stages. While many of the challenges in microbiome research are similar to those in other high-throughput studies, the quantitative analyses need to address the heterogeneity of data, specific statistical properties, and the remarkable variation in microbiome composition across individuals and body sites. This has led to a broad spectrum of statistical and machine learning challenges that range from study design, data processing, and standardization to analysis, modeling, cross-study comparison, prediction, data science ecosystems, and reproducible reporting. Nevertheless, although many statistics and machine learning approaches and tools have been developed, new techniques are needed to deal with emerging applications and the vast heterogeneity of microbiome data. We review and discuss emerging applications of statistical and machine learning techniques in human microbiome studies and introduce the COST Action CA18131 "ML4Microbiome" that brings together microbiome researchers and machine learning experts to address current challenges such as standardization of analysis pipelines for reproducibility of data analysis results, benchmarking, improvement, or development of existing and new tools and ontologies.
Affiliation(s)
- Isabel Moreno-Indias
  - Instituto de Investigación Biomédica de Málaga (IBIMA), Unidad de Gestión Clínica de Endocrinología y Nutrición, Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
  - Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
- Leo Lahti
  - Department of Computing, University of Turku, Turku, Finland
- Miroslava Nedyalkova
  - Human Genetics and Disease Mechanisms, Latvian Biomedical Research and Study Centre, Riga, Latvia
- Ilze Elbere
  - Latvian Biomedical Research and Study Centre, Riga, Latvia
- Muhamed Adilovic
  - Department of Genetics and Bioengineering, International University of Sarajevo, Sarajevo, Bosnia and Herzegovina
- Onder Aydemir
  - Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Turkey
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Domenica D’Elia
  - Department for Biomedical Sciences, Institute for Biomedical Technologies, National Research Council, Bari, Italy
- Mahesh S. Desai
  - Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
  - Odense Research Center for Anaphylaxis, Department of Dermatology and Allergy Center, Odense University Hospital, University of Southern Denmark, Odense, Denmark
- Laurent Falquet
  - Department of Biology, University of Fribourg, Fribourg, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Aycan Gundogdu
  - Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Turkey
  - Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Turkey
- Karel Hron
  - Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
- Marta B. Lopes
  - NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
  - Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Cláudia Marques
  - CINTESIS, NOVA Medical School, NMS, Universidade Nova de Lisboa, Lisbon, Portugal
- Michael Mason
  - Computational Oncology, Sage Bionetworks, Seattle, WA, United States
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Lejla Pašić
  - Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Gianvito Pio
  - Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Sándor Pongor
  - Faculty of Information Technology and Bionics, Pázmány University, Budapest, Hungary
- Vasilis J. Promponas
  - Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus
- Piotr Przymus
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Julio Saez-Rodriguez
  - Institute of Computational Biomedicine, Heidelberg University, Faculty of Medicine and Heidelberg University Hospital, Heidelberg, Germany
- Alexia Sampri
  - Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Blaz Stres
  - Jozef Stefan Institute, Ljubljana, Slovenia
  - Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
  - Faculty of Civil and Geodetic Engineering, University of Ljubljana, Ljubljana, Slovenia
- Ramona Suharoschi
  - Molecular Nutrition and Proteomics Lab, Faculty of the Food Science and Technology, Institute of Life Sciences, University of Agricultural Sciences and Veterinary Medicine of Cluj-Napoca, Cluj-Napoca, Romania
- Jaak Truu
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Ciprian-Octavian Truică
  - Department of Computer Science and Engineering, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Bucharest, Romania
- Baiba Vilne
  - Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
- Dimitrios Vlachakis
  - Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
- Ercument Yilmaz
  - Department of Computer Technologies, Karadeniz Technical University, Trabzon, Turkey
- Georg Zeller
  - European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg, Germany
- Aldert L. Zomer
  - Department of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, Netherlands
- David Gómez-Cabrero
  - Navarrabiomed, Complejo Hospitalario de Navarra (CHN), IdiSNA, Universidad Pública de Navarra (UPNA), Pamplona, Spain
- Marcus J. Claesson
  - School of Microbiology and APC Microbiome Ireland, University College Cork, Cork, Ireland
29
Song K, Wright FA, Zhou YH. Systematic Comparisons for Composition Profiles, Taxonomic Levels, and Machine Learning Methods for Microbiome-Based Disease Prediction. Front Mol Biosci 2020; 7:610845. [PMID: 33392266 PMCID: PMC7772236 DOI: 10.3389/fmolb.2020.610845] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/27/2020] [Accepted: 11/25/2020] [Indexed: 12/12/2022] Open
Abstract
Microbiome composition profiles generated from 16S rRNA sequencing have been extensively studied for their usefulness in phenotype trait prediction, including for complex diseases such as diabetes and obesity. These microbiome compositions have typically been quantified in the form of Operational Taxonomic Unit (OTU) count matrices. However, alternate approaches such as Amplicon Sequence Variants (ASV) have been used, as well as the direct use of k-mer sequence counts. The overall effect of these different types of predictors when used in concert with various machine learning methods has been difficult to assess, due to varied combinations described in the literature. Here we provide an in-depth investigation of more than 1,000 combinations of these three clustering/counting methods, in combination with varied choices for normalization and filtering, grouping at various taxonomic levels, and the use of more than ten commonly used machine learning methods for phenotype prediction. The use of short k-mers, which have computational advantages and conceptual simplicity, is shown to be effective as a source for microbiome-based prediction. Among machine-learning approaches, tree-based methods show consistent, though modest, advantages in prediction accuracy. We describe the various advantages and disadvantages of combinations in analysis approaches, and provide general observations to serve as a useful guide for future trait-prediction explorations using microbiome data.
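As an illustration of the k-mer predictors mentioned in this abstract, the following is a minimal sketch of how a relative-frequency k-mer profile could be computed from a single read. It is not the authors' pipeline; the function name `kmer_profile`, the default `k`, and the handling of ambiguous bases are assumptions made only for this example.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence: str, k: int = 3) -> dict[str, float]:
    """Return the relative frequency of every DNA k-mer in `sequence`.

    Hypothetical helper for illustration: windows containing characters
    other than A, C, G, T (e.g. N) are skipped rather than imputed.
    """
    sequence = sequence.upper()
    alphabet = set("ACGT")
    # Count each length-k window made up only of unambiguous bases.
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= alphabet
    )
    total = sum(counts.values())
    # Emit all 4**k possible k-mers so every sample has the same feature set.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGT", repeat=k)
    }


profile = kmer_profile("ACGTACGTAC", k=2)
```

Stacking such profiles row-wise over all samples gives a fixed-width feature matrix (4^k columns) that can be passed to any standard classifier, which is one reason short k-mers are computationally convenient.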
Affiliation(s)
- Kuncheng Song
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
- Fred A Wright
  - Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, NC, United States
- Yi-Hui Zhou
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
30
Asgari E, Münch PC, Lesker TR, McHardy AC, Mofrad MRK. DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection. Bioinformatics 2020; 35:2498-2500. [PMID: 30500871 DOI: 10.1093/bioinformatics/bty954] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/10/2018] [Revised: 09/21/2018] [Accepted: 11/28/2018] [Indexed: 11/14/2022] Open
Abstract
SUMMARY: Identifying distinctive taxa for microbiome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free, subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, substitutes standard operational taxonomic unit (OTU)-clustering by segmenting 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to the state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively with the k-mer based state-of-the-art approach in phenotype prediction while outperforming the OTU-based state-of-the-art approach in finding biomarkers in both resolution and coverage, evaluated over known links from literature and synthetic benchmark datasets. AVAILABILITY AND IMPLEMENTATION: DiTaxa is available under the Apache 2 license at http://llp.berkeley.edu/ditaxa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
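The idea of segmenting reads into frequent variable-length subsequences can be illustrated with a generic byte-pair-encoding-style merge loop. The sketch below is not DiTaxa's implementation; the function names `learn_merges` and `merge_pair`, the tie-breaking rule, and the toy reads are assumptions chosen only to show the general technique.

```python
from collections import Counter


def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace each non-overlapping occurrence of `pair` with one joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(reads: list[str], num_merges: int):
    """Greedily merge the most frequent adjacent token pair, `num_merges` times.

    Hypothetical sketch: every read starts as single nucleotides; ties between
    equally frequent pairs are broken lexicographically for determinism.
    """
    segmented = [list(read) for read in reads]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in segmented:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # every read is already a single token
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        segmented = [merge_pair(tokens, best) for tokens in segmented]
    return merges, segmented


merges, segmented = learn_merges(["ACGACG", "ACGT"], num_merges=2)
```

After enough merges, frequent motifs become single tokens, and token counts per sample can serve as features in place of OTU counts. Concatenating the tokens of a read always reproduces the original read, so the segmentation loses no sequence information.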
Affiliation(s)
- Ehsaneddin Asgari
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, USA.,Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, Germany
- Philipp C Münch
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, Germany.,Faculty of Medicine, LMU Munich, Max von Pettenkofer-Institute of Hygiene and Medical Microbiology, Munich, Germany
- Till R Lesker
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, Germany
- Alice C McHardy
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, Germany
- Mohammad R K Mofrad
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, USA.,Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Lab, Berkeley, CA, USA
31
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future directions of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
  - Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States.
32
Seneviratne CJ, Balan P, Suriyanarayanan T, Lakshmanan M, Lee DY, Rho M, Jakubovics N, Brandt B, Crielaard W, Zaura E. Oral microbiome-systemic link studies: perspectives on current limitations and future artificial intelligence-based approaches. Crit Rev Microbiol 2020; 46:288-299. [PMID: 32434436 DOI: 10.1080/1040841x.2020.1766414] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/16/2022]
Abstract
In the past decade, there has been a tremendous increase in studies on the link between oral microbiome and systemic diseases. However, variations in study design and confounding variables across studies often lead to inconsistent observations. In this narrative review, we have discussed the potential influence of study design and confounding variables on the current sequencing-based oral microbiome-systemic disease link studies. The current limitations of oral microbiome-systemic link studies on type 2 diabetes mellitus, rheumatoid arthritis, pregnancy, atherosclerosis, and pancreatic cancer are discussed in this review, followed by our perspective on how artificial intelligence (AI), particularly machine learning and deep learning approaches, can be employed for predicting systemic disease and host metadata from the oral microbiome. The application of AI for predicting systemic disease as well as host metadata requires the establishment of a global database repository with microbiome sequences and annotated host metadata. However, this task requires collective efforts from researchers working in the field of oral microbiome to establish more comprehensive datasets with appropriate host metadata. Development of AI-based models by incorporating consistent host metadata will allow prediction of systemic diseases with higher accuracies, bringing considerable clinical benefits.
Affiliation(s)
- Chaminda Jayampath Seneviratne
  - Singapore Oral Microbiomics Initiative (SOMI), National Dental Research Institute Singapore, National Dental Centre Singapore, Duke NUS Medical School, Singapore, Singapore
- Preethi Balan
  - Singapore Oral Microbiomics Initiative (SOMI), National Dental Research Institute Singapore, National Dental Centre Singapore, Duke NUS Medical School, Singapore, Singapore
- Tanujaa Suriyanarayanan
  - Singapore Oral Microbiomics Initiative (SOMI), National Dental Research Institute Singapore, National Dental Centre Singapore, Duke NUS Medical School, Singapore, Singapore
- Meiyappan Lakshmanan
  - Bioprocessing Technology Institute (BTI), A*STAR - Agency for Science, Technology and Research, Singapore, Singapore
- Dong-Yup Lee
  - Bioprocessing Technology Institute (BTI), A*STAR - Agency for Science, Technology and Research, Singapore, Singapore.,School of Chemical Engineering, Sungkyunkwan University, Jongno-gu, Republic of Korea
- Mina Rho
  - Departments of Computer Science and Engineering & Biomedical Informatics, Hanyang University, Seoul, Korea
- Nicholas Jakubovics
  - Oral Biology, School of Dental Sciences, Newcastle University, Newcastle upon Tyne, UK
- Bernd Brandt
  - Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, University of Amsterdam and Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Wim Crielaard
  - Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, University of Amsterdam and Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Egija Zaura
  - Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, University of Amsterdam and Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
33
Brester C, Ryzhikov I, Siponen S, Jayaprakash B, Ikonen J, Pitkänen T, Miettinen IT, Torvinen E, Kolehmainen M. Potential and limitations of a pilot-scale drinking water distribution system for bacterial community predictive modelling. The Science of the Total Environment 2020; 717:137249. [PMID: 32092807 DOI: 10.1016/j.scitotenv.2020.137249] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/10/2019] [Revised: 02/09/2020] [Accepted: 02/09/2020] [Indexed: 06/10/2023]
Abstract
Waterborne disease outbreaks are a persistent and serious threat to public health according to reported incidents across the globe. Online drinking water quality monitoring technologies have evolved substantially and have become more accurate and accessible. However, using online measurements alone is unsuitable for detecting microbial regrowth, potentially including harmful species, ahead of time in the distribution systems. Alternatively, observational data could be collected periodically, e.g. once per week or once per month and it could include a representative set of variables: physicochemical water characteristics, disinfectant concentrations, and bacterial abundances, which would be a valuable source of knowledge for predictive modelling that aims to reveal pathogen-related threats. In this study, we utilised data collected from a pilot-scale drinking water distribution system. A data-driven random forest model was used for predictive modelling and was trained for nowcasting and forecasting abundances of bacterial groups. In all the experiments, we followed the realistic crossline scenario, which means that when training and testing the models the data is collected from different pipelines. In spite of the more accurate results of the nowcasting, the 1-week forecasting still provided accurate predictions of the most abundant bacteria, their rapid increase and decrease. In the future predictive modelling might be used as a tool in designing control measures for opportunistic pathogens which are able to multiply in the favourable conditions in drinking water distribution systems (DWDS). Eventually, the forecasting information will be able to produce practically helpful data for controlling the DWDS regrowth.
Affiliation(s)
- Christina Brester
  - Department of Environmental and Biological Sciences, University of Eastern Finland, P.O. Box 1627, FI-70211 Kuopio, Finland.
- Ivan Ryzhikov
  - Department of Environmental and Biological Sciences, University of Eastern Finland, P.O. Box 1627, FI-70211 Kuopio, Finland
- Sallamaari Siponen
  - Department of Environmental and Biological Sciences, University of Eastern Finland, P.O. Box 1627, FI-70211 Kuopio, Finland
- Balamuralikrishna Jayaprakash
  - Department of Health Security, Expert Microbiology Unit, National Institute for Health and Welfare, P.O. Box 95, FI-70701 Kuopio, Finland
- Jenni Ikonen
  - Department of Health Security, Expert Microbiology Unit, National Institute for Health and Welfare, P.O. Box 95, FI-70701 Kuopio, Finland
- Tarja Pitkänen
  - Department of Health Security, Expert Microbiology Unit, National Institute for Health and Welfare, P.O. Box 95, FI-70701 Kuopio, Finland
- Ilkka T Miettinen
  - Department of Health Security, Expert Microbiology Unit, National Institute for Health and Welfare, P.O. Box 95, FI-70701 Kuopio, Finland
- Eila Torvinen
  - Department of Environmental and Biological Sciences, University of Eastern Finland, P.O. Box 1627, FI-70211 Kuopio, Finland
- Mikko Kolehmainen
  - Department of Environmental and Biological Sciences, University of Eastern Finland, P.O. Box 1627, FI-70211 Kuopio, Finland
34
Jin S, Zeng X, Xia F, Huang W, Liu X. Application of deep learning methods in biological networks. Brief Bioinform 2020; 22:1902-1917. [PMID: 32363401 DOI: 10.1093/bib/bbaa043] [Citation(s) in RCA: 84] [Impact Index Per Article: 21.0] [Received: 12/25/2019] [Revised: 02/19/2020] [Accepted: 03/05/2020] [Indexed: 01/07/2023] Open
Abstract
The increase in biological data and the formation of various biomolecule interaction databases enable us to obtain diverse biological networks. These biological networks provide a wealth of raw materials for further understanding of biological systems, the discovery of complex diseases and the search for therapeutic drugs. However, the increase in data also increases the difficulty of biological network analysis. Therefore, algorithms that can handle large, heterogeneous and complex data are needed to better analyze the data of these network structures and mine their useful information. Deep learning is a branch of machine learning that extracts more abstract features from a larger set of training data. Through the establishment of an artificial neural network with a network hierarchy structure, deep learning can extract and screen the input information layer by layer and has representation learning ability. The improved deep learning algorithm can be used to process complex and heterogeneous graph data structures and is increasingly being applied to the mining of network data information. In this paper, we first introduce the deep learning models used for network data. Afterwards, we summarize the application of deep learning to biological networks. Finally, we discuss the future development prospects of this field.
35
Machine learning methods for microbiome studies. J Microbiol 2020; 58:206-216. [DOI: 10.1007/s12275-020-0066-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 02/05/2020] [Revised: 02/17/2020] [Accepted: 02/17/2020] [Indexed: 12/12/2022]
36
LaPierre N, Ju CJT, Zhou G, Wang W. MetaPheno: A critical evaluation of deep learning and machine learning in metagenome-based disease prediction. Methods 2019; 166:74-82. [PMID: 30885720 PMCID: PMC6708502 DOI: 10.1016/j.ymeth.2019.03.003] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 12/01/2018] [Revised: 02/14/2019] [Accepted: 03/04/2019] [Indexed: 01/21/2023] Open
Abstract
The human microbiome plays a number of critical roles, impacting almost every aspect of human health and well-being. Conditions in the microbiome have been linked to a number of significant diseases. Additionally, revolutions in sequencing technology have led to a rapid increase in publicly-available sequencing data. Consequently, there have been growing efforts to predict disease status from metagenomic sequencing data, with a proliferation of new approaches in the last few years. Some of these efforts have explored utilizing a powerful form of machine learning called deep learning, which has been applied successfully in several biological domains. Here, we review some of these methods and the algorithms that they are based on, with a particular focus on deep learning methods. We also perform a deeper analysis of Type 2 Diabetes and obesity datasets that have eluded improved results, using a variety of machine learning and feature extraction methods. We conclude by offering perspectives on study design considerations that may impact results and future directions the field can take to improve results and offer more valuable conclusions. The scripts and extracted features for the analyses conducted in this paper are available via GitHub: https://github.com/nlapier2/metapheno.
Affiliation(s)
- Nathan LaPierre
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
| | - Chelsea J-T Ju
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
| | - Guangyu Zhou
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
| | - Wei Wang
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA.
| |
|
37
|
Zhou YH, Gallins P. A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction. Front Genet 2019; 10:579. [PMID: 31293616 PMCID: PMC6603228 DOI: 10.3389/fgene.2019.00579] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 01/13/2019] [Accepted: 06/04/2019] [Indexed: 12/19/2022] Open
Abstract
With the growing importance of microbiome research, there is increasing evidence that host variation in microbial communities is associated with overall host health. Advancement in genetic sequencing methods for microbiomes has coincided with improvements in machine learning, with important implications for disease risk prediction in humans. One aspect specific to microbiome prediction is the use of taxonomy-informed feature selection. In this review for non-experts, we explore the most commonly used machine learning methods, and evaluate their prediction accuracy as applied to microbiome host trait prediction. Methods are described at an introductory level, and R/Python code for the analyses is provided.
Affiliation(s)
- Yi-Hui Zhou
- Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
| | - Paul Gallins
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
| |
|
38
|
Qu K, Guo F, Liu X, Lin Y, Zou Q. Application of Machine Learning in Microbiology. Front Microbiol 2019; 10:827. [PMID: 31057526 PMCID: PMC6482238 DOI: 10.3389/fmicb.2019.00827] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 01/31/2019] [Accepted: 04/01/2019] [Indexed: 02/01/2023] Open
Abstract
Microorganisms are ubiquitous and closely related to people's daily lives. Since they were first discovered in the 19th century, researchers have shown great interest in microorganisms. People studied microorganisms through cultivation, but this method is expensive and time consuming. However, the cultivation method cannot keep pace with the development of high-throughput sequencing technology. To deal with this problem, machine learning (ML) methods have been widely applied to the field of microbiology. Literature reviews have shown that ML can be used in many aspects of microbiology research, especially classification problems, and for exploring the interaction between microorganisms and the surrounding environment. In this study, we summarize the application of ML in microbiology.
Affiliation(s)
- Kaiyang Qu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Xiangrong Liu
- School of Information Science and Technology, Xiamen University, Xiamen, China
| | - Yuan Lin
- School of Information Science and Technology, Xiamen University, Xiamen, China
- Department of System Integration, Sparebanken Vest, Bergen, Norway
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
39
|
Madani A, Bakhaty A, Kim J, Mubarak Y, Mofrad M. Bridging finite element and machine learning modeling: stress prediction of arterial walls in atherosclerosis. J Biomech Eng 2019; 141:2729617. [PMID: 30912802 DOI: 10.1115/1.4043290] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 12/04/2018] [Indexed: 11/08/2022]
Abstract
Finite element and machine learning modeling are two predictive paradigms that have rarely been bridged. In this study, we develop a parametric model to generate arterial geometries and accumulate a database of over 12,000 finite element simulations of mechanical behaviour and stress distribution in these arterial models representative of atherosclerotic plaques. We formulate the training data to predict the maximum von Mises stress which could indicate risk of plaque rupture. Trained deep learning models are able to accurately predict the max von Mises stress within 9.86% error on a held-out test set. The deep neural networks outperform alternative prediction models and performance scales with amount of training data. Lastly, we examine the importance of attributing features on stress value and location prediction to gain intuitions on the underlying process. Moreover, deep neural networks can capture the functional mapping described by the finite element method which has far-reaching implications for real-time and multi-scale prediction tasks in biomechanics.
Affiliation(s)
- Ali Madani
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California, United States of America
| | - Ahmed Bakhaty
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California, United States of America; Department of Civil Engineering, University of California, Berkeley, California, United States of America
| | - Jiwon Kim
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California, United States of America; Department of Electrical Engineering and Computer Science, University of California, Berkeley, California, United States of America
| | - Yara Mubarak
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California, United States of America; Department of Civil Engineering, University of California, Berkeley, California, United States of America
| | - Mohammad Mofrad
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California, United States of America; Molecular Biophysics and Integrative Bioimaging Division, Lawrence Berkeley National Lab, Berkeley, California, United States of America
| |
|
40
|
Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci Rep 2019; 9:3577. [PMID: 30837494 PMCID: PMC6401088 DOI: 10.1038/s41598-019-38746-w] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 07/27/2018] [Accepted: 12/19/2018] [Indexed: 12/28/2022] Open
Abstract
In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning tasks in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) in sequence pattern searching for nuclear localization signal. The DiMotif, in general, obtained high recall scores, while having a comparable F1 score with other methods in the discovery of experimentally verified motifs. Having high recall suggests that the DiMotif can be used for short-list creation for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences.
We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acids k-mer features.
|
41
|
Woloszynek S, Zhao Z, Chen J, Rosen GL. 16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses. PLoS Comput Biol 2019; 15:e1006721. [PMID: 30807567 PMCID: PMC6407789 DOI: 10.1371/journal.pcbi.1006721] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 05/04/2018] [Revised: 03/08/2019] [Accepted: 12/17/2018] [Indexed: 12/26/2022] Open
Abstract
Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding ("embedding") each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
Affiliation(s)
- Stephen Woloszynek
- Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Zhengqiao Zhao
- Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Jian Chen
- Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, New York, United States of America
| | - Gail L. Rosen
- Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
| |
|
42
|
McCord BR, Gauthier Q, Cho S, Roig MN, Gibson-Daw GC, Young B, Taglia F, Zapico SC, Mariot RF, Lee SB, Duncan G. Forensic DNA Analysis. Anal Chem 2019; 91:673-688. [PMID: 30485738 DOI: 10.1021/acs.analchem.8b05318] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Indexed: 12/14/2022]
Affiliation(s)
- Bruce R McCord
- Department of Chemistry, Florida International University, Miami, Florida 33199, United States
| | - Quentin Gauthier
- Department of Chemistry, Florida International University, Miami, Florida 33199, United States
| | - Sohee Cho
- Department of Forensic Medicine, Seoul National University, Seoul, 08826, South Korea
| | - Meghan N Roig
- Department of Chemistry, Florida International University, Miami, Florida 33199, United States
| | - Georgiana C Gibson-Daw
- Department of Chemistry, Florida International University, Miami, Florida 33199, United States
| | - Brian Young
- Niche Vision, Inc., Akron, Ohio 44311, United States
| | - Fabiana Taglia
- Department of Chemistry, Florida International University, Miami, Florida 33199, United States
| | - Sara C Zapico
- Department of Chemistry, Florida International University, Miami, Florida 33199, United States
| | - Roberta Fogliatto Mariot
- Department of Chemistry, Florida International University, Miami, Florida 33199, United States
| | - Steven B Lee
- Forensic Science Program, Justice Studies Department, San Jose State University, San Jose, California 95192, United States
| | - George Duncan
- Department of Chemistry, Florida International University, Miami, Florida 33199, United States
| |
|