1
Xu H, Wang T, Miao Y, Qian M, Yang Y, Wang S. MK-BMC: a Multi-Kernel framework with Boosted distance metrics for Microbiome data for Classification. Bioinformatics 2024; 40:btad757. [PMID: 38200571 PMCID: PMC10789312 DOI: 10.1093/bioinformatics/btad757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 10/30/2023] [Accepted: 01/09/2024] [Indexed: 01/12/2024] Open
Abstract
MOTIVATION Research on the human microbiome has suggested associations with human health, opening opportunities to predict health outcomes from microbiome data. Studies have also suggested that diverse forms of taxa, such as rare taxa that are evolutionarily related and abundant taxa that are evolutionarily unrelated, could be associated with or predictive of a health outcome. Although prediction models have been developed for microbiome data, none currently uses multiple forms of microbiome-outcome associations. RESULTS We developed MK-BMC, a Multi-Kernel framework with Boosted distance Metrics for Classification using microbiome data. We first boost widely used distance metrics for microbiome data using taxon-level association signal strengths, up-weighting taxa that are potentially associated with an outcome of interest. We then build a multi-kernel prediction model in which each kernel captures one form of association between taxa and the outcome; each kernel measures the similarity of microbiome compositions between pairs of samples and is obtained by transforming a boosted distance metric. Through extensive simulations, we demonstrated superior prediction performance of (i) boosted distance metrics over the original ones and (ii) MK-BMC over competing methods. We applied MK-BMC to predict thyroid, obesity, and inflammatory bowel disease status using gut microbiome data from the American Gut Project and observed much-improved prediction performance over that of competing methods. The learned kernel weights help quantify the contributions of individual microbiome signal forms. AVAILABILITY AND IMPLEMENTATION Source code together with a sample input dataset is available at https://github.com/HXu06/MK-BMC.
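The abstract describes two steps: weighting taxa inside a distance metric, then transforming the distance into a kernel. The sketch below illustrates that idea only and is not the MK-BMC implementation. It assumes a Bray-Curtis dissimilarity with per-taxon weights (the weights and the function names are hypothetical) and uses Gower double centering, K = -1/2 J D² J with J = I - 11ᵀ/n, which is one common distance-to-kernel transformation in the microbiome kernel literature. Whether MK-BMC uses exactly this transformation is not stated in the abstract.

```python
def weighted_bray_curtis(x, y, w):
    """Bray-Curtis dissimilarity with per-taxon weights w (weights are hypothetical)."""
    num = sum(wi * abs(xi - yi) for xi, yi, wi in zip(x, y, w))
    den = sum(wi * (xi + yi) for xi, yi, wi in zip(x, y, w))
    return num / den if den > 0 else 0.0


def distance_matrix(samples, w):
    """Pairwise weighted dissimilarities between all samples."""
    n = len(samples)
    return [[weighted_bray_curtis(samples[i], samples[j], w) for j in range(n)]
            for i in range(n)]


def gower_kernel(D):
    """Double-centre the squared distances: K = -1/2 * J D^2 J."""
    n = len(D)
    d2 = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in d2]   # equals column means, since D is symmetric
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


# Toy relative abundances for three samples and three taxa (made-up numbers).
samples = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
weights = [1.0, 2.0, 1.0]  # up-weight the second taxon (assumed to be outcome-associated)
D = distance_matrix(samples, weights)
K = gower_kernel(D)
```

For non-Euclidean dissimilarities the centred matrix can have negative eigenvalues; a positive semi-definite correction (for example, clipping negative eigenvalues to zero) is usually applied before the matrix is used as a kernel, and is omitted here.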
Affiliation(s)
- Huang Xu
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Tian Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, United States
- Yuqi Miao
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, United States
- Min Qian
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, United States
- Yaning Yang
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Shuang Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, United States

2
Regueira-Iglesias A, Balsa-Castro C, Blanco-Pintos T, Tomás I. Critical review of 16S rRNA gene sequencing workflow in microbiome studies: From primer selection to advanced data analysis. Mol Oral Microbiol 2023; 38:347-399. [PMID: 37804481 DOI: 10.1111/omi.12434] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/26/2023] [Revised: 09/01/2023] [Accepted: 09/14/2023] [Indexed: 10/09/2023]
Abstract
The multi-batch reanalysis approach of jointly reevaluating gene/genome sequences from different works has gained particular relevance in the literature in recent years. The large amount of 16S ribosomal ribonucleic acid (rRNA) gene sequence data stored in public repositories and information in taxonomic databases of the same gene far exceeds that related to complete genomes. This review is intended to guide researchers new to studying microbiota, particularly the oral microbiota, using 16S rRNA gene sequencing and those who want to expand and update their knowledge to optimise their decision-making and improve their research results. First, we describe the advantages and disadvantages of using the 16S rRNA gene as a phylogenetic marker and the latest findings on the impact of primer pair selection on diversity and taxonomic assignment outcomes in oral microbiome studies. Strategies for primer selection based on these results are introduced. Second, we identified the key factors to consider in selecting the sequencing technology and platform. The process and particularities of the main steps for processing 16S rRNA gene-derived data are described in detail to enable researchers to choose the most appropriate bioinformatics pipeline and analysis methods based on the available evidence. We then produce an overview of the different types of advanced analyses, both the most widely used in the literature and the most recent approaches. Several indices, metrics and software for studying microbial communities are included, highlighting their advantages and disadvantages. Considering the principles of clinical metagenomics, we conclude that future research should focus on rigorous analytical approaches, such as developing predictive models to identify microbiome-based biomarkers to classify health and disease states. Finally, we address the batch effect concept and the microbiome-specific methods for accounting for or correcting them.
Affiliation(s)
- Alba Regueira-Iglesias
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
- Carlos Balsa-Castro
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
- Triana Blanco-Pintos
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
- Inmaculada Tomás
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain

3
Li B, Wang T, Qian M, Wang S. MKMR: a multi-kernel machine regression model to predict health outcomes using human microbiome data. Brief Bioinform 2023; 24:7142722. [PMID: 37099694 DOI: 10.1093/bib/bbad158] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/28/2022] [Revised: 03/24/2023] [Accepted: 04/03/2023] [Indexed: 04/28/2023] Open
Abstract
Studies have found that the human microbiome is associated with and predictive of human health and diseases. Many statistical methods developed for microbiome data focus on different distance metrics that can capture various information in microbiomes. Prediction models were also developed for microbiome data, including deep learning methods with convolutional neural networks that consider both taxa abundance profiles and taxonomic relationships among microbial taxa from a phylogenetic tree. Studies have also suggested that a health outcome could associate with multiple forms of microbiome profiles. In addition to the abundance of some taxa that are associated with a health outcome, the presence/absence of some taxa is also associated with and predictive of the same health outcome. Moreover, associated taxa may be close to each other on a phylogenetic tree or spread apart on a phylogenetic tree. No prediction models currently exist that use multiple forms of microbiome-outcome associations. To address this, we propose a multi-kernel machine regression (MKMR) method that is able to capture various types of microbiome signals when doing predictions. MKMR utilizes multiple forms of microbiome signals through multiple kernels transformed from multiple distance metrics for microbiomes and learns an optimal conic combination of these kernels, with kernel weights helping us understand contributions of individual microbiome signal types. Simulation studies suggest a much-improved prediction performance over competing methods when microbiome signals are mixed. Real data applications to predict multiple health outcomes using throat and gut microbiome data also suggest better prediction by MKMR than by competing methods.
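A conic combination of kernels is a weighted sum with non-negative weights, K = Σ_m w_m K_m with every w_m ≥ 0, so the result is again a valid kernel when each K_m is. The sketch below shows only that combination step with fixed, made-up weights; how MKMR learns the weights is not described in the abstract and is not reproduced here. The function name is hypothetical.

```python
def combine_kernels(kernels, weights):
    """Return the conic combination sum_m w_m * K_m of equally sized kernel matrices."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("conic combination requires non-negative weights")
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
            for i in range(n)]


# Two toy 2x2 kernels standing in for, say, an abundance-based kernel
# and a presence/absence-based kernel (values are made up).
K_abundance = [[1.0, 0.0], [0.0, 1.0]]
K_presence = [[1.0, 1.0], [1.0, 1.0]]
K_combined = combine_kernels([K_abundance, K_presence], [0.5, 2.0])
```

The relative sizes of the weights indicate how much each signal form contributes to the combined similarity, which is the interpretive use of kernel weights the abstract refers to.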
Affiliation(s)
- Bing Li
  - Department of Biostatistics, School of Public Health, Brown University, Providence, Rhode Island, U.S.A
- Tian Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, 722 West 168th Street, New York, New York, 10032 U.S.A
- Min Qian
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, 722 West 168th Street, New York, New York, 10032 U.S.A
- Shuang Wang
  - Department of Biostatistics, School of Public Health, Brown University, Providence, Rhode Island, U.S.A

4
Yang L, Chen J. Benchmarking differential abundance analysis methods for correlated microbiome sequencing data. Brief Bioinform 2023; 24:bbac607. [PMID: 36617187 PMCID: PMC9851339 DOI: 10.1093/bib/bbac607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Revised: 11/16/2022] [Accepted: 12/10/2022] [Indexed: 01/09/2023] Open
Abstract
Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Current microbiome studies frequently generate correlated samples from different microbiome sampling schemes such as spatial and temporal sampling. In the past decade, a number of DAA tools for correlated microbiome data (DAA-c) have been proposed. Disturbingly, different DAA-c tools could sometimes produce quite discordant results. To recommend the best practice to the field, we performed the first comprehensive evaluation of existing DAA-c tools using real data-based simulations. Overall, the linear model-based methods LinDA, MaAsLin2 and LDM are more robust than methods based on generalized linear models. The LinDA method is the only method that maintains reasonable performance in the presence of strong compositional effects.
Affiliation(s)
- Lu Yang
  - Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55901, USA
- Jun Chen
  - Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55901, USA

5
Yang L, Chen J. A comprehensive evaluation of microbial differential abundance analysis methods: current status and potential solutions. MICROBIOME 2022; 10:130. [PMID: 35986393 PMCID: PMC9392415 DOI: 10.1186/s40168-022-01320-0] [Citation(s) in RCA: 38] [Impact Index Per Article: 19.0] [Received: 09/09/2021] [Accepted: 07/04/2022] [Indexed: 06/12/2023]
Abstract
BACKGROUND Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Numerous DAA tools have been proposed in the past decade addressing the special characteristics of microbiome data such as zero inflation and compositional effects. Disturbingly, different DAA tools could sometimes produce quite discordant results, opening up the possibility of cherry-picking the tool in favor of one's own hypothesis. To recommend the best DAA tool or practice to the field, a comprehensive evaluation, which covers as many biologically relevant scenarios as possible, is critically needed. RESULTS We performed by far the most comprehensive evaluation of existing DAA tools using real data-based simulations. We found that DAA methods explicitly addressing compositional effects such as ANCOM-BC, Aldex2, metagenomeSeq (fitFeatureModel), and DACOMP did have improved performance in false-positive control. But they are still not optimal: type 1 error inflation or low statistical power has been observed in many settings. The recent LDM method generally had the best power, but its false-positive control in the presence of strong compositional effects was not satisfactory. Overall, none of the evaluated methods is simultaneously robust, powerful, and flexible, which makes the selection of the best DAA tool difficult. To meet the analysis needs, we designed an optimized procedure, ZicoSeq, drawing on the strength of the existing DAA methods. We show that ZicoSeq generally controlled for false positives across settings, and the power was among the highest. Application of DAA methods to a large collection of real datasets revealed a pattern similar to that observed in simulation studies.
CONCLUSIONS Based on the benchmarking study, we conclude that none of the existing DAA methods evaluated can be applied blindly to any real microbiome dataset. The applicability of an existing DAA method depends on specific settings, which are usually unknown a priori. To circumvent the difficulty of selecting the best DAA tool in practice, we design ZicoSeq, which addresses the major challenges in DAA and remedies the drawbacks of existing DAA methods. ZicoSeq can be applied to microbiome datasets from diverse settings and is a useful DAA tool for robust microbiome biomarker discovery.
Affiliation(s)
- Lu Yang
  - Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA
  - Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
- Jun Chen
  - Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA
  - Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA

6
Huang C, Callahan BJ, Wu MC, Holloway ST, Brochu H, Lu W, Peng X, Tzeng JY. Phylogeny-guided microbiome OTU-specific association test (POST). MICROBIOME 2022; 10:86. [PMID: 35668471 PMCID: PMC9171974 DOI: 10.1186/s40168-022-01266-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2021] [Accepted: 04/01/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND The relationship between host conditions and microbiome profiles, typically characterized by operational taxonomic units (OTUs), contains important information about the microbial role in human health. Traditional association testing frameworks are challenged by the high dimensionality and sparsity of typical microbiome profiles. Phylogenetic information is often incorporated to address these challenges with the assumption that evolutionarily similar taxa tend to behave similarly. However, this assumption may not always be valid due to the complex effects of microbes, and phylogenetic information should be incorporated in a data-supervised fashion. RESULTS In this work, we propose a local collapsing test called phylogeny-guided microbiome OTU-specific association test (POST). In POST, whether or not to borrow information and how much information to borrow from the neighboring OTUs in the phylogenetic tree are supervised by phylogenetic distance and the outcome-OTU association. POST is constructed under the kernel machine framework to accommodate complex OTU effects and extends kernel machine microbiome tests from community level to OTU level. Using simulation studies, we show that when the phylogenetic tree is informative, POST has better performance than existing OTU-level association tests. When the phylogenetic tree is not informative, POST achieves performance similar to that of existing methods. Finally, in real data applications on bacterial vaginosis and on preterm birth, we find that POST can identify similar or more outcome-associated OTUs that are of biological relevance compared to existing methods. CONCLUSIONS Using POST, we show that adaptively leveraging the phylogenetic information can enhance the selection performance of associated microbiome features by improving the overall true-positive and false-positive detection. We developed a user-friendly R package, POSTm, which is freely available on CRAN (https://CRAN.R-project.org/package=POSTm).
Affiliation(s)
- Caizhi Huang
  - Bioinformatics Research Center, North Carolina State University, Raleigh, 27606, USA
- Benjamin J Callahan
  - Bioinformatics Research Center, North Carolina State University, Raleigh, 27606, USA
  - Department of Population Health and Pathobiology, North Carolina State University, Raleigh, 27607, USA
- Michael C Wu
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, 98109, USA
- Shannon T Holloway
  - Department of Statistics, North Carolina State University, Raleigh, 27606, USA
- Hayden Brochu
  - Bioinformatics Research Center, North Carolina State University, Raleigh, 27606, USA
  - Department of Molecular Biomedical Sciences, North Carolina State University, Raleigh, 27607, USA
- Wenbin Lu
  - Department of Statistics, North Carolina State University, Raleigh, 27606, USA
- Xinxia Peng
  - Bioinformatics Research Center, North Carolina State University, Raleigh, 27606, USA
  - Department of Molecular Biomedical Sciences, North Carolina State University, Raleigh, 27607, USA
- Jung-Ying Tzeng
  - Bioinformatics Research Center, North Carolina State University, Raleigh, 27606, USA
  - Department of Statistics, North Carolina State University, Raleigh, 27606, USA

7
Zhou H, He K, Chen J, Zhang X. LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome Biol 2022; 23:95. [PMID: 35421994 PMCID: PMC9012043 DOI: 10.1186/s13059-022-02655-5] [Citation(s) in RCA: 70] [Impact Index Per Article: 35.0] [Received: 07/27/2021] [Accepted: 03/14/2022] [Indexed: 12/12/2022] Open
Abstract
Differential abundance analysis is at the core of statistical analysis of microbiome data. The compositional nature of microbiome sequencing data makes false positive control challenging. Here, we show that the compositional effects can be addressed by a simple, yet highly flexible and scalable, approach. The proposed method, LinDA, only requires fitting linear regression models on the centered log-ratio transformed data, and correcting the bias due to compositional effects. We show that LinDA enjoys asymptotic FDR control and can be extended to mixed-effect models for correlated microbiome data. Using simulations and real examples, we demonstrate the effectiveness of LinDA.
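The first ingredient named in the abstract, linear regression on centered log-ratio (CLR) transformed data, can be sketched as follows. This is an illustration under stated assumptions, not the LinDA implementation: zeros are handled with a pseudocount of 0.5 (an assumed choice), the regression has a single covariate, and the bias-correction step that LinDA applies afterwards is omitted. The function names are hypothetical.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample: log counts minus their mean."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def ols_slope(x, y):
    """Least-squares slope of y on a single covariate x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Toy count table: 4 samples (rows) by 3 taxa (columns), with a binary group label.
counts = [[10, 0, 90], [12, 3, 85], [40, 1, 59], [45, 0, 55]]
group = [0, 0, 1, 1]

clr_table = [clr(row) for row in counts]
# One slope per taxon: regress that taxon's CLR values on the group label.
slopes = [ols_slope(group, [row[j] for row in clr_table]) for j in range(3)]
```

Because each CLR-transformed sample sums to zero, the per-taxon slopes are shifted relative to the effects on absolute abundance; this is the compositional bias that the abstract says LinDA corrects.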
Affiliation(s)
- Huijuan Zhou
  - Shanghai University of Finance and Economics, Shanghai, 200437 China
  - Texas A&M University, College Station, 77843 USA
  - Renmin University of China, Beijing, 100872 China
- Kejun He
  - Renmin University of China, Beijing, 100872 China

8
Liu B, Sträuber H, Saraiva J, Harms H, Silva SG, Kasmanas JC, Kleinsteuber S, Nunes da Rocha U. Machine learning-assisted identification of bioindicators predicts medium-chain carboxylate production performance of an anaerobic mixed culture. MICROBIOME 2022; 10:48. [PMID: 35331330 PMCID: PMC8952268 DOI: 10.1186/s40168-021-01219-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2021] [Accepted: 12/17/2021] [Indexed: 05/10/2023]
Abstract
BACKGROUND The ability to quantitatively predict ecophysiological functions of microbial communities provides an important step to engineer microbiota for desired functions related to specific biochemical conversions. Here, we present the quantitative prediction of medium-chain carboxylate production in two continuous anaerobic bioreactors from 16S rRNA gene dynamics in enriched communities. RESULTS By progressively shortening the hydraulic retention time (HRT) from 8 to 2 days with different temporal schemes in two bioreactors operated for 211 days, we achieved higher productivities and yields of the target products n-caproate and n-caprylate. The datasets generated from each bioreactor were applied independently for training and testing machine learning algorithms using 16S rRNA genes to predict n-caproate and n-caprylate productivities. Our dataset consisted of 14 and 40 samples from HRT of 8 and 2 days, respectively. Because of the size and balance of our dataset, we compared linear regression, support vector machine and random forest regression algorithms using the original and balanced datasets generated using synthetic minority oversampling. Further, we performed cross-validation to estimate model stability. Random forest regression was the best algorithm, producing more consistent results with median error rates below 8%. More than 90% accuracy in the prediction of n-caproate and n-caprylate productivities was achieved. Four inferred bioindicators belonging to the genera Olsenella, Lactobacillus, Syntrophococcus and Clostridium IV suggest their relevance to the higher carboxylate productivity at shorter HRT. The recovery of metagenome-assembled genomes of these bioindicators confirmed their genetic potential to perform key steps of medium-chain carboxylate production.
CONCLUSIONS Shortening the hydraulic retention time of the continuous bioreactor systems makes it possible to shape communities with desired chain elongation functions. Using machine learning, we demonstrated that 16S rRNA amplicon sequencing data can be used to predict bioreactor process performance quantitatively and accurately. Characterizing and harnessing bioindicators holds promise to manage reactor microbiota towards selection of the target processes. Our mathematical framework is transferable to other ecosystem processes and microbial systems where community dynamics are linked to key functions. The general methodology used here can be adapted to data types of other functional categories such as genes, transcripts, proteins or metabolites.
Affiliation(s)
- Bin Liu
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Heike Sträuber
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- João Saraiva
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Hauke Harms
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Sandra Godinho Silva
  - Institute for Bioengineering and Biosciences, Department of Bioengineering, Instituto Superior Técnico Universidade de Lisboa, Lisbon, Portugal
- Jonas Coelho Kasmanas
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
  - Department of Computer Science and Interdisciplinary Center of Bioinformatics, University of Leipzig, Leipzig, Germany
- Sabine Kleinsteuber
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Ulisses Nunes da Rocha
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany

9
Zhang L, Wang Y, Chen J, Chen J. RFtest: A Robust and Flexible Community-Level Test for Microbiome Data Powerfully Detects Phylogenetically Clustered Signals. Front Genet 2022; 12:749573. [PMID: 35140735 PMCID: PMC8819960 DOI: 10.3389/fgene.2021.749573] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/29/2021] [Accepted: 11/09/2021] [Indexed: 12/31/2022] Open
Abstract
Random forest is considered as one of the most successful machine learning algorithms, which has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose “Random Forest Test” (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test using the generalization error of random forest as the test statistic. Our simulations demonstrate that RFtest has controlled type I error rates, that its power is superior to competing methods for phylogenetically clustered signals, and that it is robust to outliers and adaptive to interaction effects and non-linear associations. Finally, we apply RFtest to two real microbiome datasets to ascertain whether microbial communities are associated or not with the outcome variables.
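The testing idea in the abstract, a permutation test whose statistic is a model's generalization error, can be sketched generically. The code below is not RFtest: to stay dependency-free it swaps the random forest for a leave-one-out 1-nearest-neighbour error as a stand-in statistic, and all function names are hypothetical. The permutation logic is the part being illustrated: shuffle the outcome labels, recompute the error, and count how often the permuted error is at least as low as the observed one.

```python
import random


def loo_1nn_error(X, y):
    """Leave-one-out 1-nearest-neighbour error (stand-in for a random forest's error)."""
    errors = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if j == i:
                continue
            d = sum((a - b) ** 2 for a, b in zip(xi, xj))
            if d < best_d:
                best_j, best_d = j, d
        errors += int(y[best_j] != y[i])
    return errors / len(X)


def permutation_pvalue(X, y, statistic, n_perm=199, seed=0):
    """Return (observed error, p-value); a lower error counts as more extreme."""
    rng = random.Random(seed)
    observed = statistic(X, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break any link between features and outcome
        if statistic(X, y_perm) <= observed:
            hits += 1
    # Adding 1 to numerator and denominator keeps the p-value strictly positive.
    return observed, (1 + hits) / (1 + n_perm)


# Toy data: one feature, two well-separated groups (made-up numbers).
X = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
observed_error, p_value = permutation_pvalue(X, y, loo_1nn_error)
```

Any function with the same `(X, y)` signature, such as one returning a random forest's out-of-bag error, could be passed as `statistic` in place of the stand-in.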
Affiliation(s)
- Lujun Zhang
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, United States
  - Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, China
- Yanshan Wang
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, United States
- Jingwen Chen
  - Department of General Surgery, Zhongshan Hospital, Fudan University, Shanghai, China
  - *Correspondence: Jingwen Chen; Jun Chen
- Jun Chen
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
  - *Correspondence: Jingwen Chen; Jun Chen

10
Revers A, Zhang X, Zwinderman AH. A Bayesian Negative Binomial Hierarchical Model for Identifying Diet-Gut Microbiome Associations. Front Microbiol 2021; 12:711861. [PMID: 34690956 PMCID: PMC8529249 DOI: 10.3389/fmicb.2021.711861] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2021] [Accepted: 08/20/2021] [Indexed: 11/13/2022] Open
Abstract
The human gut microbiota composition plays an important role in human health. Long-term diet intervention may shape human gut microbiome. Therefore, many studies focus on discovering links between long-term diets and gut microbiota composition. This study aimed to incorporate the phylogenetic relationships between the operational taxonomic units (OTUs) into the diet-microbe association analysis, using a Bayesian hierarchical negative binomial (NB) model. We regularized the dispersion parameter of the negative binomial distribution by assuming a mean-dispersion association. A simulation study showed that, if over-dispersion is present in the microbiome data, our approach performed better in terms of mean squared error (MSE) of the slope-estimates compared to the standard NB regression model or a Bayesian hierarchical NB model without including the phylogenetic relationships. Data of the Healthy Life in an Urban Setting (HELIUS) study showed that for some phylogenetic families the (posterior) variances of the slope-estimates were decreasing when including the phylogenetic relationships into the analyses. In contrast, when OTUs of the same family were not similarly affected by the food item, some bias was introduced, leading to larger (posterior) variances of the slope-estimates. Overall, the Bayesian hierarchical NB model, with a dependency between the mean and dispersion parameters, proved to be a robust method for analyzing diet-microbe associations.
Affiliation(s)
- Alma Revers
  - Department of Epidemiology and Data Science, Amsterdam University Medical Center, Amsterdam, Netherlands
- Xiang Zhang
  - Theoretical Biology and Bioinformatics, Department of Biology, Utrecht University, Utrecht, Netherlands
- Aeilko H. Zwinderman
  - Department of Epidemiology and Data Science, Amsterdam University Medical Center, Amsterdam, Netherlands

11
Bien J, Yan X, Simpson L, Müller CL. Tree-aggregated predictive modeling of microbiome data. Sci Rep 2021; 11:14505. [PMID: 34267244 PMCID: PMC8282688 DOI: 10.1038/s41598-021-93645-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 10/15/2020] [Accepted: 06/22/2021] [Indexed: 01/05/2023] Open
Abstract
Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
Affiliation(s)
- Jacob Bien
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA
- Léo Simpson
  - Technische Universität München, Munich, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Christian L Müller
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
  - Center for Computational Mathematics, Flatiron Institute, Simons Foundation, New York, NY, USA

12
Goren E, Wang C, He Z, Sheflin AM, Chiniquy D, Prenni JE, Tringe S, Schachtman DP, Liu P. Feature selection and causal analysis for microbiome studies in the presence of confounding using standardization. BMC Bioinformatics 2021; 22:362. [PMID: 34229628 PMCID: PMC8261956 DOI: 10.1186/s12859-021-04232-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/31/2021] [Accepted: 06/03/2021] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND Microbiome studies have uncovered associations between microbes and human, animal, and plant health outcomes. This has led to an interest in developing microbial interventions for treatment of disease and optimization of crop yields which requires identification of microbiome features that impact the outcome in the population of interest. That task is challenging because of the high dimensionality of microbiome data and the confounding that results from the complex and dynamic interactions among host, environment, and microbiome. In the presence of such confounding, variable selection and estimation procedures may have unsatisfactory performance in identifying microbial features with an effect on the outcome. RESULTS In this manuscript, we aim to estimate population-level effects of individual microbiome features while controlling for confounding by a categorical variable. Due to the high dimensionality and confounding-induced correlation between features, we propose feature screening, selection, and estimation conditional on each stratum of the confounder followed by a standardization approach to estimation of population-level effects of individual features. Comprehensive simulation studies demonstrate the advantages of our approach in recovering relevant features. Utilizing a potential-outcomes framework, we outline assumptions required to ascribe causal, rather than associational, interpretations to the identified microbiome effects. We conducted an agricultural study of the rhizosphere microbiome of sorghum in which nitrogen fertilizer application is a confounding variable. In this study, the proposed approach identified microbial taxa that are consistent with biological understanding of potential plant-microbe interactions. 
CONCLUSIONS Standardization enables more accurate identification of individual microbiome features with an effect on the outcome of interest compared to other variable selection and estimation procedures when there is confounding by a categorical variable.
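The standardization step described above can be illustrated with the generic textbook formula: stratum-specific effect estimates are averaged with weights equal to each stratum's share of the population. The sketch below assumes this generic form only; the abstract does not give the authors' exact estimator, and the function name, stratum labels, and numbers are hypothetical.

```python
def standardized_effect(stratum_effects, stratum_sizes):
    """Average stratum-specific effects, weighted by each stratum's population share."""
    total = sum(stratum_sizes.values())
    return sum(
        stratum_effects[s] * stratum_sizes[s] / total
        for s in stratum_effects
    )


# Hypothetical effect estimates of one microbial feature within each level
# of a categorical confounder (e.g. fertilizer treatment), with stratum sizes.
effects = {"low_N": 0.2, "high_N": 0.6}
sizes = {"low_N": 30, "high_N": 10}

population_effect = standardized_effect(effects, sizes)
print(round(population_effect, 3))
```

Because the estimate is computed within strata first, association induced purely by the confounder's distribution does not leak into the population-level summary.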
Affiliation(s)
- Emily Goren
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
- Chong Wang
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
  - Department of Veterinary Diagnostic and Production Animal Medicine, Iowa State University, 2203 Lloyd Veterinary Medical Center, Ames, IA, 50011, USA
- Zhulin He
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
- Amy M Sheflin
  - Department of Horticulture and Landscape Architecture, Colorado State University, 301 University Ave, Fort Collins, CO, 80523, USA
- Dawn Chiniquy
  - Department of Energy, Joint Genome Institute, 2800 Mitchell Dr, Walnut Creek, CA, 94598, USA
- Jessica E Prenni
  - Department of Horticulture and Landscape Architecture, Colorado State University, 301 University Ave, Fort Collins, CO, 80523, USA
- Susannah Tringe
  - Department of Energy, Joint Genome Institute, 2800 Mitchell Dr, Walnut Creek, CA, 94598, USA
- Daniel P Schachtman
  - Department of Agronomy and Horticulture, University of Nebraska, 1825 N 38th St, Lincoln, NE, 68583, USA
- Peng Liu
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
13
Jiang R, Li WV, Li JJ. mbImpute: an accurate and robust imputation method for microbiome data. Genome Biol 2021; 22:192. [PMID: 34183041 PMCID: PMC8240317 DOI: 10.1186/s13059-021-02400-4] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 02/12/2021] [Accepted: 06/04/2021] [Indexed: 12/22/2022] Open
Abstract
A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability of scientific discoveries. To address this issue, we propose the first imputation method for microbiome data, mbImpute, to identify and recover likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. We demonstrate that mbImpute improves the power of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer, and mbImpute preserves non-zero distributions of taxa abundances.
Affiliation(s)
- Ruochen Jiang
  - Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
- Wei Vivian Li
  - Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Piscataway, 08854, NJ, USA
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, 90095-7088, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, 90095-1766, CA, USA
  - Department of Biostatistics, University of California, Los Angeles, 90095-1772, CA, USA
14
Sharma D, Paterson AD, Xu W. TaxoNN: ensemble of neural networks on stratified microbiome data for disease prediction. Bioinformatics 2021; 36:4544-4550. [PMID: 32449747 PMCID: PMC7750934 DOI: 10.1093/bioinformatics/btaa542] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 03/10/2020] [Revised: 05/08/2020] [Accepted: 05/19/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Research supports the potential use of microbiome as a predictor of some diseases. Motivated by the findings that microbiome data is complex in nature, and there is an inherent correlation due to hierarchical taxonomy of microbial Operational Taxonomic Units (OTUs), we propose a novel machine learning method incorporating a stratified approach to group OTUs into phylum clusters. Convolutional Neural Networks (CNNs) were used to train within each of the clusters individually. Further, through an ensemble learning approach, features obtained from each cluster were then concatenated to improve prediction accuracy. Our two-step approach comprising stratification prior to combining multiple CNNs, aided in capturing the relationships between OTUs sharing a phylum efficiently, as compared to using a single CNN ignoring OTU correlations. RESULTS We used simulated datasets containing 168 OTUs in 200 cases and 200 controls for model testing. Thirty-two OTUs, potentially associated with risk of disease were randomly selected and interactions between three OTUs were used to introduce non-linearity. We also implemented this novel method in two human microbiome studies: (i) Cirrhosis with 118 cases, 114 controls; (ii) type 2 diabetes (T2D) with 170 cases, 174 controls; to demonstrate the model's effectiveness. Extensive experimentation and comparison against conventional machine learning techniques yielded encouraging results. We obtained mean AUC values of 0.88, 0.92, 0.75, showing a consistent increment (5%, 3%, 7%) in simulations, Cirrhosis and T2D data, respectively, against the next best performing method, Random Forest. AVAILABILITY AND IMPLEMENTATION https://github.com/divya031090/TaxoNN_OTU. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Divya Sharma
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada M5T 3M7
- Andrew D Paterson
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada M5T 3M7
  - Genetics and Genome Biology Program, The Hospital for Sick Children, Toronto, ON, Canada, M5G 1X8
- Wei Xu
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada M5T 3M7
  - Department of Biostatistics, Princess Margaret Cancer Center, University Health Network, Toronto, ON, Canada, M5G 2C1
15
Liu L, Gu H, Van Limbergen J, Kenney T. SuRF: A new method for sparse variable selection, with application in microbiome data analysis. Stat Med 2020; 40:897-919. [PMID: 33219557 DOI: 10.1002/sim.8809] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/23/2019] [Revised: 10/25/2020] [Accepted: 10/27/2020] [Indexed: 01/16/2023]
Abstract
In this article, we present a new variable selection method for regression and classification purposes, particularly for microbiome analysis. Our method, called subsampling ranking forward selection (SuRF), is based on LASSO penalized regression, subsampling and forward-selection methods. SuRF offers major advantages over existing variable selection methods in terms of both sparsity of selected models and model inference. We provide an R package that can implement our method for generalized linear models. We apply our method to classification problems from microbiome data, using a novel agglomeration approach to deal with the special tree-like correlation structure of the variables. Existing methods arbitrarily choose a taxonomic level a priori before performing the analysis, whereas by combining SuRF with these aggregated variables, we are able to identify the key biomarkers at the appropriate taxonomic level, as suggested by the data. We present simulations in multiple sparse settings to demonstrate that our approach performs better than several other popularly used existing approaches in recovering the true variables. We apply SuRF to two microbiome datasets: one about prediction of pouchitis and another for identifying samples from two healthy individuals. We find that SuRF can provide a better or comparable prediction with other methods while controlling the false positive rate of variable selection.
Affiliation(s)
- Lihui Liu
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
- Hong Gu
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
- Johan Van Limbergen
  - Department of Pediatrics, Dalhousie University, Halifax, Nova Scotia, Canada
- Toby Kenney
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
16
Dong M, Li L, Chen M, Kusalik A, Xu W. Predictive analysis methods for human microbiome data with application to Parkinson's disease. PLoS One 2020; 15:e0237779. [PMID: 32834004 PMCID: PMC7446854 DOI: 10.1371/journal.pone.0237779] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 11/25/2019] [Accepted: 08/03/2020] [Indexed: 12/22/2022] Open
Abstract
Microbiome data consists of operational taxonomic unit (OTU) counts characterized by zero-inflation, over-dispersion, and grouping structure among samples. Currently, statistical testing methods are commonly performed to identify OTUs that are associated with a phenotype. The limitations of statistical testing methods include that the validity of p-values/q-values depends sensitively on the correctness of models and that statistical significance does not necessarily imply predictivity. Predictive analysis using methods such as LASSO is an alternative approach for identifying associated OTUs and for measuring the predictability of the phenotype variable with OTUs and other covariate variables. We investigate three strategies of performing predictive analysis: (1) LASSO: fitting a LASSO multinomial logistic regression model to all OTU counts with specific transformation; (2) screening+GLM: screening OTUs with q-values returned by fitting a GLMM to each OTU, then fitting a GLM model using a subset of selected OTUs; (3) screening+LASSO: fitting a LASSO to a subset of OTUs selected with GLMM. We have conducted empirical studies using three simulation datasets generated using Dirichlet-multinomial models and a real gut microbiome dataset related to Parkinson’s disease to investigate the performance of the three strategies for predictive analysis. Our simulation studies show that LASSO with appropriate variable transformation performs remarkably well on zero-inflated data. Our results of real data analysis show that Parkinson’s disease can be predicted based on selected OTUs after the binary transformation, age, and sex with high accuracy (Error Rate = 0.199, AUC = 0.872, AUPRC = 0.912). These results provide strong evidence of the relationship between Parkinson’s disease and the gut microbiome.
Affiliation(s)
- Mei Dong
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Longhai Li
  - Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, SK, Canada
- Man Chen
  - Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, SK, Canada
- Anthony Kusalik
  - Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
- Wei Xu
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
  - Department of Biostatistics, Princess Margaret Hospital, Toronto, ON, Canada
17
Pelpolage SW, Yoshida A, Nagata R, Shimada K, Fukuma N, Bochimoto H, Hamamoto T, Hoshizawa M, Nakano K, Han KH, Fukushima M. Frozen Autoclaved Sorghum Enhanced Colonic Fermentation and Lower Visceral Fat Accumulation in Rats. Nutrients 2020; 12:E2412. [PMID: 32806549 PMCID: PMC7570106 DOI: 10.3390/nu12082412] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/14/2020] [Revised: 08/05/2020] [Accepted: 08/10/2020] [Indexed: 01/09/2023] Open
Abstract
As raw sorghum is not able to influence considerable colonic fermentation despite its higher resistant starch (RS) content, our study aimed to investigate the effects of frozen autoclaved sorghum on colonic fermentation. Fischer 344 rats were fed frozen cooked refined (S-Rf) and whole (S-Wh) sorghum diets and were compared against α-corn starch (CON) and high amylose starch (HAS) fed rats for zoometric parameters, cecal biochemical and microbiological parameters. Sorghum fed rats exhibited significantly lower feed intake and visceral adipose tissue mass compared to CON. Bacterial alpha diversity was significantly higher in the sorghum fed rats compared to HAS and the two sorghum fed groups clustered together, separately from HAS and CON in the beta diversity plot. Serum non-High Density Lipoprotein cholesterol and total cholesterol in S-Rf group were significantly lower compared to CON, while total fecal bile excretion was also significantly higher in the two sorghum fed groups. Lower visceral adiposity was correlated with lower feed intake, RS content ingested and cecal short chain fatty acid (SCFA) contents. Thus, higher RS inflow to the colon via frozen autoclaved sorghum might have influenced colonic fermentation of RS and the resultant SCFA might have influenced lower adiposity as manifested by the lower body weight gain.
Affiliation(s)
- Samanthi W. Pelpolage
  - Department of Life and Food Sciences, Obihiro University of Agriculture and Veterinary Medicine, West 2-11, Inada, Obihiro 080-8555, Hokkaido, Japan
- Atsushi Yoshida
  - Department of Life and Food Sciences, Obihiro University of Agriculture and Veterinary Medicine, West 2-11, Inada, Obihiro 080-8555, Hokkaido, Japan
- Ryuji Nagata
  - Department of Life and Food Sciences, Obihiro University of Agriculture and Veterinary Medicine, West 2-11, Inada, Obihiro 080-8555, Hokkaido, Japan
- Kenichiro Shimada
  - Department of Life and Food Sciences, Obihiro University of Agriculture and Veterinary Medicine, West 2-11, Inada, Obihiro 080-8555, Hokkaido, Japan
- Naoki Fukuma
  - Department of Life and Food Sciences, Obihiro University of Agriculture and Veterinary Medicine, West 2-11, Inada, Obihiro 080-8555, Hokkaido, Japan
  - Research Center for Global Agromedicine, Obihiro University of Agriculture and Veterinary Medicine, West 2-11, Inada, Obihiro 080-8555, Hokkaido, Japan
- Hiroki Bochimoto
  - Division of Aerospace Medicine, Department of Cell Physiology, The Jikei University School of Medicine, 3-25-8 Nishishimbashi, Minatoku, Tokyo 105-8461, Japan
- Tetsuo Hamamoto
  - U.S. Grains Council, 11th Floor, Toranomon Denki Building No. 3, 1-2-20 Toranomon, Minato-ku, Tokyo 105-0001, Japan
- Michiyo Hoshizawa
  - U.S. Grains Council, 11th Floor, Toranomon Denki Building No. 3, 1-2-20 Toranomon, Minato-ku, Tokyo 105-0001, Japan
- Koichi Nakano
  - Nakano Industry Co., Asahishinmachi 33-25 Takamatsu, Kagawa 760-0064, Japan
- Kyu-Ho Han
  - Department of Life and Food Sciences, Obihiro University of Agriculture and Veterinary Medicine, West 2-11, Inada, Obihiro 080-8555, Hokkaido, Japan
  - Research Center for Global Agromedicine, Obihiro University of Agriculture and Veterinary Medicine, West 2-11, Inada, Obihiro 080-8555, Hokkaido, Japan
- Michihiro Fukushima
  - Department of Life and Food Sciences, Obihiro University of Agriculture and Veterinary Medicine, West 2-11, Inada, Obihiro 080-8555, Hokkaido, Japan
18
Zhang L, Shi Y, Jenq RR, Do KA, Peterson CB. Bayesian compositional regression with structured priors for microbiome feature selection. Biometrics 2020; 77:824-838. [PMID: 32686846 DOI: 10.1111/biom.13335] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/07/2019] [Accepted: 07/13/2020] [Indexed: 01/10/2023]
Abstract
The microbiome plays a critical role in human health and disease, and there is a strong scientific interest in linking specific features of the microbiome to clinical outcomes. There are key aspects of microbiome data, however, that limit the applicability of standard variable selection methods. In particular, the observed data are compositional, as the counts within each sample have a fixed-sum constraint. In addition, microbiome features, typically quantified as operational taxonomic units, often reflect microorganisms that are similar in function, and may therefore have a similar influence on the response variable. To address the challenges posed by these aspects of the data structure, we propose a variable selection technique with the following novel features: a generalized transformation and z-prior to handle the compositional constraint, and an Ising prior that encourages the joint selection of microbiome features that are closely related in terms of their genetic sequence similarity. We demonstrate that our proposed method outperforms existing penalized approaches for microbiome variable selection in both simulation and the analysis of real data exploring the relationship of the gut microbiome to body mass index.
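To make the fixed-sum (compositional) constraint concrete, the sketch below applies the standard centered log-ratio (CLR) transform, a common way to move compositional counts off the simplex. It is a swapped-in illustration and not the "generalized transformation" proposed in the abstract, whose form is not given here; the pseudocount default of 0.5 is an assumption chosen only to handle zero counts.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical OTU counts for one sample, including a zero.
sample = [0, 12, 30, 8]
transformed = clr(sample)
print([round(v, 3) for v in transformed])

# Without a pseudocount, CLR is invariant to sequencing depth:
# multiplying every count by the same factor leaves the result unchanged.
shallow = clr([1, 2, 4], pseudocount=0)
deep = clr([10, 20, 40], pseudocount=0)
```

The transformed values always sum to zero, which is the constraint a downstream regression model has to respect when working with compositional predictors.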
Affiliation(s)
- Liangliang Zhang
  - Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas
- Yushu Shi
  - Department of Statistics, University of Missouri, Columbia, Missouri
- Robert R Jenq
  - Department of Genomic Medicine, University of Texas MD Anderson Cancer Center, Houston, Texas
- Kim-Anh Do
  - Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas
- Christine B Peterson
  - Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas
19
Crawford J, Greene CS. Incorporating biological structure into machine learning models in biomedicine. Curr Opin Biotechnol 2020; 63:126-134. [PMID: 31962244 PMCID: PMC7308204 DOI: 10.1016/j.copbio.2019.12.021] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 10/17/2019] [Revised: 12/17/2019] [Accepted: 12/19/2019] [Indexed: 12/19/2022]
Abstract
In biomedical applications of machine learning, relevant information often has a rich structure that is not easily encoded as real-valued predictors. Examples of such data include DNA or RNA sequences, gene sets or pathways, gene interaction or coexpression networks, ontologies, and phylogenetic trees. We highlight recent examples of machine learning models that use structure to constrain model architecture or incorporate structured data into model training. For machine learning in biomedicine, where sample size is limited and model interpretability is crucial, incorporating prior knowledge in the form of structured data can be particularly useful. The area of research would benefit from performant open source implementations and independent benchmarking efforts.
Affiliation(s)
- Jake Crawford
  - Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
  - Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, United States
20
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
  - Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States
21
Wang Y, Bhattacharya T, Jiang Y, Qin X, Wang Y, Liu Y, Saykin AJ, Chen L. A novel deep learning method for predictive modeling of microbiome data. Brief Bioinform 2020; 22:5835556. [PMID: 32406914 DOI: 10.1093/bib/bbaa073] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 11/17/2019] [Revised: 02/22/2020] [Accepted: 04/10/2020] [Indexed: 12/22/2022] Open
Abstract
With the development and decreasing cost of next-generation sequencing technologies, the study of the human microbiome has become a rapidly expanding research field, which provides an unprecedented opportunity in various clinical applications such as drug response predictions and disease diagnosis. It is thus essential and desirable to build a prediction model for clinical outcomes based on microbiome data that usually consist of taxon abundance and a phylogenetic tree. Importantly, all microbial species are not uniformly distributed in the phylogenetic tree but tend to be clustered at different phylogenetic depths. Therefore, the phylogenetic tree represents a unique correlation structure of microbiome, which can be an important prior to improve the prediction performance. However, prediction methods that consider the phylogenetic tree in an efficient and rigorous way are under-developed. Here, we develop a novel deep learning prediction method MDeep (microbiome-based deep learning method) to predict both continuous and binary outcomes. Conceptually, MDeep designs convolutional layers to mimic taxonomic ranks with multiple convolutional filters on each convolutional layer to capture the phylogenetic correlation among microbial species in a local receptive field and maintain the correlation structure across different convolutional layers via feature mapping. Taken together, the convolutional layers with their built-in convolutional filters capture microbial signals at different taxonomic levels while encouraging local smoothing and preserving local connectivity induced by the phylogenetic tree. We use both simulation studies and real data applications to demonstrate that MDeep outperforms competing methods in both regression and binary classifications. Availability and Implementation: MDeep software is available at https://github.com/lichen-lab/MDeep Contact: chen61@iu.edu.
22
Bichat A, Plassais J, Ambroise C, Mariadassou M. Incorporating Phylogenetic Information in Microbiome Differential Abundance Studies Has No Effect on Detection Power and FDR Control. Front Microbiol 2020; 11:649. [PMID: 32351481 PMCID: PMC7174607 DOI: 10.3389/fmicb.2020.00649] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 08/02/2019] [Accepted: 03/20/2020] [Indexed: 12/18/2022] Open
Abstract
We consider the problem of incorporating evolutionary information (e.g., taxonomic or phylogenetic trees) in the context of metagenomics differential analysis. Recent results published in the literature propose different ways to leverage the tree structure to increase the detection rate of differentially abundant taxa. Here, we propose instead to use a different hierarchical structure, in the form of a correlation-based tree, as it may capture the structure of the data better than the phylogeny. We first show that the correlation tree and the phylogeny are significantly different before turning to the impact of tree choice on detection rates. Using synthetic data, we show that the tree does have an impact: smoothing p-values according to the phylogeny leads to equal or inferior rates as smoothing according to the correlation tree. However, both trees are outperformed by the classical, non-hierarchical, Benjamini–Hochberg (BH) procedure in terms of detection rates. Other procedures may use the hierarchical structure with profit but do not control the False Discovery Rate (FDR) a priori and remain inferior to a classical Benjamini–Hochberg procedure with the same nominal FDR. On real datasets, no hierarchical procedure had a significantly higher detection rate than BH. Intuition advocates that the use of hierarchical structures should increase the detection rate of differentially abundant taxa in microbiome studies. However, our results suggest that current hierarchical procedures are still inferior to standard methods and more effective procedures remain to be invented.
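The classical Benjamini–Hochberg baseline that the abstract compares against is the standard step-up procedure, sketched below. The function name and the example p-values are hypothetical; only the procedure itself (sort the p-values, find the largest rank k with p_(k) <= alpha * k / m, reject the k smallest) is the well-known BH rule.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank k whose ordered p-value is at or below alpha * k / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical per-taxon p-values from a differential abundance test.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
decisions = benjamini_hochberg(p_values, alpha=0.05)
print(decisions)
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value further down the ordering meets its threshold.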
Affiliation(s)
- Antoine Bichat
  - LaMME, Université Paris-Saclay, CNRS, Université d'Évry val d'Essonne, Évry, France
  - Enterome, Paris, France
- Christophe Ambroise
  - LaMME, Université Paris-Saclay, CNRS, Université d'Évry val d'Essonne, Évry, France
23
Martinez S, Garcia JG, Williams R, Elmassry M, West A, Hamood A, Hurtado D, Gudenkauf B, Ventolini G, Schlabritz-Loutsevitch N. Lactobacilli spp.: real-time evaluation of biofilm growth. BMC Microbiol 2020; 20:64. [PMID: 32209050 PMCID: PMC7092459 DOI: 10.1186/s12866-020-01753-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/04/2019] [Accepted: 03/13/2020] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND Biofilm is a fundamental bacterial survival mode which proceeds through three main generalized phases: adhesion, maturation, and dispersion. Lactobacilli spp. (LB) are critical components of gut and reproductive health and are widely used probiotics. Evaluation of time-dependent mechanisms of biofilm formation is important for understanding of host-microbial interaction and development of therapeutic interventions. Time-dependent LB biofilm growth was studied in two systems: a continuous flow system with large biofilm output (microfermenter (M), Institute Pasteur, France) and an electrical impedance-based real-time label-free cell analyzer (C) (xCELLigence, ACEA Bioscience Inc., San Diego, CA). L. plantarum biofilm growth in the M system was video-recorded, followed by analyses using IMARIS software (Bitplane, Oxford Instrument Company, Concord, MA, USA). Additionally, whole genome expression and analyses of attached (A) and dispersed (D) biofilm phases at 24 and 48 h were performed. RESULTS The dynamics of biofilm growth of L. plantarum were similar in both systems except for D phases. Comparison of the transcriptome of A and D phases revealed that 121 transcripts differed between the two phases at 24 h and 35 transcripts at 48 h of M growth. The main pathways down-regulated in A compared to D phases after 24 h were transcriptional regulation, purine nucleotide biosynthesis, and L-aspartate biosynthesis, and the upregulated pathways were fatty acid and phospholipid metabolism as well as ABC transporters and purine nucleotide biosynthesis. Four LB species differed in the duration and amplitude of attachment phases, while growth phases were similar. CONCLUSION LB spp. biofilm growth and propagation are dynamic, time-dependent processes with species-specific and time-specific characteristics. The dynamics of LB biofilm growth agree with published pathophysiological data and point out that real-time evaluation is an important tool in understanding growth of microbial communities.
Affiliation(s)
- Stacy Martinez
- Texas Tech University Health Sciences Center at the Permian Basin, 701 W. 5th Street, Odessa, TX, 79763, USA
- Jonathan Gomez Garcia
- Texas Tech University Health Sciences Center at the Permian Basin, 701 W. 5th Street, Odessa, TX, 79763, USA; University of Texas at the Permian Basin, Odessa, TX, USA
- Roy Williams
- Texas Tech University Health Sciences Center at the Permian Basin, 701 W. 5th Street, Odessa, TX, 79763, USA; University of Texas at the Permian Basin, Odessa, TX, USA
- Moamen Elmassry
- Department of Biological Sciences, Texas Tech University, Lubbock, TX, USA
- Andrew West
- Texas Tech University Health Sciences Center at the Permian Basin, 701 W. 5th Street, Odessa, TX, 79763, USA
- Abdul Hamood
- Department of Microbiology and Immunology, Texas Tech University Health Sciences Center, Lubbock, TX, USA
- Brent Gudenkauf
- Texas Tech University Health Sciences Center at the Permian Basin, 701 W. 5th Street, Odessa, TX, 79763, USA
- Gary Ventolini
- Texas Tech University Health Sciences Center at the Permian Basin, 701 W. 5th Street, Odessa, TX, 79763, USA
- Natalia Schlabritz-Loutsevitch
- Texas Tech University Health Sciences Center at the Permian Basin, 701 W. 5th Street, Odessa, TX, 79763, USA; Department of Neurobiology and Pharmacology, Texas Tech University Health Sciences Center, Lubbock, TX, USA
24
Xiao J, Chen L, Yu Y, Zhang X, Chen J. A Phylogeny-Regularized Sparse Regression Model for Predictive Modeling of Microbial Community Data. Front Microbiol 2018; 9:3112. [PMID: 30619188 PMCID: PMC6305753 DOI: 10.3389/fmicb.2018.03112] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 08/31/2018] [Accepted: 12/03/2018] [Indexed: 12/16/2022]
Abstract
Fueled by technological advancement, there has been a surge of human microbiome studies surveying the microbial communities associated with the human body and their links with health and disease. As a complement to the human genome, the human microbiome holds great potential for precision medicine. Efficient predictive models based on microbiome data could be potentially used in various clinical applications such as disease diagnosis, patient stratification and drug response prediction. One important characteristic of the microbial community data is the phylogenetic tree that relates all the microbial taxa based on their evolutionary history. The phylogenetic tree is an informative prior for more efficient prediction since the microbial community changes are usually not randomly distributed on the tree but tend to occur in clades at varying phylogenetic depths (clustered signal). Although community-wide changes are possible for some conditions, it is also likely that the community changes are only associated with a small subset of "marker" taxa (sparse signal). Unfortunately, predictive models of microbial community data taking into account both the sparsity and the tree structure remain under-developed. In this paper, we propose a predictive framework to exploit sparse and clustered microbiome signals using a phylogeny-regularized sparse regression model. Our approach is motivated by evolutionary theory, where a natural correlation structure among microbial taxa exists according to the phylogenetic relationship. A novel phylogeny-based smoothness penalty is proposed to smooth the coefficients of the microbial taxa with respect to the phylogenetic tree. Using simulated and real datasets, we show that our method achieves better prediction performance than competing sparse regression methods for sparse and clustered microbiome signals.
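The abstract does not give the form of the penalty, so the following is only an illustrative sketch of how a sparsity penalty and a phylogeny-based smoothness penalty might be combined in a linear model. It assumes, purely for illustration, that the smoothness term is a quadratic form in the inverse of a correlation matrix C = exp(-2 * rho * D) built from a patristic distance matrix D, and that sparsity comes from an L1 penalty solved by proximal gradient descent. The function names, the parameters `lam1`, `lam2`, and `rho`, and the solver are all hypothetical and are not taken from the paper.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1 (elementwise shrinkage toward zero)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def phylo_precision(D, rho=1.0, jitter=1e-8):
    """Assumed form: C = exp(-2 * rho * D); return the symmetrized inverse of C."""
    C = np.exp(-2.0 * rho * np.asarray(D, dtype=float))
    omega = np.linalg.inv(C + jitter * np.eye(C.shape[0]))
    return (omega + omega.T) / 2.0

def objective(beta, X, y, omega, lam1, lam2):
    """Squared-error loss + L1 sparsity term + quadratic phylogenetic smoothness term."""
    n = X.shape[0]
    resid = y - X @ beta
    smooth = 0.5 * lam2 * beta @ omega @ beta
    return (resid @ resid) / (2.0 * n) + smooth + lam1 * np.abs(beta).sum()

def fit_phylo_sparse(X, y, D, lam1=0.05, lam2=0.05, rho=1.0, n_iter=500):
    """Minimize the objective above by proximal gradient descent (ISTA)."""
    n, p = X.shape
    omega = phylo_precision(D, rho)
    H = X.T @ X / n + lam2 * omega           # Hessian of the smooth part
    step = 1.0 / np.linalg.eigvalsh(H).max()  # 1 / Lipschitz constant of the gradient
    b = X.T @ y / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = H @ beta - b
        beta = soft_threshold(beta - step * grad, step * lam1)
    return beta
```

Under these assumptions, coefficients of closely related taxa (small entries in D) are pulled toward each other by the quadratic term, while the L1 term sets weak coefficients to zero, which matches the "sparse and clustered" signal the abstract describes.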
Affiliation(s)
- Jian Xiao
- Division of Biomedical Statistics and Informatics, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States; School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China
- Li Chen
- Department of Health Outcomes Research and Policy, Harrison School of Pharmacy, Auburn University, Auburn, AL, United States
- Yue Yu
- Division of Biomedical Statistics and Informatics, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States
- Xianyang Zhang
- Department of Statistics, Texas A&M University, College Station, TX, United States
- Jun Chen
- Division of Biomedical Statistics and Informatics, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States