1
LeBlanc P, Ma L. Microbiome subcommunity learning with logistic-tree normal latent Dirichlet allocation. Biometrics 2023; 79:2321-2332. [PMID: 36222326 PMCID: PMC10090221 DOI: 10.1111/biom.13772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2021] [Accepted: 09/26/2022] [Indexed: 11/28/2022]
Abstract
Mixed-membership (MM) models such as latent Dirichlet allocation (LDA) have been applied to microbiome compositional data to identify latent subcommunities of microbial species. These subcommunities are informative for understanding the biological interplay of microbes and for predicting health outcomes. However, microbiome compositions typically display substantial cross-sample heterogeneity in subcommunity compositions (that is, variability in the proportions of microbes in shared subcommunities across samples), which is not accounted for in prior analyses. As a result, LDA can produce inference that is highly sensitive to the specification of the number of subcommunities and often divides a single subcommunity into multiple artificial ones. To address this limitation, we incorporate the logistic-tree normal (LTN) model into LDA to form a new MM model. This model allows cross-sample variation in the composition of each subcommunity around some "centroid" composition that defines the subcommunity. Incorporation of auxiliary Pólya-Gamma variables enables a computationally efficient collapsed blocked Gibbs sampler to carry out Bayesian inference under this model. By accounting for such heterogeneity, our new model restores the robustness of the inference to the specification of the number of subcommunities and allows meaningful subcommunities to be identified.
Affiliation(s)
- Patrick LeBlanc
- Department of Statistical Sciences, Duke University, Durham, North Carolina, USA
- Li Ma
- Department of Statistical Sciences, Duke University, Durham, North Carolina, USA
- Department of Biostatistics and Bioinformatics, Duke University Medical School, Durham, North Carolina, USA
2
Hong Q, Chen G, Tang ZZ. PhyloMed: a phylogeny-based test of mediation effect in microbiome. Genome Biol 2023; 24:72. [PMID: 37041566 PMCID: PMC10088256 DOI: 10.1186/s13059-023-02902-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2022] [Accepted: 03/15/2023] [Indexed: 04/13/2023] Open
Abstract
Microbiome data from sequencing experiments contain the relative abundance of a large number of microbial taxa with their evolutionary relationships represented by a phylogenetic tree. The compositional and high-dimensional nature of the microbiome mediator challenges the validity of standard mediation analyses. We propose a phylogeny-based mediation analysis method called PhyloMed to address this challenge. Unlike existing methods that directly identify individual mediating taxa, PhyloMed discovers mediation signals by analyzing subcompositions defined on the phylogenetic tree. PhyloMed produces well-calibrated mediation test p-values and yields substantially higher discovery power than existing methods.
Affiliation(s)
- Qilin Hong
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53715, USA
- Guanhua Chen
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53715, USA
- Zheng-Zheng Tang
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53715, USA
3
Pedone M, Amedei A, Stingo FC. Subject-specific Dirichlet-multinomial regression for multi-district microbiota data analysis. Ann Appl Stat 2023. [DOI: 10.1214/22-aoas1641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]
Affiliation(s)
- Matteo Pedone
- Department of Statistics, Computer Science, Applications, University of Florence
- Amedeo Amedei
- Department of Clinical and Experimental Medicine, University of Florence
- Francesco C. Stingo
- Department of Statistics, Computer Science, Applications, University of Florence
4
Shi Y, Zhang L, Do KA, Jenq R, Peterson CB. Sparse tree-based clustering of microbiome data to characterize microbiome heterogeneity in pancreatic cancer. J R Stat Soc Ser C Appl Stat 2023; 72:20-36. [PMID: 37034187 PMCID: PMC10077950 DOI: 10.1093/jrsssc/qlac002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/16/2023]
Abstract
There is a keen interest in characterizing variation in the microbiome across cancer patients, given increasing evidence of its important role in determining treatment outcomes. Here our goal is to discover subgroups of patients with similar microbiome profiles. We propose a novel unsupervised clustering approach in the Bayesian framework that innovates over existing model-based clustering approaches, such as the Dirichlet multinomial mixture model, in three key respects: we incorporate feature selection, learn the appropriate number of clusters from the data, and integrate information on the tree structure relating the observed features. We compare the performance of our proposed method to existing methods on simulated data designed to mimic real microbiome data. We then illustrate results obtained for our motivating data set, a clinical study aimed at characterizing the tumor microbiome of pancreatic cancer patients.
Affiliation(s)
- Yushu Shi
- Department of Statistics, University of Missouri, Columbia, Columbia, MO, USA
- Liangliang Zhang
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
- Kim-Anh Do
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Robert Jenq
- Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Christine B Peterson
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
5
Mao J, Ma L. Dirichlet-tree multinomial mixtures for clustering microbiome compositions. Ann Appl Stat 2022; 16:1476-1499. [DOI: 10.1214/21-aoas1552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Jialiang Mao
- Department of Statistical Science, Duke University
- Li Ma
- Department of Statistical Science, Duke University
6
Osborne N, Peterson CB, Vannucci M. Latent Network Estimation and Variable Selection for Compositional Data Via Variational EM. J Comput Graph Stat 2022; 31:163-175. [PMID: 36776345 PMCID: PMC9909885 DOI: 10.1080/10618600.2021.1935971] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 10/21/2022]
Abstract
Network estimation and variable selection have been extensively studied in the statistical literature, but only recently have those two challenges been addressed simultaneously. In this article, we seek to develop a novel method to simultaneously estimate network interactions and associations with relevant covariates for count data, and specifically for compositional data, which have a fixed sum constraint. We use a hierarchical Bayesian model with latent layers and employ spike-and-slab priors for both edge and covariate selection. For posterior inference, we develop a novel variational inference scheme with an expectation-maximization step to enable efficient estimation. Through simulation studies, we demonstrate that the proposed model outperforms existing methods in its accuracy of network recovery. We show the practical utility of our model via an application to microbiome data. The human microbiome has been shown to contribute to many of the functions of the human body, and also to be linked with a number of diseases. In our application, we seek to better understand the interaction between microbes and relevant covariates, as well as the interaction of microbes with each other. We call our algorithm simultaneous inference for networks and covariates and provide a Python implementation, which is available online.
Affiliation(s)
- Christine B. Peterson
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX
7
Zhang Q, Dao T. A distance based multisample test for high-dimensional compositional data with applications to the human microbiome. BMC Bioinformatics 2020; 21:205. [PMID: 33272203 PMCID: PMC7713147 DOI: 10.1186/s12859-020-3530-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/16/2020] [Accepted: 04/30/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Compositional data are data that lie on a simplex; they are common in many scientific domains such as genomics, geology and economics. As the components in a composition must sum to one, traditional tests based on unconstrained data become inappropriate, and new statistical methods are needed to analyze this special type of data. RESULTS: In this paper, we consider a general problem of testing for the compositional difference between K populations. Motivated by microbiome and metagenomics studies, where the data are often over-dispersed and high-dimensional, we formulate a well-posed hypothesis from a Bayesian point of view and suggest a nonparametric test based on inter-point distance to evaluate statistical significance. Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption or regularity conditions on the covariance matrix, but directly analyzes the compositions. Simulated data and two real data sets on the human microbiome are used to illustrate the promise of our method. CONCLUSIONS: Our simulation studies and real data applications demonstrate that the proposed test is more sensitive to the compositional difference than the mean-based method, especially when the data are over-dispersed or zero-inflated. The proposed test is easy to implement and computationally efficient, facilitating its application to large-scale datasets.
Affiliation(s)
- Qingyang Zhang
- Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA
- Thy Dao
- Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA
8
Koslovsky MD, Hoffman KL, Daniel CR, Vannucci M. A Bayesian model of microbiome data for simultaneous identification of covariate associations and prediction of phenotypic outcomes. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1354] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
9
Koslovsky MD, Vannucci M. MicroBVS: Dirichlet-tree multinomial regression models with Bayesian variable selection - an R package. BMC Bioinformatics 2020; 21:301. [PMID: 32660471 PMCID: PMC7359232 DOI: 10.1186/s12859-020-03640-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/02/2019] [Accepted: 07/02/2020] [Indexed: 11/29/2022] Open
Abstract
Background: Understanding the relation between the human microbiome and modulating factors, such as diet, may help researchers design intervention strategies that promote and maintain healthy microbial communities. Numerous analytical tools are available to help identify these relations, oftentimes via automated variable selection methods. However, available tools frequently ignore evolutionary relations among microbial taxa, potential relations between modulating factors, as well as model selection uncertainty. Results: We present MicroBVS, an R package for Dirichlet-tree multinomial models with Bayesian variable selection, for the identification of covariates associated with microbial taxa abundance data. The underlying Bayesian model accommodates phylogenetic structure in the abundance data and various parameterizations of covariates' prior probabilities of inclusion. Conclusion: While developed to study the human microbiome, our software can be employed in various research applications, where the aim is to generate insights into the relations between a set of covariates and compositional data with or without a known tree-like structure.
10
Liu T, Zhao H, Wang T. An empirical Bayes approach to normalization and differential abundance testing for microbiome data. BMC Bioinformatics 2020; 21:225. [PMID: 32493208 PMCID: PMC7268703 DOI: 10.1186/s12859-020-03552-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/18/2019] [Accepted: 05/18/2020] [Indexed: 12/14/2022] Open
Abstract
Background: Advances in DNA sequencing have offered researchers an unprecedented opportunity to better study the variety of species living in and on the human body. However, the analysis of microbiome data is complicated by several challenges. First, the sequencing depth may vary by orders of magnitude across samples. Second, species are rare and the data often contain many zeros. Third, the specimen is a fraction of the microbial ecosystem, and so the data are compositional, carrying only relative information. Other characteristics of microbiome data include pronounced over-dispersion in taxon abundances and the existence of a phylogenetic tree that relates all bacterial species. To address some of these challenges, microbiome analysis workflows often normalize the read counts prior to downstream analysis. However, there are limitations in the current literature on the normalization of microbiome data. Results: Under the multinomial distribution for the read counts and a prior for the unknown proportions, we propose an empirical Bayes approach to microbiome data normalization. Using a tree-based extension of the Dirichlet prior, we further extend our method by incorporating the phylogenetic tree into the normalization process. We study the impact of normalization on differential abundance analysis. In the presence of tree structure, we propose a phylogeny-aware detection procedure. Conclusions: Extensive simulations and gut microbiome data applications are conducted to demonstrate the superior performance of our empirical Bayes method over other normalization methods, and over commonly used methods for differential abundance testing. Original R scripts are available at GitHub (https://github.com/liudoubletian/eBay).
Affiliation(s)
- Tiantian Liu
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China; SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Hongyu Zhao
- Department of Biostatistics, Yale University, 300 George Street, New Haven, 06511, USA; SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Tao Wang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China; SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China; MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
11
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Prog Mol Biol Transl Sci 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in the analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and the future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
- Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States
12
Affiliation(s)
- Jialiang Mao
- Department of Statistical Science, Duke University, Durham, NC
- Yuhan Chen
- Department of Statistical Science, Duke University, Durham, NC
- Li Ma
- Department of Statistical Science, Duke University, Durham, NC
13
Xia Y, Sun J, Chen DG. Introductory Overview of Statistical Analysis of Microbiome Data. Statistical Analysis of Microbiome Data with R 2018. [DOI: 10.1007/978-981-13-1534-3_3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/11/2022]