1. Sun Z, Song K. GEMimp: An accurate and robust imputation method for microbiome data using graph embedding neural network. J Mol Biol 2024:168841. [PMID: 39490678 DOI: 10.1016/j.jmb.2024.168841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 10/23/2024] [Accepted: 10/23/2024] [Indexed: 11/05/2024]
Abstract
Microbiome research has increasingly underscored the profound link between microbial compositions and human health, with numerous studies establishing a strong correlation between microbiome characteristics and various diseases. However, the analysis of microbiome data is frequently compromised by inherent sparsity, characterized by a substantial proportion of observed zeros. These zeros not only skew the abundance distribution of microbial species but also undermine the reliability of scientific conclusions drawn from such data. To address this challenge, we introduce GEMimp, an imputation method designed to make microbiome data analysis more robust. GEMimp leverages the node2vec algorithm, which incorporates both Breadth-First Search (BFS) and Depth-First Search (DFS) strategies in its random-walk sampling process. This approach enables GEMimp to learn low-dimensional representations of each taxonomic unit, facilitating accurate reconstruction of their similarity networks. Our comparative analysis evaluates GEMimp against state-of-the-art imputation methods including SAVER, MAGIC and mbImpute. The results demonstrate that GEMimp outperforms its counterparts, achieving the highest Pearson correlation coefficient when compared to the original raw dataset. Furthermore, GEMimp shows notable proficiency in identifying significant taxa, enhancing the detection of disease-related taxa and effectively mitigating the impact of sparsity on both simulated and real-world datasets, such as those pertaining to Type 2 Diabetes (T2D) and Colorectal Cancer (CRC). These findings collectively highlight the effectiveness of GEMimp for the analysis of microbial data. By alleviating sparsity, GEMimp could greatly facilitate downstream analyses and microbiology research more broadly.
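To make the BFS/DFS trade-off in node2vec's random walks concrete, here is a minimal sketch of a single second-order biased walk. It is not taken from GEMimp: the toy graph, the function name `node2vec_walk`, and the parameter values are illustrative assumptions, and the abstract does not say how GEMimp builds its graph or uses the resulting embeddings.

```python
import random

def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """One biased random walk in the style of node2vec (illustrative sketch).

    graph: dict mapping node -> set of neighbours (undirected, unweighted).
    p: return parameter; q: in-out parameter.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(graph[cur])
        if not neighbours:
            break
        if len(walk) == 1:
            # First step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbours))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)   # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)       # stay near prev (BFS-like)
            else:
                weights.append(1.0 / q)   # move away from prev (DFS-like)
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk

# Hypothetical similarity graph over five taxa (assumed for illustration).
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}
walk = node2vec_walk(graph, "A", 10, p=1.0, q=0.5, rng=random.Random(0))
```

With q below 1 the walk favours nodes farther from the previous node (DFS-like exploration); with q above 1 it stays local (BFS-like). In node2vec, many such walks are then fed to a skip-gram model to obtain the embeddings.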
Affiliation(s)
- Ziwei Sun
  - School of Mathematics and Statistics, Qingdao University, Qingdao, China.
- Kai Song
  - School of Mathematics and Statistics, Qingdao University, Qingdao, China.
2. Wirbel J, Essex M, Forslund SK, Zeller G. A realistic benchmark for differential abundance testing and confounder adjustment in human microbiome studies. Genome Biol 2024; 25:247. [PMID: 39322959 PMCID: PMC11423519 DOI: 10.1186/s13059-024-03390-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2023] [Accepted: 09/06/2024] [Indexed: 09/27/2024] Open
Abstract
BACKGROUND In microbiome disease association studies, it is a fundamental task to test which microbes differ in their abundance between groups. Yet, consensus on suitable or optimal statistical methods for differential abundance testing is lacking, and it remains unexplored how these cope with confounding. Previous differential abundance benchmarks relying on simulated datasets did not quantitatively evaluate the similarity to real data, which undermines their recommendations. RESULTS Our simulation framework implants calibrated signals into real taxonomic profiles, including signals mimicking confounders. Using several whole meta-genome and 16S rRNA gene amplicon datasets, we validate that our simulated data resembles real data from disease association studies much more than in previous benchmarks. With extensively parametrized simulations, we benchmark the performance of nineteen differential abundance methods and further evaluate the best ones on confounded simulations. Only classic statistical methods (linear models, the Wilcoxon test, t-test), limma, and fastANCOM properly control false discoveries at relatively high sensitivity. When additionally considering confounders, these issues are exacerbated, but we find that adjusted differential abundance testing can effectively mitigate them. In a large cardiometabolic disease dataset, we showcase that failure to account for covariates such as medication causes spurious association in real-world applications. CONCLUSIONS Tight error control is critical for microbiome association studies. The unsatisfactory performance of many differential abundance methods and the persistent danger of unchecked confounding suggest these contribute to a lack of reproducibility among such studies. We have open-sourced our simulation and benchmarking software to foster a much-needed consolidation of statistical methodology for microbiome research.
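One simplified reading of "implanting a signal into real taxonomic profiles" can be sketched as follows: randomly assign samples to two groups, pick a few taxa, and scale their abundances in one group only. The function name `implant_signal`, the multiplicative `effect_size`, and the balanced group split are assumptions made for illustration; the abstract does not describe how the authors calibrate their signals or mimic confounders.

```python
import random

def implant_signal(profiles, n_features, effect_size, rng):
    """Scale selected taxa in a randomly chosen 'case' group (illustrative sketch).

    profiles: list of samples, each a list of abundances in the same taxon order.
    Returns (simulated profiles, group labels, indices of the altered taxa).
    """
    n_samples = len(profiles)
    n_taxa = len(profiles[0])
    # Balanced random group assignment: 1 = case, 0 = control.
    labels = [1] * (n_samples // 2) + [0] * (n_samples - n_samples // 2)
    rng.shuffle(labels)
    chosen = set(rng.sample(range(n_taxa), n_features))
    simulated = [
        [x * effect_size if (lab == 1 and j in chosen) else x
         for j, x in enumerate(row)]
        for row, lab in zip(profiles, labels)
    ]
    return simulated, labels, sorted(chosen)

# Toy abundance table (assumed data): 4 samples x 3 taxa.
profiles = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 1.0, 1.0]]
simulated, labels, chosen = implant_signal(profiles, 1, 2.0, random.Random(1))
```

Because the ground truth (`chosen`) is known, the false discoveries and sensitivity of any differential abundance test can be scored directly against it.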
Affiliation(s)
- Jakob Wirbel
  - Structural and Computational Biology Unit (SCB), European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- Morgan Essex
  - Experimental and Clinical Research Center (ECRC), a cooperation of the Max-Delbrück Center and Charité-Universitätsmedizin, Berlin, Germany
  - Max-Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany
  - Charité-Universitätsmedizin Berlin (a corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin), Berlin, Germany
- Sofia Kirke Forslund
  - Structural and Computational Biology Unit (SCB), European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.
  - Experimental and Clinical Research Center (ECRC), a cooperation of the Max-Delbrück Center and Charité-Universitätsmedizin, Berlin, Germany.
  - Max-Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany.
  - Charité-Universitätsmedizin Berlin (a corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin), Berlin, Germany.
  - German Center for Cardiovascular Research (DZHK), Partner Site Berlin, Berlin, Germany.
- Georg Zeller
  - Structural and Computational Biology Unit (SCB), European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.
  - Center for Infectious Diseases (LUCID), Leiden University, Leiden University Medical Center (LUMC), Leiden, Netherlands.
  - Center for Microbiome Analyses and Therapeutics (CMAT), Leiden University Medical Center, Leiden, Netherlands.
3. Chi J, Ye J, Zhou Y. A GLM-based zero-inflated generalized Poisson factor model for analyzing microbiome data. Front Microbiol 2024; 15:1394204. [PMID: 38873138 PMCID: PMC11173601 DOI: 10.3389/fmicb.2024.1394204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 05/20/2024] [Indexed: 06/15/2024] Open
Abstract
Motivation High-throughput sequencing technology facilitates the quantitative analysis of microbial communities, improving the capacity to investigate the associations between the human microbiome and diseases. Our primary motivating application is to explore the association between gut microbes and obesity. The complex characteristics of microbiome data, including high dimensionality, zero inflation, and over-dispersion, pose new statistical challenges for downstream analysis. Results We propose a GLM-based zero-inflated generalized Poisson factor analysis (GZIGPFA) model to analyze microbiome data with complex characteristics. The GZIGPFA model is based on a zero-inflated generalized Poisson (ZIGP) distribution for modeling microbiome count data. A link function between the generalized Poisson rate and the probability of excess zeros is established within the generalized linear model (GLM) framework. The latent parameters of the GZIGPFA model constitute a low-rank matrix comprising a low-dimensional score matrix and a loading matrix. An alternating maximum likelihood algorithm is employed to estimate the unknown parameters, and cross-validation is utilized to determine the rank of the model in this study. The proposed GZIGPFA model demonstrates superior performance and advantages through comprehensive simulation studies and real data applications.
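As a rough illustration of what a zero-inflated generalized Poisson (ZIGP) distribution looks like, the sketch below evaluates its probability mass function. It assumes Consul's parameterization of the generalized Poisson, with rate `theta > 0` and dispersion `0 <= lam < 1`, plus an excess-zero probability `pi`; the paper may use a different parameterization, and the link function and factor structure of GZIGPFA are not modeled here.

```python
import math

def gp_log_pmf(y, theta, lam):
    """Log-pmf of Consul's generalized Poisson (assumed parameterization):
    P(Y=y) = theta * (theta + lam*y)**(y-1) * exp(-theta - lam*y) / y!
    """
    return (math.log(theta) + (y - 1) * math.log(theta + lam * y)
            - theta - lam * y - math.lgamma(y + 1))

def zigp_pmf(y, theta, lam, pi):
    """Zero-inflated generalized Poisson: extra point mass `pi` at zero."""
    base = math.exp(gp_log_pmf(y, theta, lam))
    if y == 0:
        return pi + (1.0 - pi) * base
    return (1.0 - pi) * base

# Illustrative parameter values (not estimates from any dataset).
p_zero = zigp_pmf(0, theta=2.0, lam=0.3, pi=0.2)
total = sum(zigp_pmf(y, theta=2.0, lam=0.3, pi=0.2) for y in range(300))
```

Setting `lam = 0` recovers the ordinary zero-inflated Poisson, while `lam > 0` adds over-dispersion; `pi` controls the share of excess zeros.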
Affiliation(s)
- Jinling Chi
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Jimin Ye
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Ying Zhou
  - School of Mathematical Sciences, Heilongjiang University, Harbin, China
4. Cho EJ, Kim B, Yu SJ, Hong SK, Choi Y, Yi NJ, Lee KW, Suh KS, Yoon JH, Park T. Urinary microbiome-based metagenomic signature for the noninvasive diagnosis of hepatocellular carcinoma. Br J Cancer 2024; 130:970-975. [PMID: 38278977 PMCID: PMC10951239 DOI: 10.1038/s41416-024-02582-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Revised: 01/08/2024] [Accepted: 01/10/2024] [Indexed: 01/28/2024] Open
Abstract
BACKGROUND Gut microbial dysbiosis is implicated in chronic liver disease and hepatocellular carcinoma (HCC), but the role of microbiomes from various body sites remains unexplored. We assessed disease-specific alterations in the urinary microbiome in HCC patients, investigating their potential as diagnostic biomarkers. METHODS We performed cross-sectional analyses of urine samples from 471 HCC patients and 397 healthy controls and validated the results in an independent cohort of 164 HCC patients and 164 healthy controls. Urinary microbiomes were analyzed by 16S rRNA gene sequencing. A microbial marker-based model distinguishing HCC from controls was built based on logistic regression, and its performance was tested. RESULTS Microbial diversity was significantly reduced in the HCC patients compared with the controls. There were significant differences in the abundances of various bacteria correlated with HCC, thus defining a urinary microbiome-derived signature of HCC. We developed nine HCC-associated genera-based models with robust diagnostic accuracy (area under the curve [AUC], 0.89; balanced accuracy, 81.2%). In the validation, this model detected HCC with an AUC of 0.94 and an accuracy of 88.4%. CONCLUSIONS The urinary microbiome might be a potential biomarker for the detection of HCC. Further clinical testing and validation of these results are needed in prospective studies.
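The two performance measures quoted above, AUC and balanced accuracy, can be computed from model scores as in the following sketch. The scores, labels, and the 0.5 decision threshold are made-up illustrative values, not data from the study, and the function names are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random case outscores a random control
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(preds, labels):
    """Mean of sensitivity and specificity."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

# Toy example (assumed values): 1 = HCC, 0 = healthy control.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
preds = [1 if s >= 0.5 else 0 for s in scores]
auc = roc_auc(scores, labels)
bal_acc = balanced_accuracy(preds, labels)
```

AUC is threshold-free, whereas balanced accuracy depends on the chosen cut-off, which is why the two are usually reported together.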
Affiliation(s)
- Eun Ju Cho
  - Department of Internal Medicine and Liver Research Institute, Seoul National University College of Medicine, Seoul, 03080, Korea
- Boram Kim
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, Korea
- Su Jong Yu
  - Department of Internal Medicine and Liver Research Institute, Seoul National University College of Medicine, Seoul, 03080, Korea
- Suk Kyun Hong
  - Department of Surgery, Seoul National University College of Medicine, Seoul, 03080, Korea
- YoungRok Choi
  - Department of Surgery, Seoul National University College of Medicine, Seoul, 03080, Korea
- Nam-Joon Yi
  - Department of Surgery, Seoul National University College of Medicine, Seoul, 03080, Korea
- Kwang-Woong Lee
  - Department of Surgery, Seoul National University College of Medicine, Seoul, 03080, Korea
- Kyung-Suk Suh
  - Department of Surgery, Seoul National University College of Medicine, Seoul, 03080, Korea
- Jung-Hwan Yoon
  - Department of Internal Medicine and Liver Research Institute, Seoul National University College of Medicine, Seoul, 03080, Korea.
- Taesung Park
  - Department of Statistics, Seoul National University, Seoul, 08826, Korea.
5. Cho H, Qu Y, Liu C, Tang B, Lyu R, Lin BM, Roach J, Azcarate-Peril MA, Aguiar Ribeiro A, Love MI, Divaris K, Wu D. Comprehensive evaluation of methods for differential expression analysis of metatranscriptomics data. Brief Bioinform 2023; 24:bbad279. [PMID: 37738402 PMCID: PMC10516371 DOI: 10.1093/bib/bbad279] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/07/2023] [Revised: 06/23/2023] [Accepted: 07/18/2023] [Indexed: 09/24/2023] Open
Abstract
Understanding the function of the human microbiome is important but the development of statistical methods specifically for the microbial gene expression (i.e. metatranscriptomics) is in its infancy. Many currently employed differential expression analysis methods have been designed for different data types and have not been evaluated in metatranscriptomics settings. To address this gap, we undertook a comprehensive evaluation and benchmarking of 10 differential analysis methods for metatranscriptomics data. We used a combination of real and simulated data to evaluate performance (i.e. type I error, false discovery rate and sensitivity) of the following methods: log-normal (LN), logistic-beta (LB), MAST, DESeq2, metagenomeSeq, ANCOM-BC, LEfSe, ALDEx2, Kruskal-Wallis and two-part Kruskal-Wallis. The simulation was informed by supragingival biofilm microbiome data from 300 preschool-age children enrolled in a study of childhood dental disease (early childhood caries, ECC), whereas validations were sought in two additional datasets from the ECC study and an inflammatory bowel disease study. The LB test showed the highest sensitivity in both small and large samples and reasonably controlled type I error. Contrarily, MAST was hampered by inflated type I error. Upon application of the LN and LB tests in the ECC study, we found that genes C8PHV7 and C8PEV7, harbored by the lactate-producing Campylobacter gracilis, had the strongest association with childhood dental disease. This comprehensive model evaluation offers practical guidance for selection of appropriate methods for rigorous analyses of differential expression in metatranscriptomics. Selection of an optimal method increases the possibility of detecting true signals while minimizing the chance of claiming false ones.
Affiliation(s)
- Hunyong Cho
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC, United States
- Yixiang Qu
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC, United States
- Chuwen Liu
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC, United States
- Boyang Tang
  - Department of Statistics, University of Connecticut, Storrs, CT, United States
- Ruiqi Lyu
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
- Bridget M Lin
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC, United States
- Jeffrey Roach
  - Research Computing, University of North Carolina, Chapel Hill, NC, United States
- M Andrea Azcarate-Peril
  - Department of Medicine and Nutrition, University of North Carolina, Chapel Hill, NC, United States
- Apoena Aguiar Ribeiro
  - Division of Diagnostic Sciences, University of North Carolina, Chapel Hill, NC, United States
- Michael I Love
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC, United States
  - Department of Genetics, University of North Carolina, Chapel Hill, NC, United States
- Kimon Divaris
  - Division of Pediatric and Public Health, University of North Carolina, Chapel Hill, NC, United States
  - Department of Epidemiology, University of North Carolina, Chapel Hill, NC, United States
- Di Wu
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC, United States
  - Division of Oral and Craniofacial Health Sciences, Adam School of Dentistry, University of North Carolina, Chapel Hill, NC, United States
  - Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, United States
6. Yin X, Martineau C, Samad A, Fenton NJ. Out of site, out of mind: Changes in feather moss phyllosphere microbiota in mine offsite boreal landscapes. Front Microbiol 2023; 14:1148157. [PMID: 37089542 PMCID: PMC10113616 DOI: 10.3389/fmicb.2023.1148157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Accepted: 03/14/2023] [Indexed: 04/07/2023] Open
Abstract
Plant-microbe interactions play a crucial role in maintaining biodiversity and ecological services in boreal forest biomes. Mining for minerals, and especially the emission of heavy metal-enriched dust from mine sites, is a potential threat to biodiversity in offsite landscapes. Understanding the impacts of mining on surrounding phyllosphere microbiota is especially lacking. To investigate this, we characterized bacterial and fungal communities in the phyllosphere of feather moss Pleurozium schreberi (Brid.) Mitt. in boreal landscapes near six gold mine sites at different stages of the mine lifecycle. We found that (1) both mining stage and ecosystem type are drivers of the phyllosphere microbial community structure in mine offsite landscapes; (2) bacterial alpha diversity is more sensitive than fungal alpha diversity to mining stage, while beta diversity of both groups is impacted; (3) mixed and deciduous forests have a higher alpha diversity and a distinct microbial community structure when compared to coniferous and open canopy ecosystems; (4) the strongest effects are detectable within 0.2 km from operating mines. These results confirmed the presence of offsite effects of mine sites on the phyllosphere microbiota in boreal forests, and identified mining stage and ecosystem type as drivers of these effects. Furthermore, the footprint was quantified at 0.2 km, providing a reference distance within which mining companies and policy makers should pay more attention during ecological assessment and for the development of mitigation strategies. Further studies are needed to assess how these offsite effects of mines affect the functioning of boreal ecosystems.
Affiliation(s)
- Xiangbo Yin
  - NSERC-UQAT Industrial Chair in Northern Biodiversity in a Mining Context, Rouyn-Noranda, QC, Canada
  - Centre d’Étude de la Forêt, Institut de Recherche sur les Forêts (IRF), Université du Québec en Abitibi-Témiscamingue (UQAT), Rouyn-Noranda, QC, Canada
  - *Correspondence: Xiangbo Yin,
- Christine Martineau
  - NSERC-UQAT Industrial Chair in Northern Biodiversity in a Mining Context, Rouyn-Noranda, QC, Canada
  - Natural Resources Canada, Canadian Forest Service, Laurentian Forestry Centre, Quebec City, QC, Canada
- Abdul Samad
  - Natural Resources Canada, Canadian Forest Service, Laurentian Forestry Centre, Quebec City, QC, Canada
- Nicole J. Fenton
  - NSERC-UQAT Industrial Chair in Northern Biodiversity in a Mining Context, Rouyn-Noranda, QC, Canada
  - Centre d’Étude de la Forêt, Institut de Recherche sur les Forêts (IRF), Université du Québec en Abitibi-Témiscamingue (UQAT), Rouyn-Noranda, QC, Canada
7. Dousti Mousavi N, Yang J, Aldirawi H. Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data. Genes (Basel) 2023; 14:403. [PMID: 36833330 PMCID: PMC9956208 DOI: 10.3390/genes14020403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Revised: 01/19/2023] [Accepted: 01/31/2023] [Indexed: 02/09/2023] Open
Abstract
Sparse data with a high proportion of zeros arise in various disciplines. Modeling sparse high-dimensional data is a challenging and growing research area. In this paper, we provide statistical methods and tools for analyzing sparse data in a fairly general and complex context. We use two real scientific applications as illustrations: a longitudinal vaginal microbiome dataset and a high-dimensional gene expression dataset. We recommend zero-inflated model selections and significance tests to identify the time intervals when the pregnant and non-pregnant groups of women are significantly different in terms of Lactobacillus species. We apply the same techniques to select the best 50 genes out of 2426 in sparse gene expression data. The classification based on our selected genes achieves 100% prediction accuracy. Furthermore, the first four principal components based on the selected genes can explain as much as 83% of the model variability.
Affiliation(s)
- Niloufar Dousti Mousavi
  - Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Jie Yang
  - Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Hani Aldirawi
  - Department of Mathematics, California State University—San Bernardino, San Bernardino, CA 92407, USA
8. Mildau K, te Beest DE, Engel B, Gort G, Lambert J, Swinkels SN, van Eeuwijk F. Pairwise ratio-based differential abundance analysis of infant microbiome 16S sequencing data. NAR Genom Bioinform 2023; 5:lqad001. [PMID: 36685726 PMCID: PMC9853100 DOI: 10.1093/nargab/lqad001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 10/25/2022] [Accepted: 01/18/2023] [Indexed: 01/22/2023] Open
Abstract
Differential abundance analysis of infant 16S microbial sequencing data is complicated by challenging data properties, including high sparsity, extreme dispersion and the relative nature of the information contained within the data. In this study, we propose a pairwise ratio analysis that uses the compositional data analysis principle of subcompositional coherence and merges it with a beta-binomial regression model. The resulting method provides a flexible and easily interpretable approach to infant 16S sequencing data differential abundance analysis that does not require zero imputation. We evaluate the proposed method using infant 16S data from clinical trials and demonstrate that the proposed method has the power to detect differences, and demonstrate how its results can be used to gain insights. We further evaluate the method using data-inspired simulations and compare its power against related methods. Our results indicate that power is high for pairwise differential abundance analysis of taxon pairs that have a large abundance. In contrast, results for sparse taxon pairs show a decrease in power and substantial variability in method performance. While our method shows promising performance on well-measured subcompositions, we advise strong filtering steps in order to avoid excessive numbers of underpowered comparisons in practical applications.
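One way to picture a pairwise ratio analysis with a beta-binomial model is to treat the read count of taxon A as the number of "successes" out of the combined reads of the pair A and B, which needs no zero imputation because zero counts are valid outcomes. The sketch below only evaluates the beta-binomial probability mass function; the pairing scheme, the parameter names `alpha` and `beta`, and the example counts are assumptions, and the regression part of the authors' method is not reproduced.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(K = k) for K ~ BetaBinomial(n, alpha, beta)."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return math.exp(log_choose
                    + log_beta(k + alpha, n - k + beta)
                    - log_beta(alpha, beta))

# Hypothetical taxon pair in one sample: 0 reads of A, 12 reads of B.
count_a, count_b = 0, 12
prob = beta_binomial_pmf(count_a, count_a + count_b, alpha=0.5, beta=2.0)
```

Small `alpha` and `beta` give the extra dispersion that a plain binomial cannot capture, which matches the "extreme dispersion" the abstract mentions.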
Affiliation(s)
- Bas Engel
  - Biometris, Wageningen University & Research, 6700 HB Wageningen, The Netherlands
- Gerrit Gort
  - Biometris, Wageningen University & Research, 6700 HB Wageningen, The Netherlands
- Jolanda Lambert
  - Danone Nutricia Research, Uppsalalaan 12, 3584 CT Utrecht, The Netherlands
- Fred A van Eeuwijk
  - Biometris, Wageningen University & Research, 6700 HB Wageningen, The Netherlands
9.
Abstract
Since advances in next-generation sequencing (NGS) techniques have made it possible to investigate uncultured microbiota and their genomes in an unbiased manner, many microbiome studies have reported strong evidence for close links between the microbiome and human health and disease. Bioinformatic and statistical analyses of NGS-based microbiome data are essential components of these studies, used to explore the complex composition of microbial communities and to understand the functions of community members in relation to host and environment. This chapter introduces bioinformatic analysis methods that generate taxonomic and functional feature count tables, along with a phylogenetic tree, from raw NGS microbiome data, and then introduces statistical methods and machine learning approaches for analyzing the outputs of the bioinformatic analysis to infer the biodiversity of a microbial community and unravel host-microbiome associations. Understanding the advantages and limitations of these analysis methods will help readers use them correctly in microbiome data analysis and may open opportunities to develop new analytic techniques for microbiome research.
Affiliation(s)
- Youngchul Kim
  - Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA.
10. Jurburg SD, Buscot F, Chatzinotas A, Chaudhari NM, Clark AT, Garbowski M, Grenié M, Hom EFY, Karakoç C, Marr S, Neumann S, Tarkka M, van Dam NM, Weinhold A, Heintz-Buschart A. The community ecology perspective of omics data. Microbiome 2022; 10:225. [PMID: 36510248 PMCID: PMC9746134 DOI: 10.1186/s40168-022-01423-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2022] [Accepted: 11/10/2022] [Indexed: 06/17/2023]
Abstract
The measurement of uncharacterized pools of biological molecules through techniques such as metabarcoding, metagenomics, metatranscriptomics, metabolomics, and metaproteomics produces large, multivariate datasets. Analyses of these datasets have successfully been borrowed from community ecology to characterize the molecular diversity of samples (α-diversity) and to assess how these profiles change in response to experimental treatments or across gradients (β-diversity). However, sample preparation and data collection methods generate biases and noise which confound molecular diversity estimates and require special attention. Here, we examine how technical biases and noise that are introduced into multivariate molecular data affect the estimation of the components of diversity (i.e., total number of different molecular species, or entities; total number of molecules; and the abundance distribution of molecular entities). We then explore under which conditions these biases affect the measurement of α- and β-diversity and highlight how novel methods commonly used in community ecology can be adopted to improve the interpretation and integration of multivariate molecular data.
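For readers unfamiliar with the two diversity notions, the sketch below computes one common index of each kind: the Shannon index for within-sample (α) diversity and Bray-Curtis dissimilarity for between-sample (β) diversity. The abstract does not name specific indices, so these choices and the toy counts are assumptions.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over entities with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

# Two toy samples over the same four molecular entities (assumed counts).
sample_a = [10, 0, 5, 5]
sample_b = [5, 5, 5, 5]
alpha_a = shannon_index(sample_a)
alpha_b = shannon_index(sample_b)
beta_ab = bray_curtis(sample_a, sample_b)
```

Both indices depend on the total number of molecules counted and on which entities are detected at all, which is why the technical biases discussed in the abstract carry over into diversity estimates.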
Affiliation(s)
- Stephanie D Jurburg
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany.
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
  - Institute of Biology, Leipzig University, Leipzig, Germany.
- François Buscot
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Department of Soil Ecology, Helmholtz Centre for Environmental Research- UFZ, Halle, Germany
- Antonis Chatzinotas
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Institute of Biology, Leipzig University, Leipzig, Germany
- Narendrakumar M Chaudhari
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Institute of Biodiversity, Friedrich Schiller University, Jena, Germany
- Adam T Clark
  - Institute of Biology, University of Graz, Graz, Austria
- Magda Garbowski
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Department of Botany, University of Wyoming, Wyoming, USA
- Matthias Grenié
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Institute of Biology, Leipzig University, Leipzig, Germany
- Erik F Y Hom
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Department of Biology and Center for Biodiversity and Conservation Research, University of Mississippi, Oxford, Mississippi, USA
- Canan Karakoç
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Department of Biology, Indiana University, Indiana, USA
- Susanne Marr
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Institute of Biology, Geobotany and Botanical Garden, Martin Luther University Halle Wittenberg, Halle, Germany
  - Leibniz Institute of Plant Biochemistry, Bioinformatics and Scientific Data, Halle, Germany
- Steffen Neumann
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Leibniz Institute of Plant Biochemistry, Bioinformatics and Scientific Data, Halle, Germany
- Mika Tarkka
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Department of Soil Ecology, Helmholtz Centre for Environmental Research- UFZ, Halle, Germany
- Nicole M van Dam
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Institute of Biodiversity, Friedrich Schiller University, Jena, Germany
  - Leibniz Institute of Vegetable and Ornamental Crops (IGZ), Großbeeren, Germany
- Alexander Weinhold
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
  - Institute of Biodiversity, Friedrich Schiller University, Jena, Germany
- Anna Heintz-Buschart
  - Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, Netherlands
11. Hu J, Wang C, Blaser MJ, Li H. Joint modeling of zero-inflated longitudinal proportions and time-to-event data with application to a gut microbiome study. Biometrics 2022; 78:1686-1698. [PMID: 34213763 PMCID: PMC8720317 DOI: 10.1111/biom.13515] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/12/2020] [Revised: 05/26/2021] [Accepted: 06/17/2021] [Indexed: 12/30/2022]
Abstract
Recent studies have suggested that the temporal dynamics of the human microbiome may have associations with human health and disease. An increasing number of longitudinal microbiome studies, which record time to disease onset, aim to identify candidate microbes as biomarkers for prognosis. Owing to the ultra-skewness and sparsity of microbiome proportion (relative abundance) data, directly applying traditional statistical methods may result in substantial power loss or spurious inferences. We propose a novel joint modeling framework [JointMM], which is comprised of two sub-models: a longitudinal sub-model called zero-inflated scaled-beta generalized linear mixed-effects regression to depict the temporal structure of microbial proportions among subjects; and a survival sub-model to characterize the occurrence of an event and its relationship with the longitudinal microbiome proportions. JointMM is specifically designed to handle the zero-inflated and highly skewed longitudinal microbial proportion data and examine whether the temporal pattern of microbial presence and/or the nonzero microbial proportions are associated with differences in the time to an event. The longitudinal sub-model of JointMM also provides the capacity to investigate how the (time-varying) covariates are related to the temporal microbial presence/absence patterns and/or the changing trend in nonzero proportions. Comprehensive simulations and real data analyses are used to assess the statistical efficiency and interpretability of JointMM.
Affiliation(s)
- Jiyuan Hu
  - Division of Biostatistics, Department of Population Health, New York University Grossman School of Medicine, New York, NY 10016, U.S.A
- Chan Wang
  - Division of Biostatistics, Department of Population Health, New York University Grossman School of Medicine, New York, NY 10016, U.S.A
- Martin J. Blaser
  - Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, NJ 08854, U.S.A
- Huilin Li
  - Division of Biostatistics, Department of Population Health, New York University Grossman School of Medicine, New York, NY 10016, U.S.A
|
12
|
Carson TL, Buro AW, Miller D, Peña A, Ard JD, Lampe JW, Yi N, Lefkowitz E, William VDP, Morrow C, Wilson L, Barnes S, Demark-Wahnefried W. Rationale and study protocol for a randomized controlled feeding study to determine the structural- and functional-level effects of diet-specific interventions on the gut microbiota of non-Hispanic black and white adults. Contemp Clin Trials 2022; 123:106968. [PMID: 36265810 PMCID: PMC10095329 DOI: 10.1016/j.cct.2022.106968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2022] [Revised: 10/12/2022] [Accepted: 10/13/2022] [Indexed: 01/27/2023]
Abstract
BACKGROUND: Colorectal cancer (CRC), the third leading cause of cancer-related deaths in the US, has been associated with an overrepresentation or paucity of several microbial taxa in the gut microbiota, but causality has not been established. Black men and women have among the highest CRC incidence and mortality rates of any racial/ethnic group. This study will examine the impact of the Dietary Approaches to Stop Hypertension (DASH) diet on gut microbiota and fecal metabolites associated with CRC risk. METHODS: A generally healthy sample of non-Hispanic Black and White adults (n = 112) is being recruited to participate in a parallel-arm randomized controlled feeding study. Participants are randomized to receive the DASH diet or a standard American diet for a 28-day period. Fecal samples are collected weekly throughout the study to analyze changes in the gut microbiota using 16S rRNA and selected metagenomics. Differences in bacterial alpha and beta diversity and taxa that have been associated with CRC (Bacteroides, Fusobacterium, Clostridium, Lactobacillus, Bifidobacterium, Ruminococcus, Porphyromonas, Succinivibrio) are being evaluated. Covariate measures include body mass index, comorbidities, medication history, physical activity, stress, and demographic characteristics. CONCLUSION: Our findings will provide preliminary evidence for the DASH diet as an approach for cultivating a healthier gut microbiota across non-Hispanic Black and non-Hispanic White adults. These results can impact clinical, translational, and population-level approaches for modification of the gut microbiota to reduce risk of chronic diseases including CRC. TRIAL REGISTRATION: This study was registered on ClinicalTrials.gov, identifier NCT04538482, on September 4, 2020 (https://clinicaltrials.gov/ct2/show/NCT04538482).
Affiliation(s)
- Tiffany L Carson
  - Department of Health Outcomes and Behavior, Moffitt Cancer Center, Tampa, FL, United States of America
- Acadia W Buro
  - Department of Health Outcomes and Behavior, Moffitt Cancer Center, Tampa, FL, United States of America
- Darci Miller
  - Department of Health Outcomes and Behavior, Moffitt Cancer Center, Tampa, FL, United States of America
- Alissa Peña
  - Department of Health Outcomes and Behavior, Moffitt Cancer Center, Tampa, FL, United States of America
- Jamy D Ard
  - Department of Epidemiology and Prevention, Wake Forest School of Medicine, Winston-Salem, NC, United States of America
- Johanna W Lampe
  - Public Health Science Division, Fred Hutchinson Cancer Center, Seattle, WA, United States of America
- Nengjun Yi
  - Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Elliot Lefkowitz
  - Center for Clinical and Translational Sciences, University of Alabama at Birmingham, Birmingham, AL, United States of America; Department of Microbiology, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Van Der Pol William
  - Center for Clinical and Translational Sciences, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Casey Morrow
  - Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Landon Wilson
  - Department of Pharmacology and Toxicology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States of America; Targeted Metabolomics and Proteomics Laboratory, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Stephen Barnes
  - Department of Pharmacology and Toxicology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States of America; Targeted Metabolomics and Proteomics Laboratory, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Wendy Demark-Wahnefried
  - O'Neal Comprehensive Cancer Center, University of Alabama at Birmingham, Birmingham, AL, United States of America
|
13
|
Ham H, Park T. Combining p-values from various statistical methods for microbiome data. Front Microbiol 2022; 13:990870. [PMID: 36439799 PMCID: PMC9686280 DOI: 10.3389/fmicb.2022.990870] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2022] [Accepted: 10/11/2022] [Indexed: 08/30/2023]
Abstract
MOTIVATION: In microbiome analysis, various statistical methods have been developed to identify differentially expressed features while accounting for the overdispersion and high sparsity of microbiome data. However, because of differences in statistical models or test formulations, significance results are often inconsistent across methods, which makes it difficult to determine the importance of microbiome taxa. It is therefore practically important to integrate the results from all statistical methods. A standard meta-analysis is a powerful tool for integrative analysis, and it provides a summary measure by combining p-values from various statistical methods. While many meta-analysis methods are available, it is not easy to choose the one most suitable for microbiome data. RESULTS: In this study, we investigated which meta-analysis method most adequately represents the importance of microbiome taxa. We considered Fisher's method, the minimum p-value method, the Simes method, Stouffer's method, the Kost method, and the Cauchy combination test. Through simulation studies, we showed that the Cauchy combination test provides the best combined p-value, in the sense that it performed best among the examined methods while controlling the type 1 error rate. Furthermore, it produced high rank similarity with the true ranks. In a real-data application to colorectal cancer microbiome data, we demonstrated that the taxa ranked highest by the Cauchy combination test have been reported to be associated with colorectal cancer.
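For readers unfamiliar with the Cauchy combination test named in this abstract, the following is a minimal Python sketch of its standard form, not the authors' implementation. Each p-value p_i is mapped to tan((0.5 - p_i)π), the weighted sum T is formed, and the combined p-value is 0.5 - arctan(T)/π. The function name `cauchy_combination` and the default of equal weights are assumptions made for illustration; the abstract does not say which weights were used.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination test.

    Hypothetical helper for illustration; equal weights are assumed
    when none are given.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must match p_values in length and sum to 1")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")

    # Map each p-value to a Cauchy-scale statistic and take the weighted sum.
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))

    # Compare the sum to a standard Cauchy distribution (upper tail).
    return 0.5 - math.atan(t) / math.pi


if __name__ == "__main__":
    # p-values for one taxon from several methods (made-up numbers).
    print(cauchy_combination([0.01, 0.04, 0.20]))
```

A single p-value is returned unchanged by this formula, and very small individual p-values dominate the combined result, which is one reason the test is often described as suited to sparse signals.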
Affiliation(s)
- Hyeonjung Ham
  - Interdisciplinary Program of Bioinformatics, Seoul National University, Seoul, South Korea
- Taesung Park
  - Interdisciplinary Program of Bioinformatics, Seoul National University, Seoul, South Korea
  - Department of Statistics, Seoul National University, Seoul, South Korea
|
14
|
Dai W, Li C, Li T, Hu J, Zhang H. Super-taxon in human microbiome are identified to be associated with colorectal cancer. BMC Bioinformatics 2022; 23:243. [PMID: 35729515 PMCID: PMC9215102 DOI: 10.1186/s12859-022-04786-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/15/2021] [Accepted: 06/06/2022] [Indexed: 01/12/2023]
Abstract
BACKGROUND: Microbial communities in the human body, also known as the human microbiota, affect human health, including diseases such as colorectal cancer (CRC). However, the different roles that microbial communities play in healthy and diseased hosts remain largely unknown. The microbial communities are typically recorded through the taxa counts of operational taxonomic units (OTUs). The sparsity of and high correlations among OTUs pose major challenges for understanding the microbiota-disease relation. Furthermore, the taxa data are structured in the sense that OTUs are related evolutionarily by a hierarchical structure. RESULTS: In this study, we borrow the idea of super-variant from statistical genetics and propose a new concept called super-taxon, which is essentially a combination of taxonomic units, to exploit the hierarchical structure of taxa for microbiome studies. Specifically, we model a genus which consists of a set of OTUs at low hierarchy and is designed to reflect both marginal and joint effects of OTUs associated with the risk of CRC to address these issues. We first demonstrate the power of super-taxon in detecting highly correlated OTUs. Then, we identify CRC-associated OTUs in two publicly available datasets via a discovery-validation procedure. Specifically, four species of two genera are found to be associated with CRC: Parvimonas micra, Parvimonas sp., Peptostreptococcus stomatis, and Peptostreptococcus anaerobius. More importantly, for the first time, we report the joint effect of Parvimonas micra and Parvimonas sp. (p = 0.0084) as well as that of Peptostreptococcus stomatis and Peptostreptococcus anaerobius (p = 8.21e-06) on CRC. The proposed approach provides a novel and useful tool for identifying disease-related microbes by taking the hierarchical structure of taxa into account, and it sheds new light on their potential joint effects as a community in disease development. CONCLUSIONS: Our work shows that the proposed approaches are effective for studying the microbiota-disease relation while taking into account the sparsity and the hierarchical and correlated structure among microbes.
Affiliation(s)
- Wei Dai
  - Department of Biostatistics, Yale University School of Public Health, 300 George Street, Ste 523, New Haven, CT, 06511, USA
- Cai Li
  - Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA
- Ting Li
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong, China
- Jianchang Hu
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
- Heping Zhang
  - Department of Biostatistics, Yale University School of Public Health, 300 George Street, Ste 523, New Haven, CT, 06511, USA
|
15
|
Wu Q, O’Malley J, Datta S, Gharaibeh RZ, Jobin C, Karagas MR, Coker MO, Hoen AG, Christensen BC, Madan JC, Li Z. MarZIC: A Marginal Mediation Model for Zero-Inflated Compositional Mediators with Applications to Microbiome Data. Genes (Basel) 2022; 13:1049. [PMID: 35741811 PMCID: PMC9223163 DOI: 10.3390/genes13061049] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/10/2022] [Revised: 06/06/2022] [Accepted: 06/07/2022] [Indexed: 12/15/2022]
Abstract
BACKGROUND: The human microbiome can contribute to the pathogenesis of many complex diseases by mediating disease-leading causal pathways. However, standard mediation analysis methods are not adequate for analyzing the microbiome as a mediator, because of the excessive number of zero-valued sequencing reads in the data and the constraint that the relative abundances sum to one. The two main challenges raised by the zero-inflated data structure are: (a) disentangling the mediation effect induced by the point mass at zero; and (b) identifying the observed zero-valued data points that are not truly zero (i.e., false zeros). METHODS: We develop a novel marginal mediation analysis method under the potential-outcomes framework to address these issues. We also show that the marginal model can account for the compositional structure of microbiome data. RESULTS: The mediation effect can be decomposed into two components that are inherent to the two-part nature of zero-inflated distributions. With probabilistic models to account for observing zeros, we also address the challenge of false zeros. A comprehensive simulation study and an application to a real microbiome study showcase our approach in comparison with existing approaches. CONCLUSIONS: When the zero-inflated microbiome composition is analyzed as the mediator, the MarZIC approach performs better than standard causal mediation analysis approaches and an existing competing approach.
Affiliation(s)
- Quran Wu
  - Department of Biostatistics, University of Florida, Gainesville, FL 32611, USA
- James O’Malley
  - The Dartmouth Institute, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Susmita Datta
  - Department of Biostatistics, University of Florida, Gainesville, FL 32611, USA
- Raad Z. Gharaibeh
  - Department of Medicine, University of Florida, Gainesville, FL 32611, USA
- Christian Jobin
  - Department of Medicine, University of Florida, Gainesville, FL 32611, USA
- Margaret R. Karagas
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Modupe O. Coker
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Anne G. Hoen
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Brock C. Christensen
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Juliette C. Madan
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Zhigang Li
  - Department of Biostatistics, University of Florida, Gainesville, FL 32611, USA
|
16
|
Czech L, Stamatakis A, Dunthorn M, Barbera P. Metagenomic Analysis Using Phylogenetic Placement-A Review of the First Decade. FRONTIERS IN BIOINFORMATICS 2022; 2:871393. [PMID: 36304302 PMCID: PMC9580882 DOI: 10.3389/fbinf.2022.871393] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 02/08/2022] [Accepted: 04/11/2022] [Indexed: 12/20/2022]
Abstract
Phylogenetic placement refers to a family of tools and methods to analyze, visualize, and interpret the tsunami of metagenomic sequencing data generated by high-throughput sequencing. Compared to alternative (e.g., similarity-based) methods, it puts metabarcoding sequences into a phylogenetic context using a set of known reference sequences and taking evolutionary history into account. Thereby, one can increase the accuracy of metagenomic surveys and eliminate the requirement for having exact or close matches with existing sequence databases. Phylogenetic placement constitutes a valuable analysis tool per se, but also entails a plethora of downstream tools to interpret its results. A common use case is to analyze species communities obtained from metagenomic sequencing, for example via taxonomic assignment, diversity quantification, sample comparison, and identification of correlations with environmental variables. In this review, we provide an overview of the methods developed during the first 10 years. In particular, the goals of this review are 1) to motivate the usage of phylogenetic placement and illustrate some of its use cases, 2) to outline the full workflow, from raw sequences to publishable figures, including best practices, 3) to introduce the most common tools and methods and their capabilities, 4) to point out common placement pitfalls and misconceptions, and 5) to showcase typical placement-based analyses and how they can help to analyze, visualize, and interpret phylogenetic placement data.
Affiliation(s)
- Lucas Czech
  - Department of Plant Biology, Carnegie Institution for Science, Stanford, CA, United States
- Alexandros Stamatakis
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
  - Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Micah Dunthorn
  - Natural History Museum, University of Oslo, Oslo, Norway
|
17
|
Kim SA, Kang N, Park T. Hierarchical Structured Component Analysis for Microbiome Data Using Taxonomy Assignments. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1302-1312. [PMID: 33211665 DOI: 10.1109/tcbb.2020.3039326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
The recent advent of high-throughput sequencing technology has enabled us to study the associations between the human microbiome and diseases. The DNA sequences of microbiome samples are clustered into operational taxonomic units (OTUs) according to their similarity. The OTU table, which contains the counts of OTUs present in each sample, is used to measure correlations between OTUs and disease status and to find key microbes for predicting disease status. Various statistical methods have been proposed for such microbiome data analysis. However, none of these methods reflects the hierarchy of taxonomy information. In this paper, we propose a hierarchical structural component model for microbiome data (HisCoM-microb) that uses taxonomy information as well as OTU table data. The proposed HisCoM-microb consists of two layers: one for OTUs and the other for taxa at the higher taxonomy level. We then simultaneously calculate coefficient estimates for the OTUs and taxa of the two layers in the hierarchical model. Through this analysis, we can infer the association between taxa or OTUs and disease status, considering the impact of taxonomic structure on disease status. Both a simulation study and real microbiome data analysis show that HisCoM-microb can successfully reveal the relations between each taxon and disease status and identify the key OTUs of the disease at the same time.
|
18
|
Mallick H, Rahnavard A, McIver LJ, Ma S, Zhang Y, Nguyen LH, Tickle TL, Weingart G, Ren B, Schwager EH, Chatterjee S, Thompson KN, Wilkinson JE, Subramanian A, Lu Y, Waldron L, Paulson JN, Franzosa EA, Bravo HC, Huttenhower C. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput Biol 2021; 17:e1009442. [PMID: 34784344 PMCID: PMC8714082 DOI: 10.1371/journal.pcbi.1009442] [Citation(s) in RCA: 785] [Impact Index Per Article: 261.7] [Received: 08/05/2021] [Revised: 12/28/2021] [Accepted: 09/09/2021] [Indexed: 12/13/2022]
Abstract
It is challenging to associate features such as human health outcomes, diet, environmental conditions, or other metadata to microbial community measurements, due in part to their quantitative properties. Microbiome multi-omics are typically noisy, sparse (zero-inflated), high-dimensional, extremely non-normal, and often in the form of count or compositional measurements. Here we introduce an optimized combination of novel and established methodology to assess multivariable association of microbial community features with complex metadata in population-scale observational studies. Our approach, MaAsLin 2 (Microbiome Multivariable Associations with Linear Models), uses generalized linear and mixed models to accommodate a wide variety of modern epidemiological studies, including cross-sectional and longitudinal designs, as well as a variety of data types (e.g., counts and relative abundances) with or without covariates and repeated measurements. To construct this method, we conducted a large-scale evaluation of a broad range of scenarios under which straightforward identification of meta-omics associations can be challenging. These simulation studies reveal that MaAsLin 2's linear model preserves statistical power in the presence of repeated measures and multiple covariates, while accounting for the nuances of meta-omics features and controlling false discovery. We also applied MaAsLin 2 to a microbial multi-omics dataset from the Integrative Human Microbiome (HMP2) project which, in addition to reproducing established results, revealed a unique, integrated landscape of inflammatory bowel diseases (IBD) across multiple time points and omics profiles.
Affiliation(s)
- Himel Mallick
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - The Broad Institute, Cambridge, Massachusetts, United States of America
- Ali Rahnavard
  - Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington DC, United States of America
- Lauren J. McIver
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - The Broad Institute, Cambridge, Massachusetts, United States of America
- Siyuan Ma
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - The Broad Institute, Cambridge, Massachusetts, United States of America
- Yancong Zhang
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - The Broad Institute, Cambridge, Massachusetts, United States of America
- Long H. Nguyen
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - Clinical and Translational Epidemiology Unit, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, United States of America
  - Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, United States of America
- Timothy L. Tickle
  - The Broad Institute, Cambridge, Massachusetts, United States of America
- George Weingart
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - The Broad Institute, Cambridge, Massachusetts, United States of America
- Boyu Ren
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - The Broad Institute, Cambridge, Massachusetts, United States of America
- Emma H. Schwager
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - The Broad Institute, Cambridge, Massachusetts, United States of America
- Suvo Chatterjee
  - Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
- Kelsey N. Thompson
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Jeremy E. Wilkinson
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Ayshwarya Subramanian
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - The Broad Institute, Cambridge, Massachusetts, United States of America
- Yiren Lu
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Levi Waldron
  - Department of Epidemiology and Biostatistics, CUNY School of Public Health, New York City, New York, United States of America
- Joseph N. Paulson
  - Department of Biostatistics, Product Development, Genentech, Inc., South San Francisco, California, United States of America
- Eric A. Franzosa
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - The Broad Institute, Cambridge, Massachusetts, United States of America
- Hector Corrada Bravo
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Curtis Huttenhower
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - The Broad Institute, Cambridge, Massachusetts, United States of America
|
19
|
Mallick H, Rahnavard A, McIver LJ, Ma S, Zhang Y, Nguyen LH, Tickle TL, Weingart G, Ren B, Schwager EH, Chatterjee S, Thompson KN, Wilkinson JE, Subramanian A, Lu Y, Waldron L, Paulson JN, Franzosa EA, Bravo HC, Huttenhower C. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput Biol 2021. [PMID: 34784344 DOI: 10.1101/2021.01.20.427420v1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 05/07/2023]
|
20
|
Liu T, Xu P, Du Y, Lu H, Zhao H, Wang T. MZINBVA: variational approximation for multilevel zero-inflated negative-binomial models for association analysis in microbiome surveys. Brief Bioinform 2021; 23:6409694. [PMID: 34718406 DOI: 10.1093/bib/bbab443] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/21/2021] [Revised: 09/11/2021] [Accepted: 09/28/2021] [Indexed: 01/02/2023] Open
Abstract
As our understanding of the microbiome has expanded, so has the recognition of its critical role in human health and disease, thereby emphasizing the importance of testing whether microbes are associated with environmental factors or clinical outcomes. However, many of the fundamental challenges that concern microbiome surveys arise from statistical and experimental design issues, such as the sparse and overdispersed nature of microbiome count data and the complex correlation structure among samples. For example, in the human microbiome project (HMP) dataset, the repeated observations across time points (level 1) are nested within body sites (level 2), which are further nested within subjects (level 3). Therefore, there is a great need for the development of specialized and sophisticated statistical tests. In this paper, we propose multilevel zero-inflated negative-binomial models for association analysis in microbiome surveys. We develop a variational approximation method for maximum likelihood estimation and inference. It uses optimization, rather than sampling, to approximate the log-likelihood and compute parameter estimates, provides a robust estimate of the covariance of parameter estimates and constructs a Wald-type test statistic for association testing. We evaluate and demonstrate the performance of our method using extensive simulation studies and an application to the HMP dataset. We have developed an R package MZINBVA to implement the proposed method, which is available from the GitHub repository https://github.com/liudoubletian/MZINBVA.
Affiliation(s)
- Tiantian Liu
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan RD, 200240, Shanghai, China
- Peirong Xu
  - Department of Breast Surgery, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 200127, Shanghai, China
- Yueyao Du
  - Department of Biostatistics, Yale University, 60 College Street, CT 06520, New Haven, USA
  - MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, 800 Dongchuan RD, 200240, Shanghai, China
- Hui Lu
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan RD, 200240, Shanghai, China
- Hongyu Zhao
  - Department of Biostatistics, Yale University, 60 College Street, CT 06520, New Haven, USA
- Tao Wang
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan RD, 200240, Shanghai, China
|
21
|
Jiang R, Li WV, Li JJ. mbImpute: an accurate and robust imputation method for microbiome data. Genome Biol 2021; 22:192. [PMID: 34183041 PMCID: PMC8240317 DOI: 10.1186/s13059-021-02400-4] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 02/12/2021] [Accepted: 06/04/2021] [Indexed: 12/22/2022] Open
Abstract
A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability of scientific discoveries. To address this issue, we propose the first imputation method for microbiome data, mbImpute, to identify and recover likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. We demonstrate that mbImpute improves the power of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer, and mbImpute preserves non-zero distributions of taxa abundances.
Affiliation(s)
- Ruochen Jiang
  - Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
- Wei Vivian Li
  - Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Piscataway, 08854, NJ, USA
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, 90095-7088, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, 90095-1766, CA, USA
  - Department of Biostatistics, University of California, Los Angeles, 90095-1772, CA, USA
|
22
|
Han Y, Baker C, Vogtmann E, Hua X, Shi J, Liu D. Modeling Longitudinal Microbiome Compositional Data: A Two-Part Linear Mixed Model with Shared Random Effects. Statistics in Biosciences 2021. [DOI: 10.1007/s12561-021-09302-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/24/2022]
|
23
|
Weiser R, Rye PD, Mahenthiralingam E. Implementation of microbiota analysis in clinical trials for cystic fibrosis lung infection: Experience from the OligoG phase 2b clinical trials. J Microbiol Methods 2021; 181:106133. [PMID: 33421446 DOI: 10.1016/j.mimet.2021.106133] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/02/2020] [Revised: 01/02/2021] [Accepted: 01/04/2021] [Indexed: 11/28/2022]
Abstract
Culture-independent microbiota analysis is widely used in research and being increasingly used in translational studies. However, methods for standardisation and application of these analyses in clinical trials are limited. Here we report the microbiota analysis that accompanied two phase 2b clinical trials of the novel, non-antibiotic therapy OligoG CF-5/20 for cystic fibrosis (CF) lung infection. Standardised protocols (DNA extraction, PCR, qPCR and 16S rRNA gene sequencing analysis) were developed for application to the Pseudomonas aeruginosa (NCT02157922) and Burkholderia cepacia complex (NCT02453789) clinical trials involving 45 and 13 adult trial participants, respectively. Microbiota analysis identified that paired sputum samples from an individual participant, taken within 2 h of each other, had reproducible bacterial diversity profiles. Although culture microbiology had identified patients as either colonised by P. aeruginosa or B. cepacia complex species at recruitment, microbiota analysis revealed patient lung infection communities were not always dominated by these key CF pathogens. Microbiota profiles were patient-specific and remained stable over the course of both clinical trials (6 sampling points over the course of 140 days). Within the Burkholderia trial, participants were infected with B. cenocepacia (n = 4), B. multivorans (n = 6), or an undetermined species (n = 3). Colonisation with either B. cenocepacia or B. multivorans influenced the overall bacterial community structure in sputum. Overall, we have shown that sputum microbiota in adults with CF is stable over a 2 h time-frame, suggesting collection of a single sample on a collection day is sufficient to capture the microbiota diversity. Despite the uniform pathogen culture-positivity status at recruitment, trial participants were highly heterogeneous in their lung microbiota. 
Understanding the microbiota profiles of individuals with CF ahead of future clinical trials would be beneficial in the context of patient stratification and trial design.
Affiliation(s)
- Rebecca Weiser
  - Microbiomes, Microbes and Informatics Group, Organisms and Environment Division, School of Biosciences, Cardiff University, The Sir Martin Evans Building, Museum Avenue, Cardiff, Wales, CF10 3AX, UK
- Philip D Rye
  - AlgiPharma AS, Industriveien 33, N-1337, Sandvika, Norway
- Eshwar Mahenthiralingam
  - Microbiomes, Microbes and Informatics Group, Organisms and Environment Division, School of Biosciences, Cardiff University, The Sir Martin Evans Building, Museum Avenue, Cardiff, Wales, CF10 3AX, UK
|
24
|
Norouzi-Beirami MH, Marashi SA, Banaei-Moghaddam AM, Kavousi K. CAMAMED: a pipeline for composition-aware mapping-based analysis of metagenomic data. NAR Genom Bioinform 2021; 3:lqaa107. [PMID: 33575649 PMCID: PMC7787360 DOI: 10.1093/nargab/lqaa107] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/01/2019] [Revised: 10/29/2020] [Accepted: 12/28/2020] [Indexed: 12/13/2022] Open
Abstract
Metagenomics is the study of genomic DNA recovered from a microbial community. Both assembly-based and mapping-based methods have been used to analyze metagenomic data. When appropriate gene catalogs are available, mapping-based methods are preferred over assembly-based approaches, especially for analyzing the data at the functional level. In this study, we introduce CAMAMED as a composition-aware mapping-based metagenomic data analysis pipeline. This pipeline can analyze metagenomic samples at both taxonomic and functional profiling levels. Using this pipeline, metagenome sequences can be mapped to non-redundant gene catalogs and the gene frequency in the samples is obtained. Due to the highly compositional nature of metagenomic data, the cumulative sum-scaling method is used at both taxa and gene levels for compositional data analysis in our pipeline. Additionally, by mapping the genes to the KEGG database, annotations related to each gene can be extracted at different functional levels such as KEGG ortholog groups, enzyme commission numbers and reactions. Furthermore, the pipeline enables the user to identify potential biomarkers in case-control metagenomic samples by investigating functional differences. The source code for this software is available from https://github.com/mhnb/camamed. Ready-to-use Docker images are also available at https://hub.docker.com.
Affiliation(s)
- Mohammad H Norouzi-Beirami
  - Laboratory of Complex Biological systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran 1417614335, Iran
- Sayed-Amir Marashi
  - Department of Biotechnology, College of Science, University of Tehran, Tehran 1417614411, Iran
- Ali M Banaei-Moghaddam
  - Laboratory of Genomics and Epigenomics (LGE), Department of Biochemistry, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran 1417614335, Iran
- Kaveh Kavousi
  - Laboratory of Complex Biological systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran 1417614335, Iran
|
25
|
Silverman JD, Roche K, Mukherjee S, David LA. Naught all zeros in sequence count data are the same. Comput Struct Biotechnol J 2020; 18:2789-2798. [PMID: 33101615 PMCID: PMC7568192 DOI: 10.1016/j.csbj.2020.09.014] [Citation(s) in RCA: 68] [Impact Index Per Article: 17.0] [Received: 06/26/2020] [Revised: 09/09/2020] [Accepted: 09/10/2020] [Indexed: 12/21/2022] Open
Abstract
Genomic studies feature multivariate count data from high-throughput DNA sequencing experiments, which often contain many zero values. These zeros can cause artifacts for statistical analyses and multiple modeling approaches have been developed in response. Here, we apply different zero-handling models to gene-expression and microbiome datasets and show models can disagree substantially in terms of identifying the most differentially expressed sequences. Next, to rationally examine how different zero handling models behave, we developed a conceptual framework outlining four types of processes that may give rise to zero values in sequence count data. Last, we performed simulations to test how zero handling models behave in the presence of these different zero generating processes. Our simulations showed that simple count models are sufficient across multiple processes, even when the true underlying process is unknown. On the other hand, a common zero handling technique known as "zero-inflation" was only suitable under a zero generating process associated with an unlikely set of biological and experimental conditions. In concert, our work here suggests several specific guidelines for developing and choosing state-of-the-art models for analyzing sparse sequence count data.
Affiliation(s)
- Justin D Silverman
  - College of Information Science and Technology, Pennsylvania State University, State College, PA 16802, United States
  - Institute for Computational and Data Science, Pennsylvania State University, State College, PA 16802, United States
  - Department of Medicine, Pennsylvania State University, Hershey, PA 17033, United States
- Kimberly Roche
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States
- Sayan Mukherjee
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States
  - Departments of Statistical Science, Mathematics, Computer Science, Biostatistics & Bioinformatics, Duke University, Durham, NC 27708, United States
  - Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, United States
- Lawrence A David
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States
  - Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, United States
  - Department of Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, United States
|
26
|
Nolan-Kenney R, Wu F, Hu J, Yang L, Kelly D, Li H, Jasmine F, Kibriya MG, Parvez F, Shaheen I, Sarwar G, Ahmed A, Eunus M, Islam T, Pei Z, Ahsan H, Chen Y. The Association Between Smoking and Gut Microbiome in Bangladesh. Nicotine Tob Res 2020; 22:1339-1346. [PMID: 31794002 PMCID: PMC7364824 DOI: 10.1093/ntr/ntz220] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 06/03/2019] [Accepted: 12/02/2019] [Indexed: 12/11/2022]
Abstract
INTRODUCTION: Epidemiological studies that investigate alterations in the gut microbial composition associated with smoking are lacking. This study examined the composition of the gut microbiome in smokers compared with nonsmokers. AIMS AND METHODS: Stool samples were collected in a cross-sectional study of 249 participants selected from the Health Effects of Arsenic Longitudinal Study in Bangladesh. Microbial DNA was extracted from the fecal samples and sequenced by 16S rRNA gene sequencing. The associations of smoking status and intensity of smoking with the relative abundance or the absence and presence of individual bacterial taxa from phylum to genus levels were examined. RESULTS: The relative abundance of bacterial taxa along the Erysipelotrichi-to-Catenibacterium lineage was significantly higher in current smokers compared to never-smokers. The odds ratio comparing the mean relative abundance in current smokers with that in never-smokers was 1.91 (95% confidence interval = 1.36-2.69) for the genus Catenibacterium and 1.89 (95% confidence interval = 1.39-2.56) for the family Erysipelotrichaceae, the order Erysipelotrichales, and the class Erysipelotrichi (false discovery rate-adjusted p values = .0008-.01). A dose-response association was observed for each of these bacterial taxa. The presence of Alphaproteobacteria was significantly greater comparing current with never-smokers (odds ratio = 4.85, false discovery rate-adjusted p value = .04). CONCLUSIONS: Our data in a Bangladeshi population are consistent with evidence of an association between smoking status and dosage with change in the gut bacterial composition. IMPLICATIONS: This study for the first time examined the relationship between smoking and the gut microbiome composition. The data suggest that smoking status may play an important role in the composition of the gut microbiome, especially among individuals with higher levels of tobacco exposure.
Affiliation(s)
- Rachel Nolan-Kenney
  - Department of Population Health, New York University School of Medicine, New York, NY
  - Department of Environmental Medicine, New York University School of Medicine, New York, NY
- Fen Wu
  - Department of Population Health, New York University School of Medicine, New York, NY
  - Department of Environmental Medicine, New York University School of Medicine, New York, NY
- Jiyuan Hu
  - Department of Population Health, New York University School of Medicine, New York, NY
  - Department of Environmental Medicine, New York University School of Medicine, New York, NY
- Liying Yang
  - Department of Pathology, New York University School of Medicine, New York, NY
  - Department of Medicine, New York University School of Medicine, New York, NY
  - The Department of Veterans Affairs New York Harbor Healthcare System, New York, NY
- Dervla Kelly
  - Health Research Institute, Graduate Entry Medical School, University of Limerick, Limerick, Ireland
- Huilin Li
  - Department of Population Health, New York University School of Medicine, New York, NY
  - Department of Environmental Medicine, New York University School of Medicine, New York, NY
- Farzana Jasmine
  - Department of Public Health Sciences, Institute for Population and Precision Health, The University of Chicago, Chicago, IL
- Muhammad G Kibriya
  - Department of Public Health Sciences, Institute for Population and Precision Health, The University of Chicago, Chicago, IL
- Faruque Parvez
  - Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY
- Ishrat Shaheen
  - Department of Informatics, U-Chicago Research Bangladesh, Ltd., Dhaka, Bangladesh
- Golam Sarwar
  - Department of Informatics, U-Chicago Research Bangladesh, Ltd., Dhaka, Bangladesh
- Alauddin Ahmed
  - Department of Informatics, U-Chicago Research Bangladesh, Ltd., Dhaka, Bangladesh
- Mahbub Eunus
  - Department of Informatics, U-Chicago Research Bangladesh, Ltd., Dhaka, Bangladesh
- Tariqul Islam
  - Department of Health, Research & Training, U-Chicago Research Bangladesh, Ltd., Dhaka, Bangladesh
- Zhiheng Pei
  - Department of Pathology, New York University School of Medicine, New York, NY
  - Department of Medicine, New York University School of Medicine, New York, NY
  - The Department of Veterans Affairs New York Harbor Healthcare System, New York, NY
- Habibul Ahsan
  - Department of Public Health Sciences, Institute for Population and Precision Health, The University of Chicago, Chicago, IL
- Yu Chen
  - Department of Population Health, New York University School of Medicine, New York, NY
  - Department of Environmental Medicine, New York University School of Medicine, New York, NY
|
27
|
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach.
Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
  - Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States
|
28
|
Eetemadi A, Rai N, Pereira BMP, Kim M, Schmitz H, Tagkopoulos I. The Computational Diet: A Review of Computational Methods Across Diet, Microbiome, and Health. Front Microbiol 2020; 11:393. [PMID: 32318028 PMCID: PMC7146706 DOI: 10.3389/fmicb.2020.00393] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 10/23/2019] [Accepted: 02/26/2020] [Indexed: 12/12/2022] Open
Abstract
Food and human health are inextricably linked. As such, revolutionary impacts on health have been derived from advances in the production and distribution of food relating to food safety and fortification with micronutrients. During the past two decades, it has become apparent that the human microbiome has the potential to modulate health, including in ways that may be related to diet and the composition of specific foods. Despite the excitement and potential surrounding this area, the complexity of the gut microbiome, the chemical composition of food, and their interplay in situ remains a daunting task to fully understand. However, recent advances in high-throughput sequencing, metabolomics profiling, compositional analysis of food, and the emergence of electronic health records provide new sources of data that can contribute to addressing this challenge. Computational science will play an essential role in this effort as it will provide the foundation to integrate these data layers and derive insights capable of revealing and understanding the complex interactions between diet, gut microbiome, and health. Here, we review the current knowledge on diet-health-gut microbiota, relevant data sources, bioinformatics tools, machine learning capabilities, as well as the intellectual property and legislative regulatory landscape. We provide guidance on employing machine learning and data analytics, identify gaps in current methods, and describe new scenarios to be unlocked in the next few years in the context of current knowledge.
Affiliation(s)
- Ameen Eetemadi
  - Department of Computer Science, University of California, Davis, Davis, CA, United States
  - Genome Center, University of California, Davis, Davis, CA, United States
- Navneet Rai
  - Genome Center, University of California, Davis, Davis, CA, United States
- Beatriz Merchel Piovesan Pereira
  - Genome Center, University of California, Davis, Davis, CA, United States
  - Department of Microbiology, University of California, Davis, Davis, CA, United States
- Minseung Kim
  - Department of Computer Science, University of California, Davis, Davis, CA, United States
  - Genome Center, University of California, Davis, Davis, CA, United States
  - Process Integration and Predictive Analytics (PIPA LLC), Davis, CA, United States
- Harold Schmitz
  - Graduate School of Management, University of California, Davis, Davis, CA, United States
- Ilias Tagkopoulos
  - Department of Computer Science, University of California, Davis, Davis, CA, United States
  - Genome Center, University of California, Davis, Davis, CA, United States
  - Process Integration and Predictive Analytics (PIPA LLC), Davis, CA, United States
|
29
|
Martin BD, Witten D, Willis AD. Modeling Microbial Abundances and Dysbiosis with Beta-Binomial Regression. Ann Appl Stat 2020; 14:94-115. [PMID: 32983313 PMCID: PMC7514055 DOI: 10.1214/19-aoas1283] [Citation(s) in RCA: 168] [Impact Index Per Article: 42.0] [Indexed: 12/15/2022]
Abstract
Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon's relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon's relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon's counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.
Affiliation(s)
- Daniela Witten
  - Departments of Statistics and Biostatistics, University of Washington
- Amy D Willis
  - Department of Biostatistics, University of Washington
|
30
|
Meier R, Thompson JA. A Bayesian framework for identifying consistent patterns of microbial abundance between body sites. Stat Appl Genet Mol Biol 2019; 18(6). [PMID: 31702998 PMCID: PMC7944583 DOI: 10.1515/sagmb-2019-0027] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 01/10/2023]
Abstract
Recent studies have found that the microbiomes of both the gut and the mouth are associated with diseases of the gut, including cancer. If resident microbes could be found to exhibit consistent patterns between the mouth and gut, disease status could potentially be assessed non-invasively through profiling of oral samples. Currently, there exists no generally applicable method to test for such associations. Here we present a Bayesian framework to identify microbes that exhibit consistent patterns between body sites, with respect to a phenotypic variable. For a given operational taxonomic unit (OTU), a Bayesian regression model is used to obtain Markov chain Monte Carlo estimates of abundance among strata, calculate a correlation statistic, and conduct a formal test based on its posterior distribution. Extensive simulation studies demonstrate overall viability of the approach, and provide information on what factors affect its performance. Applying our method to a dataset containing oral and gut microbiome samples from 77 pancreatic cancer patients revealed several OTUs exhibiting consistent patterns between gut and mouth with respect to disease subtype. Our method is well powered for modest sample sizes and moderate strength of association and can be flexibly extended to other research settings using any currently established Bayesian analysis programs.
Affiliation(s)
- Richard Meier
  - Department of Biostatistics & Data Science, University of Kansas Medical Center, 3901 Rainbow Blvd, Kansas City, KS 66160
- Jeffrey A. Thompson
  - Department of Biostatistics & Data Science, University of Kansas Medical Center, 3901 Rainbow Blvd, Kansas City, KS 66160
|
31
|
Ai D, Pan H, Li X, Gao Y, Liu G, Xia LC. Identifying Gut Microbiota Associated With Colorectal Cancer Using a Zero-Inflated Lognormal Model. Front Microbiol 2019; 10:826. [PMID: 31068913 PMCID: PMC6491826 DOI: 10.3389/fmicb.2019.00826] [Citation(s) in RCA: 94] [Impact Index Per Article: 18.8] [Received: 01/15/2019] [Accepted: 04/01/2019] [Indexed: 12/26/2022] Open
Abstract
Colorectal cancer (CRC) is the third most common cancer worldwide. Its incidence is still increasing, and the mortality rate is high. New therapeutic and prognostic strategies are urgently needed. It has become increasingly recognized that the gut microbiota composition differs significantly between healthy people and CRC patients. Thus, identifying the difference between the gut microbiota of healthy people and CRC patients is fundamental to understanding these microbes' functional roles in the development of CRC. We studied the microbial community structure of a CRC metagenomic dataset of 156 patients and healthy controls, and analyzed the diversity, differentially abundant bacteria, and co-occurrence networks. We applied a modified zero-inflated lognormal (ZIL) model for estimating the relative abundance. We found that the abundance of the genera Anaerostipes, Bilophila, Catenibacterium, Coprococcus, Desulfovibrio, Flavonifractor, Porphyromonas, Pseudoflavonifractor, and Weissella was significantly different between the healthy and CRC groups. We also found that bacteria such as Streptococcus, Parvimonas, Collinsella, and Citrobacter were uniquely co-occurring within the CRC patients. In addition, we found that the microbial diversity of healthy controls is significantly higher than that of the CRC patients, which indicated a significant negative correlation between gut microbiota diversity and the stage of CRC. Collectively, our results strengthened the view that individual microbes as well as the overall structure of gut microbiota were co-evolving with CRC.
Affiliation(s)
- Dongmei Ai
- Basic Experimental of Natural Science, University of Science and Technology Beijing, Beijing, China
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
| | - Hongfei Pan
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
| | - Xiaoxin Li
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
| | - Yingxin Gao
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
| | - Gang Liu
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
| | - Li C Xia
- Department of Medicine, Stanford University School of Medicine, Stanford, CA, United States
|
32
|
The Supragingival Biofilm in Early Childhood Caries: Clinical and Laboratory Protocols and Bioinformatics Pipelines Supporting Metagenomics, Metatranscriptomics, and Metabolomics Studies of the Oral Microbiome. Methods Mol Biol 2019; 1922:525-548. [PMID: 30838598 DOI: 10.1007/978-1-4939-9012-2_40] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 12/04/2022]
Abstract
Early childhood caries (ECC) is a biofilm-mediated disease. Social, environmental, and behavioral determinants as well as innate susceptibility are major influences on its incidence; however, from a pathogenetic standpoint, the disease is defined and driven by oral dysbiosis. In other words, the disease occurs when the natural equilibrium between the host and its oral microbiome shifts toward states that promote demineralization at the biofilm-tooth surface interface. Thus, a comprehensive understanding of dental caries as a disease requires the characterization of both the composition and the function or metabolic activity of the supragingival biofilm according to well-defined clinical statuses. However, taxonomic and functional information of the supragingival biofilm is rarely available in clinical cohorts, and its collection presents unique challenges among very young children. This paper presents a protocol and pipelines available for the conduct of supragingival biofilm microbiome studies among children in the primary dentition, that has been designed in the context of a large-scale population-based genetic epidemiologic study of ECC. The protocol is being developed for the collection of two supragingival biofilm samples from the maxillary primary dentition, enabling downstream taxonomic (e.g., metagenomics) and functional (e.g., transcriptomics and metabolomics) analyses. The protocol is being implemented in the assembly of a pediatric precision medicine cohort comprising over 6000 participants to date, contributing social, environmental, behavioral, clinical, and biological data informing ECC and other oral health outcomes.
|
33
|
Jonsson V, Österlund T, Nerman O, Kristiansson E. Modelling of zero-inflation improves inference of metagenomic gene count data. Stat Methods Med Res 2018; 28:3712-3728. [PMID: 30474490 DOI: 10.1177/0962280218811354] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/17/2022]
Abstract
Metagenomics enables the study of gene abundances in complex mixtures of microorganisms and has become a standard methodology for the analysis of the human microbiome. However, gene abundance data is inherently noisy and contains high levels of biological and technical variability as well as an excess of zeros due to non-detected genes. This makes the statistical analysis challenging. In this study, we present a new hierarchical Bayesian model for inference of metagenomic gene abundance data. The model uses a zero-inflated overdispersed Poisson distribution which is able to simultaneously capture the high gene-specific variability as well as zero observations in the data. By analysis of three comprehensive datasets, we show that zero-inflation is common in metagenomic data from the human gut and, if not correctly modelled, it can lead to substantial reductions in statistical power. We also show, by using resampled metagenomic data, that our model has, compared to other methods, a higher and more stable performance for detecting differentially abundant genes. We conclude that proper modelling of the gene-specific variability, including the excess of zeros, is necessary to accurately describe gene abundances in metagenomic data. The proposed model will thus pave the way for new biological insights into the structure of microbial communities.
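For readers unfamiliar with zero-inflation, the following is a minimal sketch of a plain zero-inflated Poisson log-probability. It is not the authors' hierarchical Bayesian model, and it omits the overdispersion component that the abstract mentions. The function name `zip_logpmf` and the parameters `lam` (Poisson rate) and `pi` (probability of an excess zero) are illustrative assumptions, not taken from the paper.

```python
import math


def zip_logpmf(k, lam, pi):
    """Log-probability of count k under a zero-inflated Poisson.

    Assumed parameterization (not from the paper): with probability pi the
    count is a structural zero; otherwise it is drawn from Poisson(lam).
    """
    if k < 0 or lam <= 0 or not (0.0 <= pi < 1.0):
        raise ValueError("need k >= 0, lam > 0 and 0 <= pi < 1")
    if k == 0:
        # A zero can come from either the excess-zero part or the Poisson part.
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    # Positive counts can only come from the Poisson part.
    log_poisson = -lam + k * math.log(lam) - math.lgamma(k + 1)
    return math.log(1.0 - pi) + log_poisson
```

With `pi = 0.3` and `lam = 2`, the probability of a zero is 0.3 + 0.7·exp(-2), which is larger than the exp(-2) a plain Poisson with the same rate would assign.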
Affiliation(s)
- Viktor Jonsson
- Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden.,Computational Systems Biology, Chalmers University of Technology, Gothenburg, Sweden
| | - Tobias Österlund
- Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
| | - Olle Nerman
- Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
| | - Erik Kristiansson
- Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
|
34
|
Zhang X, Pei YF, Zhang L, Guo B, Pendegraft AH, Zhuang W, Yi N. Negative Binomial Mixed Models for Analyzing Longitudinal Microbiome Data. Front Microbiol 2018; 9:1683. [PMID: 30093893 PMCID: PMC6070621 DOI: 10.3389/fmicb.2018.01683] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 10/27/2017] [Accepted: 07/06/2018] [Indexed: 01/09/2023]
Abstract
Metagenomic sequencing data provide valuable resources for investigating the associations between the microbiome and host environmental/clinical factors and the dynamic changes of microbial abundance over time. The distinct properties of microbiome measurements include varied total sequence reads across samples, over-dispersion and zero-inflation. Additionally, microbiome studies usually collect samples longitudinally, which introduces time-dependent and correlation structures among the samples and thus further complicates the analysis and interpretation of microbiome count data. In this article, we propose negative binomial mixed models (NBMMs) for longitudinal microbiome studies. The proposed NBMMs can efficiently handle over-dispersion and varying total reads, and can account for the dynamic trend and correlation among longitudinal samples. We develop an efficient and stable algorithm to fit the NBMMs. We evaluate and demonstrate the NBMMs method via extensive simulation studies and application to longitudinal microbiome data. The results show that the proposed method has desirable properties and outperforms the previously used methods in terms of a flexible framework for modeling correlation structures and detecting dynamic effects. We have developed an R package NBZIMM to implement the proposed method, which is freely available from the public GitHub repository http://github.com/nyiuab/NBZIMM and provides a useful tool for analyzing longitudinal microbiome data.
Affiliation(s)
- Xinyan Zhang
- Department of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern University, Statesboro, GA, United States
| | - Yu-Fang Pei
- Department of Epidemiology and Health Statistics, School of Public Health, Medical College of Soochow University, Suzhou, China
| | - Lei Zhang
- Department of Epidemiology and Health Statistics, School of Public Health, Medical College of Soochow University, Suzhou, China
| | - Boyi Guo
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Amanda H Pendegraft
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Wenzhuo Zhuang
- Department of Cell Biology, School of Biology & Basic Medical Science, Soochow University, Suzhou, China
| | - Nengjun Yi
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, United States
|
35
|
Hu J, Koh H, He L, Liu M, Blaser MJ, Li H. A two-stage microbial association mapping framework with advanced FDR control. Microbiome 2018; 6:131. [PMID: 30045760 PMCID: PMC6060480 DOI: 10.1186/s40168-018-0517-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 02/23/2018] [Accepted: 07/11/2018] [Indexed: 05/31/2023]
Abstract
BACKGROUND In microbiome studies, it is important to detect taxa which are associated with pathological outcomes at the lowest definable taxonomic rank, such as genus or species. Traditionally, taxa at the target rank are tested for individual association, followed by the Benjamini-Hochberg (BH) procedure to control for false discovery rate (FDR). However, this approach neglects the dependence structure among taxa and may lead to conservative results. The taxonomic tree of microbiome data represents alignment from phylum to species rank and characterizes evolutionary relationships across microbial taxa. Taxa that are closer on the tree usually have similar responses to the exposure (environment). The statistical power in microbial association tests can be enhanced by efficiently employing the prior evolutionary information via the taxonomic tree. METHODS We propose a two-stage microbial association mapping framework (massMap) which uses grouping information from the taxonomic tree to strengthen statistical power in association tests at the target rank. massMap first screens the association of taxonomic groups at a pre-selected higher taxonomic rank using a powerful microbial group test OMiAT. The method then proceeds to test the association for each candidate taxon at the target rank within the significant taxonomic groups identified in the first stage. Hierarchical BH (HBH) and selected subset testing (SST) procedures are evaluated to control the FDR for the two-stage structured tests. RESULTS Our simulations show that massMap incorporating OMiAT and the advanced FDR controlling methodologies largely alleviates the multiplicity issue. It is statistically more powerful than the traditional association mapping directly at the target rank while controlling the FDR at desired levels under most scenarios. In our real data analyses, massMap detects as many or more associated species, with smaller adjusted p values, compared to the traditional method, which further illustrates the efficiency of the proposed framework. The R package of massMap is publicly available at https://sites.google.com/site/huilinli09/software and https://github.com/JiyuanHu/. CONCLUSIONS massMap is a novel microbial association mapping framework and achieves additional efficiency by utilizing the intrinsic taxonomic structure of microbiome data.
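The traditional baseline mentioned in the abstract, the plain Benjamini-Hochberg step-up procedure, can be sketched as follows. This is only the single-stage BH rule, not massMap's hierarchical BH or selected subset testing; the function name `benjamini_hochberg` and its signature are illustrative assumptions.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Plain BH step-up rule: sort the m p-values, find the largest rank k
    with p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Largest 1-based rank whose p-value falls under its BH threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max, in original order.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value further down the sorted list meets its threshold.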
Affiliation(s)
- Jiyuan Hu
- Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY 10016 USA
- Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, 200433 China
| | - Hyunwook Koh
- Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY 10016 USA
| | - Linchen He
- Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY 10016 USA
| | - Menghan Liu
- Department of Medicine, New York University School of Medicine, New York, NY 10016 USA
| | - Martin J. Blaser
- Department of Medicine, New York University School of Medicine, New York, NY 10016 USA
| | - Huilin Li
- Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY 10016 USA
|
36
|
Li Z, Lee K, Karagas MR, Madan JC, Hoen AG, O'Malley AJ, Li H. Conditional Regression Based on a Multivariate Zero-Inflated Logistic-Normal Model for Microbiome Relative Abundance Data. Statistics in Biosciences 2018; 10:587-608. [PMID: 30923584 DOI: 10.1007/s12561-018-9219-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 02/08/2023]
Abstract
The human microbiome plays critical roles in human health and has been linked to many diseases. While advanced sequencing technologies can characterize the composition of the microbiome in unprecedented detail, it remains challenging to disentangle the complex interplay between human microbiome and disease risk factors due to the complicated nature of microbiome data. Excessive numbers of zero values, high dimensionality, the hierarchical phylogenetic tree and compositional structure are compounded and consequently make existing methods inadequate to appropriately address these issues. We propose a multivariate two-part zero-inflated logistic normal (MZILN) model to analyze the association of disease risk factors with individual microbial taxa and overall microbial community composition. This approach can naturally handle excessive numbers of zeros and the compositional data structure with the discrete part and the logistic-normal part of the model. For parameter estimation, an estimating equations approach is employed that enables us to address the complex inter-taxa correlation structure induced by the hierarchical phylogenetic tree structure and the compositional data structure. This model is able to incorporate standard regularization approaches to deal with high dimensionality. Simulation shows that our model outperforms existing methods. Our approach is also compared to others using the analysis of real microbiome data.
Affiliation(s)
- Zhigang Li
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA.,Children's Environmental Health and Disease Prevention Research Center at Dartmouth, Hanover, New Hampshire.,Department of Epidemiology, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA.,Department of Biostatistics, University of Florida, Gainesville, FL 32611, USA
| | | | - Margaret R Karagas
- Children's Environmental Health and Disease Prevention Research Center at Dartmouth, Hanover, New Hampshire.,Department of Epidemiology, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
| | - Juliette C Madan
- Children's Environmental Health and Disease Prevention Research Center at Dartmouth, Hanover, New Hampshire.,Department of Epidemiology, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA.,Division of Neonatology, Department of Pediatrics, Children's Hospital at Dartmouth, Lebanon, New Hampshire
| | - Anne G Hoen
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA.,Children's Environmental Health and Disease Prevention Research Center at Dartmouth, Hanover, New Hampshire.,Department of Epidemiology, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
| | - A James O'Malley
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA.,The Dartmouth Institute for Health Policy and Clinical Practice, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
| | - Hongzhe Li
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
|
37
|
Chai H, Jiang H, Lin L, Liu L. A marginalized two-part Beta regression model for microbiome compositional data. PLoS Comput Biol 2018; 14:e1006329. [PMID: 30036363 PMCID: PMC6072097 DOI: 10.1371/journal.pcbi.1006329] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 07/13/2017] [Revised: 08/02/2018] [Accepted: 06/26/2018] [Indexed: 12/21/2022]
Abstract
In microbiome studies, an important goal is to detect differential abundance of microbes across clinical conditions and treatment options. However, the microbiome compositional data (quantified by relative abundance) are highly skewed, bounded in [0, 1), and often have many zeros. A two-part model is commonly used to separate zeros and positive values explicitly by two submodels: a logistic model for the probability of a species being present in Part I, and a Beta regression model for the relative abundance conditional on the presence of the species in Part II. However, the regression coefficients in Part II cannot provide a marginal (unconditional) interpretation of covariate effects on the microbial abundance, which is of great interest in many applications. In this paper, we propose a marginalized two-part Beta regression model which captures the zero-inflation and skewness of microbiome data and also allows investigators to examine covariate effects on the marginal (unconditional) mean. We demonstrate its practical performance using simulation studies and apply the model to a real metagenomic dataset on mouse skin microbiota. We find that under the proposed marginalized model, without loss in power, the likelihood ratio test performs better in controlling the type I error than those under conventional methods.
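The distinction between conditional and marginal effects can be made concrete with a short derivation. The first line follows from the law of total expectation. The logit links and the symbols p, μ, α, β, and γ in the remaining lines are assumptions for illustration, since the abstract does not state the link functions.

```latex
\begin{align*}
\mathbb{E}[Y] &= \Pr(Y > 0)\,\mathbb{E}[Y \mid Y > 0] + \Pr(Y = 0)\cdot 0 = p\,\mu \\
\text{Conventional two-part:}\quad \operatorname{logit}(p) &= x^{\top}\alpha,
  \qquad \operatorname{logit}(\mu) = x^{\top}\beta \\
\text{Marginalized (assumed form):}\quad \operatorname{logit}(p) &= x^{\top}\alpha,
  \qquad \operatorname{logit}(p\,\mu) = x^{\top}\gamma
\end{align*}
```

Under the conventional form, β describes covariate effects only among samples where the species is present, whereas γ in the marginalized form describes effects on the overall mean abundance.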
Affiliation(s)
- Haitao Chai
- Institute for Financial Studies, Shandong University, Jinan, Shandong, China
- Department of Preventive Medicine, Northwestern University, Chicago, Illinois, United States of America
| | - Hongmei Jiang
- Department of Statistics, Northwestern University, Evanston, Illinois, United States of America
| | - Lu Lin
- Institute for Financial Studies, Shandong University, Jinan, Shandong, China
| | - Lei Liu
- Department of Preventive Medicine, Northwestern University, Chicago, Illinois, United States of America
- Division of Biostatistics, Washington University in St. Louis, St. Louis, Missouri, United States of America
|
38
|
Fischer M, Strauch B, Renard BY. Abundance estimation and differential testing on strain level in metagenomics data. Bioinformatics 2018; 33:i124-i132. [PMID: 28881972 PMCID: PMC5870649 DOI: 10.1093/bioinformatics/btx237] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 01/24/2023]
Abstract
Motivation Current metagenomics approaches allow analyzing the composition of microbial communities at high resolution. Important changes to the composition are known to even occur on strain level and to go hand in hand with changes in disease or ecological state. However, specific challenges arise for strain level analysis due to highly similar genome sequences present. Only a limited number of tools approach taxa abundance estimation beyond species level and there is a strong need for dedicated tools for strain resolution and differential abundance testing. Methods We present DiTASiC (Differential Taxa Abundance including Similarity Correction) as a novel approach for quantification and differential assessment of individual taxa in metagenomics samples. We introduce a generalized linear model for the resolution of shared read counts which cause a significant bias on strain level. Further, we capture abundance estimation uncertainties, which play a crucial role in differential abundance analysis. A novel statistical framework is built, which integrates the abundance variance and infers abundance distributions for differential testing sensitive to strain level. Results As a result, we obtain highly accurate abundance estimates down to sub-strain level and enable fine-grained resolution of strain clusters. We demonstrate the relevance of read ambiguity resolution and integration of abundance uncertainties for differential analysis. Accurate detections of even small changes are achieved and false-positives are significantly reduced. Superior performance is shown on latest benchmark sets of various complexities and in comparison to existing methods. Availability and Implementation DiTASiC code is freely available from https://rki_bioinformatics.gitlab.io/ditasic. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martina Fischer
- Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
| | - Benjamin Strauch
- Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
| | - Bernhard Y Renard
- Bioinformatics Unit (MF1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, Berlin, Germany
|
39
|
Liu Z, Lin S. Sparse Treatment-Effect Model for Taxon Identification with High-Dimensional Metagenomic Data. Methods Mol Biol 2018; 1849:309-318. [PMID: 30298262 DOI: 10.1007/978-1-4939-8728-3_19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]
Abstract
Identifying disease-associated taxa is an important task in metagenomics. To date, many methods have been proposed for feature selection and prediction. However, these methods either use univariate (generalized) regression approaches to get the corresponding P-values without considering the interactions among taxa, or use lasso or L0 type sparse modeling approaches to identify taxa with best predictions without providing P-values. To the best of our knowledge, there are no available methods that consider taxon interactions and also generate P-values. In this paper, we propose a treatment-effect model for identifying taxa (STEMIT) and performing statistical inference with high-dimensional metagenomic data. STEMIT will provide a P-value for a taxon through a two-step treatment-effect maximization. It will provide causal inference if the study is a clinical trial. We first identify taxa associated with the treatment-effect variable and the targeting feature with sparse modeling, and then estimate the P-value of the targeting gene with ordinary least squares (OLS) regression. We demonstrate that the proposed method is efficient and can identify biologically important taxa with a real metagenomic data set. The software for L0 sparse modeling can be downloaded at https://cran.r-project.org/web/packages/l0ara/.
Affiliation(s)
- Zhenqiu Liu
- Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA.
| | - Shili Lin
- Department of Statistics, The Ohio State University, Columbus, OH, USA
|
40
|
Zhang X, Mallick H, Tang Z, Zhang L, Cui X, Benson AK, Yi N. Negative binomial mixed models for analyzing microbiome count data. BMC Bioinformatics 2017; 18:4. [PMID: 28049409 PMCID: PMC5209949 DOI: 10.1186/s12859-016-1441-7] [Citation(s) in RCA: 86] [Impact Index Per Article: 12.3] [Received: 08/26/2016] [Accepted: 12/21/2016] [Indexed: 12/21/2022]
Abstract
Background Recent advances in next-generation sequencing (NGS) technology enable researchers to collect a large volume of metagenomic sequencing data. These data provide valuable resources for investigating interactions between the microbiome and host environmental/clinical factors. In addition to the well-known properties of microbiome count measurements, for example, varied total sequence reads across samples, over-dispersion and zero-inflation, microbiome studies usually collect samples with hierarchical structures, which introduce correlation among the samples and thus further complicate the analysis and interpretation of microbiome count data. Results In this article, we propose negative binomial mixed models (NBMMs) for detecting the association between the microbiome and host environmental/clinical factors for correlated microbiome count data. Although they do not deal with zero-inflation, the proposed mixed-effects models account for correlation among the samples by incorporating random effects into the commonly used fixed-effects negative binomial model, and can efficiently handle over-dispersion and varying total reads. We have developed a flexible and efficient IWLS (Iterative Weighted Least Squares) algorithm to fit the proposed NBMMs by taking advantage of the standard procedure for fitting the linear mixed models. Conclusions We evaluate and demonstrate the proposed method via extensive simulation studies and the application to mouse gut microbiome data. The results show that the proposed method has desirable properties and outperforms the previously used methods in terms of both empirical power and Type I error. The method has been incorporated into the freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/ and http://github.com/abbyyan3/BhGLM), providing a useful tool for analyzing microbiome data.
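The two data features the fixed-effects part of such a model addresses, over-dispersion and varying total reads, can be illustrated with a small sketch. It uses the common mean/dispersion parameterization of the negative binomial and a log-offset for sequencing depth; it does not include random effects or the IWLS fitting described in the abstract. All function names and the symbol `theta` are illustrative assumptions.

```python
import math


def nb_logpmf(y, mu, theta):
    """Log-probability of count y under a negative binomial with mean mu.

    Assumed parameterization: variance = mu + mu**2 / theta, so smaller
    theta means stronger over-dispersion relative to a Poisson.
    """
    return (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
            + theta * math.log(theta / (theta + mu))
            + y * math.log(mu / (theta + mu)))


def nb_variance(mu, theta):
    """Variance implied by the mean/dispersion parameterization above."""
    return mu + mu * mu / theta


def expected_count(total_reads, linear_predictor):
    """Mean count with log(total_reads) as an offset.

    Equivalent to log(mu) = log(total_reads) + linear_predictor, so the
    linear predictor models relative abundance rather than the raw count.
    """
    return total_reads * math.exp(linear_predictor)
```

For example, with `mu = 5` and `theta = 2` the variance is 17.5, compared with 5 under a Poisson with the same mean.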
Affiliation(s)
- Xinyan Zhang
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, 35294-0022, USA
| | - Himel Mallick
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA.,Program in Medical and Population Genetics, the Broad Institute, Cambridge, MA, 02142, USA
| | - Zaixiang Tang
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, 215123, China
| | - Lei Zhang
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, 215123, China
| | - Xiangqin Cui
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, 35294-0022, USA
| | - Andrew K Benson
- Department of Food Science and Technology and Core for Applied Genomics and Ecology, University of Nebraska, Lincoln, NE, 68583, USA
| | - Nengjun Yi
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, 35294-0022, USA.
|
41
|
Wu C, Chen J, Kim J, Pan W. An adaptive association test for microbiome data. Genome Med 2016; 8:56. [PMID: 27198579 PMCID: PMC4872356 DOI: 10.1186/s13073-016-0302-3] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 01/06/2016] [Accepted: 04/12/2016] [Indexed: 02/07/2023]
Abstract
There is increasing interest in investigating how the compositions of microbial communities are associated with human health and disease. Although existing methods have identified many associations, a proper choice of a phylogenetic distance is critical for the power of these methods. To assess an overall association between the composition of a microbial community and an outcome of interest, we present a novel multivariate testing method called aMiSPU, which is joint and highly adaptive over all observed taxa and thus high powered across various scenarios, alleviating the issue with the choice of a phylogenetic distance. Our simulations and real-data analyses demonstrated that the aMiSPU test was often more powerful than several competing methods while correctly controlling type I error rates. The R package MiSPU is available at https://github.com/ChongWu-Biostat/MiSPU and CRAN.
Affiliation(s)
- Chong Wu
- Division of Biostatistics, University of Minnesota, 420 Delaware St. SE, Minneapolis, 55455, USA
| | - Jun Chen
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 First St. SW, Rochester, 55905, USA
| | - Junghi Kim
- Division of Biostatistics, University of Minnesota, 420 Delaware St. SE, Minneapolis, 55455, USA
| | - Wei Pan
- Division of Biostatistics, University of Minnesota, 420 Delaware St. SE, Minneapolis, 55455, USA.
|
42
|
Chen EZ, Li H. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics 2016; 32:2611-7. [PMID: 27187200 DOI: 10.1093/bioinformatics/btw308] [Citation(s) in RCA: 128] [Impact Index Per Article: 16.0] [Received: 02/17/2016] [Accepted: 05/11/2016] [Indexed: 12/22/2022]
Abstract
MOTIVATION The human microbial communities are associated with many human diseases such as obesity, diabetes and inflammatory bowel disease. High-throughput sequencing technology has been widely used to quantify the microbial composition in order to understand its impacts on human health. Longitudinal measurements of microbial communities are commonly obtained in many microbiome studies. A key question in such microbiome studies is to identify the microbes that are associated with clinical outcomes or environmental factors. However, microbiome compositional data are highly skewed, bounded in [0,1), and often sparse with many zeros. In addition, the observations from repeated measures in longitudinal studies are correlated. A method that takes into account these features is needed for association analysis in longitudinal microbiome data. RESULTS In this paper, we propose a two-part zero-inflated Beta regression model with random effects (ZIBR) for testing the association between microbial abundance and clinical covariates for longitudinal microbiome data. The model includes a logistic regression component to model presence/absence of a microbe in the samples and a Beta regression component to model non-zero microbial abundance, where each component includes a random effect to account for the correlations among the repeated measurements on the same subject. Both simulation studies and the application to real microbiome data have shown that ZIBR model outperformed the previously used methods. The method provides a useful tool for identifying the relevant taxa based on longitudinal or repeated measures in microbiome research. AVAILABILITY AND IMPLEMENTATION https://github.com/chvlyl/ZIBR CONTACT: hongzhe@upenn.edu.
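The two-part structure described above, a presence/absence part and a Beta part for non-zero abundance, can be sketched as a per-observation log-likelihood. This sketch leaves out the random effects and the regression on covariates, so it is not the ZIBR model itself; the function name `zib_loglik` and the mean/precision parameterization of the Beta distribution (`mu`, `phi`) are assumptions for illustration.

```python
import math


def zib_loglik(y, p, mu, phi):
    """Log-likelihood of one relative abundance y in [0, 1).

    Assumed parameterization: p is the probability that the taxon is
    present (y > 0); given presence, y ~ Beta(mu * phi, (1 - mu) * phi),
    so mu is the conditional mean and phi a precision parameter.
    """
    if not (0.0 <= y < 1.0):
        raise ValueError("y must lie in [0, 1)")
    if y == 0.0:
        # Absence contributes only through the presence/absence part.
        return math.log(1.0 - p)
    a, b = mu * phi, (1.0 - mu) * phi
    log_beta_density = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                        + (a - 1.0) * math.log(y)
                        + (b - 1.0) * math.log(1.0 - y))
    return math.log(p) + log_beta_density
```

In a regression setting, `p` and `mu` would each be tied to covariates through a link function, and summing this quantity over observations gives the objective to maximize.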
Affiliation(s)
- Eric Z Chen
- Genomics and Computational Biology Graduate Group Department of Biostatistics and Epidemiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
| | - Hongzhe Li
- Genomics and Computational Biology Graduate Group Department of Biostatistics and Epidemiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
|