1
Zhang L, Zhang X, Leach JM, Rahman AF, Yi N. Bayesian compositional models for ordinal response. Stat Methods Med Res 2024:9622802241247730. [PMID: 38654396 DOI: 10.1177/09622802241247730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/25/2024]
Abstract
Ordinal response is commonly found in medicine, biology, and other fields. In many situations, the predictors for this ordinal response are compositional, which means that the sum of predictors for each sample is fixed. Examples of compositional data include the relative abundance of species in microbiome data and the relative frequency of nutrition concentrations. Moreover, the predictors that are strongly correlated tend to have similar influence on the response outcome. Conventional cumulative logistic regression models for ordinal responses ignore the fixed-sum constraint on predictors and their associated interrelationships, and thus are not appropriate for analyzing compositional predictors. To solve this problem, we proposed Bayesian Compositional Models for Ordinal Response to analyze the relationship between compositional data and an ordinal response with a structured regularized horseshoe prior for the compositional coefficients and a soft sum-to-zero restriction on coefficients through the prior distribution. The method was implemented with R package rstan using efficient Hamiltonian Monte Carlo algorithm. We performed simulations to compare the proposed approach and existing methods for ordinal responses. Results revealed that our proposed method outperformed the existing methods in terms of parameter estimation and prediction. We also applied the proposed method to a microbiome study HMP2Data, to find microorganisms linked to ordinal inflammatory bowel disease levels. To make this work reproducible, the code and data used in this paper are available at https://github.com/Li-Zhang28/BCO.
Affiliation(s)
- Li Zhang
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Xinyan Zhang
  - School of Data Science and Analytics, Kennesaw State University, Kennesaw, GA, USA
- Justin M Leach
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Akm F Rahman
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Nengjun Yi
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
2
Yerke A, Fry Brumit D, Fodor AA. Proportion-based normalizations outperform compositional data transformations in machine learning applications. Microbiome 2024; 12:45. [PMID: 38443997 PMCID: PMC10913632 DOI: 10.1186/s40168-023-01747-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Accepted: 12/22/2023] [Indexed: 03/07/2024]
Abstract
BACKGROUND Normalization, as a pre-processing step, can significantly affect the resolution of machine learning analysis for microbiome studies. There are countless options for normalization scheme selection. In this study, we examined compositionally aware algorithms including the additive log ratio (alr), the centered log ratio (clr), and a recent evolution of the isometric log ratio (ilr) in the form of balance trees made with the PhILR R package. We also looked at compositionally naïve transformations such as raw counts tables and several transformations that are based on relative abundance, such as proportions, the Hellinger transformation, and a transformation based on the logarithm of proportions (which we call "lognorm"). RESULTS In our evaluation, we used 65 metadata variables culled from four publicly available datasets at the amplicon sequence variant (ASV) level with a random forest machine learning algorithm. We found that different common pre-processing steps in the creation of the balance trees made very little difference in overall performance. Overall, we found that the compositionally aware data transformations such as alr, clr, and ilr (PhILR) performed generally slightly worse or only as well as compositionally naïve transformations. However, relative abundance-based transformations outperformed most other transformations by a small but reliably statistically significant margin. CONCLUSIONS Our results suggest that minimizing the complexity of transformations while correcting for read depth may be a generally preferable strategy in preparing data for machine learning compared to more sophisticated, but more complex, transformations that attempt to better correct for compositionality.
Affiliation(s)
- Aaron Yerke
  - Department of Bioinformatics and Genomics, Bioinformatics Building, UNC Charlotte, The University of North Carolina, Charlotte 9331 Robert D. Snyder Rd, Charlotte, USA
  - Food Components and Health Laboratory, USDA, ARS, Beltsville Human Nutrition Research Center, Beltsville, USA
- Daisy Fry Brumit
  - Department of Bioinformatics and Genomics, Bioinformatics Building, UNC Charlotte, The University of North Carolina, Charlotte 9331 Robert D. Snyder Rd, Charlotte, USA
- Anthony A Fodor
  - Department of Bioinformatics and Genomics, Bioinformatics Building, UNC Charlotte, The University of North Carolina, Charlotte 9331 Robert D. Snyder Rd, Charlotte, USA
3
Zhang L, Zhang X, Yi N. Bayesian compositional generalized linear models for analyzing microbiome data. Stat Med 2024; 43:141-155. [PMID: 37985956 DOI: 10.1002/sim.9946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Revised: 10/07/2023] [Accepted: 10/12/2023] [Indexed: 11/22/2023]
Abstract
The crucial impact of the microbiome on human health and disease has gained significant scientific attention. Researchers seek to connect microbiome features with health conditions, aiming to predict diseases and develop personalized medicine strategies. However, the practicality of conventional models is restricted due to important aspects of microbiome data. Specifically, the data observed is compositional, as the counts within each sample are bound by a fixed-sum constraint. Moreover, microbiome data often exhibits high dimensionality, wherein the number of variables surpasses the available samples. In addition, microbiome features exhibiting phenotypical similarity usually have similar influence on the response variable. To address the challenges posed by these aspects of the data structure, we proposed Bayesian compositional generalized linear models for analyzing microbiome data (BCGLM) with a structured regularized horseshoe prior for the compositional coefficients and a soft sum-to-zero restriction on coefficients through the prior distribution. We fitted the proposed models using Markov Chain Monte Carlo (MCMC) algorithms with R package rstan. The performance of the proposed method was assessed by extensive simulation studies. The simulation results show that our approach outperforms existing methods with higher accuracy of coefficient estimates and lower prediction error. We also applied the proposed method to microbiome study to find microorganisms linked to inflammatory bowel disease (IBD). To make this work reproducible, the code and data used in this article are available at https://github.com/Li-Zhang28/BCGLM.
Affiliation(s)
- Li Zhang
  - Department of Biostatistics, University of Alabama at Birmingham, Alabama, USA
- Xinyan Zhang
  - School of Data Science and Analytics, Kennesaw State University, Kennesaw, Georgia, USA
- Nengjun Yi
  - Department of Biostatistics, University of Alabama at Birmingham, Alabama, USA
4
Wang Y, Shojaie A, Randolph T, Knight P, Ma J. Generalized matrix decomposition regression: Estimation and inference for two-way structured data. Ann Appl Stat 2023; 17:2944-2969. [PMID: 38149262 PMCID: PMC10751029 DOI: 10.1214/23-aoas1746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2023]
Abstract
Motivated by emerging applications in ecology, microbiology, and neuroscience, this paper studies high-dimensional regression with two-way structured data. To estimate the high-dimensional coefficient vector, we propose the generalized matrix decomposition regression (GMDR) to efficiently leverage auxiliary information on row and column structures. GMDR extends the principal component regression (PCR) to two-way structured data, but unlike PCR, GMDR selects the components that are most predictive of the outcome, leading to more accurate prediction. For inference on regression coefficients of individual variables, we propose the generalized matrix decomposition inference (GMDI), a general high-dimensional inferential framework for a large family of estimators that include the proposed GMDR estimator. GMDI provides more flexibility for incorporating relevant auxiliary row and column structures. As a result, GMDI does not require the true regression coefficients to be sparse, but constrains the coordinate system representing the regression coefficients according to the column structure. GMDI also allows dependent and heteroscedastic observations. We study the theoretical properties of GMDI in terms of both the type-I error rate and power and demonstrate the effectiveness of GMDR and GMDI in simulation studies and an application to human microbiome data.
Affiliation(s)
- Yue Wang
  - Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus
- Ali Shojaie
  - Department of Biostatistics, University of Washington
- Jing Ma
  - Public Health Sciences Division, Fred Hutchinson Cancer Center
5
Huang S, Ailer E, Kilbertus N, Pfister N. Supervised learning and model analysis with compositional data. PLoS Comput Biol 2023; 19:e1011240. [PMID: 37390111 DOI: 10.1371/journal.pcbi.1011240] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Accepted: 06/03/2023] [Indexed: 07/02/2023] [Open Access]
Abstract
Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, due to the compositionality and sparsity, existing techniques are often inadequate. Either they rely on extensions of the linear log-contrast model (which adjust for compositionality but cannot account for complex signals or sparsity) or they are based on black-box machine learning methods (which may capture useful signals, but lack interpretability due to the compositionality). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on par or improved predictive performance compared with state-of-the-art machine learning methods on 33 publicly available microbiome datasets. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast coefficients to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. KernelBiome is available as an open-source Python package on PyPI and at https://github.com/shimenghuang/KernelBiome.
Affiliation(s)
- Shimeng Huang
  - Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
- Niki Kilbertus
  - Helmholtz Munich, Munich, Germany
  - Technical University of Munich, Munich, Germany
- Niklas Pfister
  - Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
6
Li G, Li Y, Chen K. It's all relative: Regression analysis with compositional predictors. Biometrics 2023; 79:1318-1329. [PMID: 35616500 PMCID: PMC9767704 DOI: 10.1111/biom.13703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2021] [Accepted: 05/18/2022] [Indexed: 01/05/2023]
Abstract
Compositional data reside in a simplex and measure fractions or proportions of parts to a whole. Most existing regression methods for such data rely on log-ratio transformations that are inadequate or inappropriate in modeling high-dimensional data with excessive zeros and hierarchical structures. Moreover, such models usually lack a straightforward interpretation due to the interrelation between parts of a composition. We develop a novel relative-shift regression framework that directly uses proportions as predictors. The new framework provides a paradigm shift for regression analysis with compositional predictors and offers a superior interpretation of how shifting concentration between parts affects the response. New equi-sparsity and tree-guided regularization methods and an efficient smoothing proximal gradient algorithm are developed to facilitate feature aggregation and dimension reduction in regression. A unified finite-sample prediction error bound is derived for the proposed regularized estimators. We demonstrate the efficacy of the proposed methods in extensive simulation studies and a real gut microbiome study. Guided by the taxonomy of the microbiome data, the framework identifies important taxa at different taxonomic levels associated with the neurodevelopment of preterm infants.
Affiliation(s)
- Gen Li
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
- Yan Li
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
- Kun Chen
  - Department of Statistics, University of Connecticut, Connecticut, USA
7
Loganathan T, Priya Doss C G. The influence of machine learning technologies in gut microbiome research and cancer studies - A review. Life Sci 2022; 311:121118. [DOI: 10.1016/j.lfs.2022.121118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 10/19/2022] [Accepted: 10/19/2022] [Indexed: 11/18/2022]
8
Boyraz A, Pawlowsky-Glahn V, Egozcue JJ, Acar AC. Principal microbial groups: compositional alternative to phylogenetic grouping of microbiome data. Brief Bioinform 2022; 23:6675749. [PMID: 36007229 DOI: 10.1093/bib/bbac328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Revised: 07/19/2022] [Accepted: 07/20/2022] [Indexed: 11/13/2022] [Open Access]
Abstract
Statistical and machine learning techniques based on relative abundances have been used to predict health conditions and to identify microbial biomarkers. However, high dimensionality, sparsity and the compositional nature of microbiome data represent statistical challenges. On the other hand, the taxon grouping allows summarizing microbiome abundance with a coarser resolution in a lower dimension, but it presents new challenges when correlating taxa with a disease. In this work, we present a novel approach that groups Operational Taxonomical Units (OTUs) based only on relative abundances as an alternative to taxon grouping. The proposed procedure acknowledges the compositional data making use of principal balances. The identified groups are called Principal Microbial Groups (PMGs). The procedure reduces the need for user-defined aggregation of OTUs and offers the possibility of working with coarse group of OTUs, which are not present in a phylogenetic tree. PMGs can be used for two different goals: (1) as a dimensionality reduction method for compositional data, (2) as an aggregation procedure that provides an alternative to taxon grouping for construction of microbial balances afterward used for disease prediction. We illustrate the procedure with a cirrhosis study data. PMGs provide a coherent data analysis for the search of biomarkers in human microbiota. The source code and demo data for PMGs are available at: https://github.com/asliboyraz/PMGs.
Affiliation(s)
- Aslı Boyraz
  - Department of Computer Programming, Recep Tayyip Erdoğan University, Ardeşen Vocational School, Rize, 53400, Turkey
- Vera Pawlowsky-Glahn
  - Department of Computer Sciences, Applied Mathematics and Statistics, University of Girona, Campus Montilivi, 17003 Girona, Spain
- Juan José Egozcue
  - Department of Civil and Environmental Engineering, Universitat Politécnica de Catalunya, Barcelona, 08034, Spain
- Aybar Can Acar
  - Department of Medical Informatics, Middle East Technical University, Ankara, Turkey
9
Principal Amalgamation Analysis for Microbiome Data. Genes (Basel) 2022; 13:genes13071139. [PMID: 35885922 PMCID: PMC9318429 DOI: 10.3390/genes13071139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2022] [Revised: 06/14/2022] [Accepted: 06/21/2022] [Indexed: 12/02/2022] [Open Access]
Abstract
In recent years microbiome studies have become increasingly prevalent and large-scale. Through high-throughput sequencing technologies and well-established analytical pipelines, relative abundance data of operational taxonomic units and their associated taxonomic structures are routinely produced. Since such data can be extremely sparse and high dimensional, there is often a genuine need for dimension reduction to facilitate data visualization and downstream statistical analysis. We propose Principal Amalgamation Analysis (PAA), a novel amalgamation-based and taxonomy-guided dimension reduction paradigm for microbiome data. Our approach aims to aggregate the compositions into a smaller number of principal compositions, guided by the available taxonomic structure, by minimizing a properly measured loss of information. The choice of the loss function is flexible and can be based on familiar diversity indices for preserving either within-sample or between-sample diversity in the data. To enable scalable computation, we develop a hierarchical PAA algorithm to trace the entire trajectory of successive simple amalgamations. Visualization tools including dendrogram, scree plot, and ordination plot are developed. The effectiveness of PAA is demonstrated using gut microbiome data from a preterm infant study and an HIV infection study.
11
Khomich M, Måge I, Rud I, Berget I. Analysing microbiome intervention design studies: Comparison of alternative multivariate statistical methods. PLoS One 2021; 16:e0259973. [PMID: 34793531 PMCID: PMC8601541 DOI: 10.1371/journal.pone.0259973] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/10/2021] [Accepted: 10/30/2021] [Indexed: 12/13/2022] [Open Access]
Abstract
The diet plays a major role in shaping gut microbiome composition and function in both humans and animals, and dietary intervention trials are often used to investigate and understand these effects. A plethora of statistical methods for analysing the differential abundance of microbial taxa exists, and new methods are constantly being developed, but there is a lack of benchmarking studies and clear consensus on the best multivariate statistical practices. This makes it hard for a biologist to decide which method to use. We compared the outcomes of generic multivariate ANOVA (ASCA and FFMANOVA) against statistical methods commonly used for community analyses (PERMANOVA and SIMPER) and methods designed for analysis of count data from high-throughput sequencing experiments (ALDEx2, ANCOM and DESeq2). The comparison is based on both simulated data and five published dietary intervention trials representing different subjects and study designs. We found that the methods testing differences at the community level were in agreement regarding both effect size and statistical significance. However, the methods that provided ranking and identification of differentially abundant operational taxonomic units (OTUs) gave incongruent results, implying that the choice of method is likely to influence the biological interpretations. The generic multivariate ANOVA tools have the flexibility needed for analysing multifactorial experiments and provide outputs at both the community and OTU levels; good performance in the simulation studies suggests that these statistical tools are also suitable for microbiome data sets.
Affiliation(s)
- Maryia Khomich
  - Division of Food Science, Department of Food Safety and Quality, Nofima – Norwegian Institute of Food, Fisheries and Aquaculture Research, Ås, Norway
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Ingrid Måge
  - Division of Food Science, Department of Raw Materials and Process Optimisation, Nofima – Norwegian Institute of Food, Fisheries and Aquaculture Research, Ås, Norway
- Ida Rud
  - Division of Food Science, Department of Food Safety and Quality, Nofima – Norwegian Institute of Food, Fisheries and Aquaculture Research, Ås, Norway
- Ingunn Berget
  - Division of Food Science, Department of Raw Materials and Process Optimisation, Nofima – Norwegian Institute of Food, Fisheries and Aquaculture Research, Ås, Norway
12
Wang C, Hu J, Blaser MJ, Li H. Microbial trend analysis for common dynamic trend, group comparison, and classification in longitudinal microbiome study. BMC Genomics 2021; 22:667. [PMID: 34525957 PMCID: PMC8442444 DOI: 10.1186/s12864-021-07948-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/01/2020] [Accepted: 08/25/2021] [Indexed: 12/26/2022] [Open Access]
Abstract
BACKGROUND The human microbiome is inherently dynamic and its dynamic nature plays a critical role in maintaining health and driving disease. With an increasing number of longitudinal microbiome studies, scientists are eager to learn the comprehensive characterization of microbial dynamics and their implications to the health and disease-related phenotypes. However, due to the challenging structure of longitudinal microbiome data, few analytic methods are available to characterize the microbial dynamics over time. RESULTS We propose a microbial trend analysis (MTA) framework for the high-dimensional and phylogenetically-based longitudinal microbiome data. In particular, MTA can perform three tasks: 1) capture the common microbial dynamic trends for a group of subjects at the community level and identify the dominant taxa; 2) examine whether or not the microbial overall dynamic trends are significantly different between groups; 3) classify an individual subject based on its longitudinal microbial profiling. Our extensive simulations demonstrate that the proposed MTA framework is robust and powerful in hypothesis testing, taxon identification, and subject classification. Our real data analyses further illustrate the utility of MTA through a longitudinal study in mice. CONCLUSIONS The proposed MTA framework is an attractive and effective tool in investigating dynamic microbial pattern from longitudinal microbiome studies.
Affiliation(s)
- Chan Wang
  - Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, 10016 NY USA
- Jiyuan Hu
  - Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, 10016 NY USA
- Martin J. Blaser
  - Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, 08854-8021 NJ USA
- Huilin Li
  - Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, 10016 NY USA
13
Bien J, Yan X, Simpson L, Müller CL. Tree-aggregated predictive modeling of microbiome data. Sci Rep 2021; 11:14505. [PMID: 34267244 PMCID: PMC8282688 DOI: 10.1038/s41598-021-93645-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 10/15/2020] [Accepted: 06/22/2021] [Indexed: 01/05/2023] [Open Access]
Abstract
Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
Affiliation(s)
- Jacob Bien
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA
- Léo Simpson
  - Technische Universität München, Munich, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Christian L Müller
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
  - Center for Computational Mathematics, Flatiron Institute, Simons Foundation, New York, NY, USA
14
Goren E, Wang C, He Z, Sheflin AM, Chiniquy D, Prenni JE, Tringe S, Schachtman DP, Liu P. Feature selection and causal analysis for microbiome studies in the presence of confounding using standardization. BMC Bioinformatics 2021; 22:362. [PMID: 34229628 PMCID: PMC8261956 DOI: 10.1186/s12859-021-04232-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/31/2021] [Accepted: 06/03/2021] [Indexed: 12/25/2022] [Open Access]
Abstract
BACKGROUND Microbiome studies have uncovered associations between microbes and human, animal, and plant health outcomes. This has led to an interest in developing microbial interventions for treatment of disease and optimization of crop yields which requires identification of microbiome features that impact the outcome in the population of interest. That task is challenging because of the high dimensionality of microbiome data and the confounding that results from the complex and dynamic interactions among host, environment, and microbiome. In the presence of such confounding, variable selection and estimation procedures may have unsatisfactory performance in identifying microbial features with an effect on the outcome. RESULTS In this manuscript, we aim to estimate population-level effects of individual microbiome features while controlling for confounding by a categorical variable. Due to the high dimensionality and confounding-induced correlation between features, we propose feature screening, selection, and estimation conditional on each stratum of the confounder followed by a standardization approach to estimation of population-level effects of individual features. Comprehensive simulation studies demonstrate the advantages of our approach in recovering relevant features. Utilizing a potential-outcomes framework, we outline assumptions required to ascribe causal, rather than associational, interpretations to the identified microbiome effects. We conducted an agricultural study of the rhizosphere microbiome of sorghum in which nitrogen fertilizer application is a confounding variable. In this study, the proposed approach identified microbial taxa that are consistent with biological understanding of potential plant-microbe interactions. 
CONCLUSIONS Standardization enables more accurate identification of individual microbiome features with an effect on the outcome of interest compared to other variable selection and estimation procedures when there is confounding by a categorical variable.
Affiliation(s)
- Emily Goren
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
- Chong Wang
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
  - Department of Veterinary Diagnostic and Production Animal Medicine, Iowa State University, 2203 Lloyd Veterinary Medical Center, Ames, IA, 50011, USA
- Zhulin He
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
- Amy M Sheflin
  - Department of Horticulture and Landscape Architecture, Colorado State University, 301 University Ave, Fort Collins, CO, 80523, USA
- Dawn Chiniquy
  - Department of Energy, Joint Genome Institute, 2800 Mitchell Dr, Walnut Creek, CA, 94598, USA
- Jessica E Prenni
  - Department of Horticulture and Landscape Architecture, Colorado State University, 301 University Ave, Fort Collins, CO, 80523, USA
- Susannah Tringe
  - Department of Energy, Joint Genome Institute, 2800 Mitchell Dr, Walnut Creek, CA, 94598, USA
- Daniel P Schachtman
  - Department of Agronomy and Horticulture, University of Nebraska, 1825 N 38th St, Lincoln, NE, 68583, USA
- Peng Liu
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
15
Jiang R, Li WV, Li JJ. mbImpute: an accurate and robust imputation method for microbiome data. Genome Biol 2021; 22:192. [PMID: 34183041 PMCID: PMC8240317 DOI: 10.1186/s13059-021-02400-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 02/12/2021] [Accepted: 06/04/2021] [Indexed: 12/22/2022] [Open Access]
Abstract
A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability of scientific discoveries. To address this issue, we propose the first imputation method for microbiome data, mbImpute, to identify and recover likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. We demonstrate that mbImpute improves the power of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer, and that mbImpute preserves the non-zero distributions of taxa abundances.
Affiliation(s)
- Ruochen Jiang
- Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
- Wei Vivian Li
- Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Piscataway, 08854, NJ, USA
- Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, 90095-1554, CA, USA
- Department of Human Genetics, University of California, Los Angeles, 90095-7088, CA, USA
- Department of Computational Medicine, University of California, Los Angeles, 90095-1766, CA, USA
- Department of Biostatistics, University of California, Los Angeles, 90095-1772, CA, USA
|
16
|
Marcos-Zambrano LJ, Karaduzovic-Hadziabdic K, Loncar Turukalo T, Przymus P, Trajkovik V, Aasmets O, Berland M, Gruca A, Hasic J, Hron K, Klammsteiner T, Kolev M, Lahti L, Lopes MB, Moreno V, Naskinova I, Org E, Paciência I, Papoutsoglou G, Shigdel R, Stres B, Vilne B, Yousef M, Zdravevski E, Tsamardinos I, Carrillo de Santa Pau E, Claesson MJ, Moreno-Indias I, Truu J. Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment. Front Microbiol 2021; 12:634511. [PMID: 33737920 PMCID: PMC7962872 DOI: 10.3389/fmicb.2021.634511] [Citation(s) in RCA: 113] [Impact Index Per Article: 37.7] [Received: 11/27/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022] Open
Abstract
The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.
Affiliation(s)
- Laura Judith Marcos-Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Piotr Przymus
- Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Vladimir Trajkovik
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Oliver Aasmets
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Magali Berland
- Université Paris-Saclay, INRAE, MGP, Jouy-en-Josas, France
- Aleksandra Gruca
- Department of Computer Networks and Systems, Silesian University of Technology, Gliwice, Poland
- Jasminka Hasic
- University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Karel Hron
- Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
- Mikhail Kolev
- South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
- Marta B. Lopes
- NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
- Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
- Victor Moreno
- Oncology Data Analytics Program, Catalan Institute of Oncology (ICO), Barcelona, Spain
- Colorectal Cancer Group, Institut de Recerca Biomedica de Bellvitge (IDIBELL), Barcelona, Spain
- Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
- Department of Clinical Sciences, Faculty of Medicine, University of Barcelona, Barcelona, Spain
- Irina Naskinova
- South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Elin Org
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Inês Paciência
- EPIUnit – Instituto de Saúde Pública da Universidade do Porto, Porto, Portugal
- Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway
- Blaz Stres
- Group for Microbiology and Microbial Biotechnology, Department of Animal Science, University of Ljubljana, Ljubljana, Slovenia
- Baiba Vilne
- Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
- Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
- Eftim Zdravevski
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Marcus J. Claesson
- School of Microbiology & APC Microbiome Ireland, University College Cork, Cork, Ireland
- Isabel Moreno-Indias
- Unidad de Gestión Clínica de Endocrinología y Nutrición, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
- Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
- Jaak Truu
- Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
|
17
|
Fadel WF, Urbanek JK, Glynn NW, Harezlak J. Use of Functional Linear Models to Detect Associations between Characteristics of Walking and Continuous Responses Using Accelerometry Data. Sensors 2020; 20:s20216394. [PMID: 33182460 PMCID: PMC7665147 DOI: 10.3390/s20216394] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/01/2020] [Revised: 11/03/2020] [Accepted: 11/06/2020] [Indexed: 11/16/2022]
Abstract
Various methods exist to measure physical activity. Subjective methods, such as diaries and surveys, are relatively inexpensive ways of measuring one’s physical activity; however, they are prone to measurement error and bias due to self-reporting. Wearable accelerometers offer a non-invasive and objective measure of one’s physical activity and are now widely used in observational studies. Accelerometers record high-frequency data, and each produces an unlabeled time series at the sub-second level. An important activity to identify from the collected data is walking, since it is often the only form of activity for certain populations. Currently, most methods use an activity summary, which ignores the nuances of walking data. We propose a methodology to model specific continuous responses with a functional linear model that uses spectra obtained from the local fast Fourier transform (FFT) of walking as a predictor. We incorporate prior knowledge of the mechanics of walking as additional information for the structure of our transformed walking spectra. The methods were applied to the in-the-laboratory data obtained from the Developmental Epidemiologic Cohort Study (DECOS).
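The spectral predictor mentioned in this abstract can be illustrated with a short sketch. It is an illustration only, not the authors' implementation: the function name `local_spectra`, the non-overlapping one-second windows, and the mean-centering step are assumptions introduced here. The code splits a signal into windows and computes the amplitude spectrum of each window with NumPy's real FFT.

```python
import numpy as np


def local_spectra(signal, fs, window_s=1.0):
    """Amplitude spectra of non-overlapping windows (hypothetical helper).

    signal   : 1-D sequence of acceleration samples
    fs       : sampling frequency in Hz
    window_s : window length in seconds (assumed value, not from the paper)
    """
    signal = np.asarray(signal, dtype=float)
    n = int(round(fs * window_s))            # samples per window
    n_windows = len(signal) // n             # drop any incomplete final window
    frames = signal[: n_windows * n].reshape(n_windows, n)
    # Remove each window's mean so the 0 Hz bin does not dominate.
    frames = frames - frames.mean(axis=1, keepdims=True)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)   # frequency of each spectrum bin
    return freqs, spectra
```

Each row of `spectra` could then serve as a functional predictor evaluated on the grid `freqs`. How the windows are chosen and how the spectra are transformed before modeling are details of the paper that this sketch does not attempt to reproduce.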
Affiliation(s)
- William F. Fadel
- Department of Biostatistics, Fairbanks School of Public Health, Indiana University, Indianapolis, IN 46202, USA
- Correspondence: (W.F.F.); (J.H.)
- Jacek K. Urbanek
- Department of Medicine, Division of Geriatric Medicine and Gerontology, School of Medicine, Johns Hopkins University, Baltimore, MD 21205, USA
- Nancy W. Glynn
- Department of Epidemiology, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Jaroslaw Harezlak
- Department of Epidemiology and Biostatistics, Indiana University, Bloomington, IN 47405, USA
- Correspondence: (W.F.F.); (J.H.)
|
18
|
Affiliation(s)
- Jacob Bien
- Data Sciences and Operations, USC Marshall, Los Angeles, CA
|
19
|
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach.
Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
- Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States
|
20
|
Zhou YH, Gallins P. A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction. Front Genet 2019; 10:579. [PMID: 31293616 PMCID: PMC6603228 DOI: 10.3389/fgene.2019.00579] [Citation(s) in RCA: 88] [Impact Index Per Article: 17.6] [Received: 01/13/2019] [Accepted: 06/04/2019] [Indexed: 12/19/2022] Open
Abstract
With the growing importance of microbiome research, there is increasing evidence that host variation in microbial communities is associated with overall host health. Advancement in genetic sequencing methods for microbiomes has coincided with improvements in machine learning, with important implications for disease risk prediction in humans. One aspect specific to microbiome prediction is the use of taxonomy-informed feature selection. In this review for non-experts, we explore the most commonly used machine learning methods, and evaluate their prediction accuracy as applied to microbiome host trait prediction. Methods are described at an introductory level, and R/Python code for the analyses is provided.
Affiliation(s)
- Yi-Hui Zhou
- Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
- Paul Gallins
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
|
21
|
Fukuyama J. Adaptive gPCA: A method for structured dimensionality reduction with applications to microbiome data. Ann Appl Stat 2019. [DOI: 10.1214/18-aoas1227] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
|
22
|
Xiao J, Chen L, Yu Y, Zhang X, Chen J. A Phylogeny-Regularized Sparse Regression Model for Predictive Modeling of Microbial Community Data. Front Microbiol 2018; 9:3112. [PMID: 30619188 PMCID: PMC6305753 DOI: 10.3389/fmicb.2018.03112] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 08/31/2018] [Accepted: 12/03/2018] [Indexed: 12/16/2022] Open
Abstract
Fueled by technological advancement, there has been a surge of human microbiome studies surveying the microbial communities associated with the human body and their links with health and disease. As a complement to the human genome, the human microbiome holds great potential for precision medicine. Efficient predictive models based on microbiome data could be potentially used in various clinical applications such as disease diagnosis, patient stratification and drug response prediction. One important characteristic of the microbial community data is the phylogenetic tree that relates all the microbial taxa based on their evolutionary history. The phylogenetic tree is an informative prior for more efficient prediction since the microbial community changes are usually not randomly distributed on the tree but tend to occur in clades at varying phylogenetic depths (clustered signal). Although community-wide changes are possible for some conditions, it is also likely that the community changes are only associated with a small subset of "marker" taxa (sparse signal). Unfortunately, predictive models of microbial community data taking into account both the sparsity and the tree structure remain under-developed. In this paper, we propose a predictive framework to exploit sparse and clustered microbiome signals using a phylogeny-regularized sparse regression model. Our approach is motivated by evolutionary theory, where a natural correlation structure among microbial taxa exists according to the phylogenetic relationship. A novel phylogeny-based smoothness penalty is proposed to smooth the coefficients of the microbial taxa with respect to the phylogenetic tree. Using simulated and real datasets, we show that our method achieves better prediction performance than competing sparse regression methods for sparse and clustered microbiome signals.
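The idea of smoothing coefficients along a tree can be illustrated with a generic graph-Laplacian penalty. The sketch below is an assumption-laden illustration, not the penalty defined in the paper: the exponential weighting of pairwise tree distances, the `scale` parameter, and both function names are hypothetical. The sparsity-inducing part of the model is not shown.

```python
import numpy as np


def laplacian_from_distances(dist, scale=1.0):
    """Graph Laplacian from pairwise tree distances (hypothetical weighting).

    Taxa that are close on the tree receive large weights, so differences
    between their coefficients are penalized more heavily.
    """
    dist = np.asarray(dist, dtype=float)
    weights = np.exp(-dist / scale)      # assumed similarity kernel
    np.fill_diagonal(weights, 0.0)       # no self-edges
    return np.diag(weights.sum(axis=1)) - weights


def smoothness_penalty(beta, laplacian):
    """Quadratic form beta' L beta, i.e. the sum over taxon pairs i < j of
    w_ij * (beta_i - beta_j)^2 for symmetric weights w."""
    beta = np.asarray(beta, dtype=float)
    return float(beta @ laplacian @ beta)
```

Under this sketch, a coefficient vector that is constant within a clade of closely related taxa incurs a small penalty, while one that changes sign between close relatives incurs a large one. In a penalized regression this term would be added to the loss alongside a sparsity penalty such as an L1 norm.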
Affiliation(s)
- Jian Xiao
- Division of Biomedical Statistics and Informatics, Center for Individualized Medicine, Mayo Clinic Rochester, MN, United States
- School of Statistics and Mathematics Zhongnan University of Economics and Law, Wuhan, China
- Li Chen
- Department of Health Outcomes Research and Policy, Harrison School of Pharmacy, Auburn University Auburn, AL, United States
- Yue Yu
- Division of Biomedical Statistics and Informatics, Center for Individualized Medicine, Mayo Clinic Rochester, MN, United States
- Xianyang Zhang
- Department of Statistics, Texas A&M University College Station, TX, United States
- Jun Chen
- Division of Biomedical Statistics and Informatics, Center for Individualized Medicine, Mayo Clinic Rochester, MN, United States
|
23
|
Li Z, Lee K, Karagas MR, Madan JC, Hoen AG, O'Malley AJ, Li H. Conditional Regression Based on a Multivariate Zero-Inflated Logistic-Normal Model for Microbiome Relative Abundance Data. Statistics in Biosciences 2018; 10:587-608. [PMID: 30923584 DOI: 10.1007/s12561-018-9219-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 02/08/2023]
Abstract
The human microbiome plays critical roles in human health and has been linked to many diseases. While advanced sequencing technologies can characterize the composition of the microbiome in unprecedented detail, it remains challenging to disentangle the complex interplay between human microbiome and disease risk factors due to the complicated nature of microbiome data. Excessive numbers of zero values, high dimensionality, the hierarchical phylogenetic tree and compositional structure are compounded and consequently make existing methods inadequate to appropriately address these issues. We propose a multivariate two-part zero-inflated logistic normal (MZILN) model to analyze the association of disease risk factors with individual microbial taxa and overall microbial community composition. This approach can naturally handle excessive numbers of zeros and the compositional data structure with the discrete part and the logistic-normal part of the model. For parameter estimation, an estimating equations approach is employed that enables us to address the complex inter-taxa correlation structure induced by the hierarchical phylogenetic tree structure and the compositional data structure. This model is able to incorporate standard regularization approaches to deal with high dimensionality. Simulation shows that our model outperforms existing methods. Our approach is also compared to others using the analysis of real microbiome data.
Affiliation(s)
- Zhigang Li
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
- Children's Environmental Health and Disease Prevention Research Center at Dartmouth, Hanover, New Hampshire
- Department of Epidemiology, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
- Department of Biostatistics, University of Florida, Gainesville, FL 32611, USA
- Margaret R Karagas
- Children's Environmental Health and Disease Prevention Research Center at Dartmouth, Hanover, New Hampshire
- Department of Epidemiology, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
- Juliette C Madan
- Children's Environmental Health and Disease Prevention Research Center at Dartmouth, Hanover, New Hampshire
- Department of Epidemiology, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
- Division of Neonatology, Department of Pediatrics, Children's Hospital at Dartmouth, Lebanon, New Hampshire
- Anne G Hoen
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
- Children's Environmental Health and Disease Prevention Research Center at Dartmouth, Hanover, New Hampshire
- Department of Epidemiology, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
- A James O'Malley
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
- The Dartmouth Institute for Health Policy and Clinical Practice, Geisel School of Medicine at Dartmouth, 1 Medical Center Drive, Lebanon, NH 03756, USA
- Hongzhe Li
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
|
24
|
Mallick H, Ma S, Franzosa EA, Vatanen T, Morgan XC, Huttenhower C. Experimental design and quantitative analysis of microbial community multiomics. Genome Biol 2017; 18:228. [PMID: 29187204 PMCID: PMC5708111 DOI: 10.1186/s13059-017-1359-z] [Citation(s) in RCA: 112] [Impact Index Per Article: 16.0] [Indexed: 02/07/2023] Open
Abstract
Studies of the microbiome have become increasingly sophisticated, and multiple sequence-based, molecular methods as well as culture-based methods exist for population-scale microbiome profiles. To link the resulting host and microbial data types to human health, several experimental design considerations, data analysis challenges, and statistical epidemiological approaches must be addressed. Here, we survey current best practices for experimental design in microbiome molecular epidemiology, including technologies for generating, analyzing, and integrating microbiome multiomics data. We highlight studies that have identified molecular bioactives that influence human health, and we suggest steps for scaling translational microbiome research to high-throughput target discovery across large populations.
Affiliation(s)
- Himel Mallick
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Siyuan Ma
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Eric A Franzosa
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Tommi Vatanen
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Xochitl C Morgan
- Department of Microbiology and Immunology, The University of Otago, Dunedin, New Zealand
- Curtis Huttenhower
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
|