1
Van Booven DJ, Chen CB, Malpani S, Mirzabeigi Y, Mohammadi M, Wang Y, Kryvenko ON, Punnen S, Arora H. Synthetic Genitourinary Image Synthesis via Generative Adversarial Networks: Enhancing Artificial Intelligence Diagnostic Precision. J Pers Med 2024; 14:703. [PMID: 39063957 PMCID: PMC11278131 DOI: 10.3390/jpm14070703] [Received: 05/23/2024] [Revised: 06/20/2024] [Accepted: 06/24/2024] [Indexed: 07/28/2024]
Abstract
INTRODUCTION In the realm of computational pathology, the scarcity and restricted diversity of genitourinary (GU) tissue datasets pose significant challenges for training robust diagnostic models. This study explores the potential of Generative Adversarial Networks (GANs) to mitigate these limitations by generating high-quality synthetic images of rare or underrepresented GU tissues. We hypothesized that augmenting the training data of computational pathology models with these GAN-generated images, validated through pathologist evaluation and quantitative similarity measures, would significantly enhance model performance in tasks such as tissue classification, segmentation, and disease detection. METHODS To test this hypothesis, we employed a GAN model to produce synthetic images of eight different GU tissues. The quality of these images was rigorously assessed using a Relative Inception Score (RIS) of 1.27 ± 0.15 and a Fréchet Inception Distance (FID) that stabilized at 120, metrics that reflect the visual and statistical fidelity of the generated images to real histopathological images. Additionally, the synthetic images received an 80% approval rating from board-certified pathologists, further validating their realism and diagnostic utility. We used an alternative Spatial Heterogeneous Recurrence Quantification Analysis (SHRQA) to assess the quality of prostate tissue. This allowed us to make a comparison between original and synthetic data in the context of features, which were further validated by the pathologist's evaluation. Future work will focus on implementing a deep learning model to evaluate the performance of the augmented datasets in tasks such as tissue classification, segmentation, and disease detection. This will provide a more comprehensive understanding of the utility of GAN-generated synthetic images in enhancing computational pathology workflows. 
RESULTS This study not only confirms the feasibility of using GANs for data augmentation in medical image analysis but also highlights the critical role of synthetic data in addressing the challenges of dataset scarcity and imbalance. CONCLUSIONS Future work will focus on refining the generative models to produce even more diverse and complex tissue representations, potentially transforming the landscape of medical diagnostics with AI-driven solutions.
Affiliation(s)
- Derek J. Van Booven
  - John P Hussman Institute for Human Genomics, Miller School of Medicine, University of Miami, Miami, FL 33136, USA
- Cheng-Bang Chen
  - Department of Industrial and Systems Engineering, University of Miami, Coral Gables, FL 33146, USA; (C.-B.C.); (Y.W.)
- Sheetal Malpani
  - Department of Pathology, Miller School of Medicine, University of Miami, Miami, FL 33136, USA; (S.M.); (Y.M.); (O.N.K.)
- Yasamin Mirzabeigi
  - Department of Pathology, Miller School of Medicine, University of Miami, Miami, FL 33136, USA; (S.M.); (Y.M.); (O.N.K.)
- Maral Mohammadi
  - Department of Pathology, University of Debrecen in Hungary, 4032 Debrecen, Hungary
- Yujie Wang
  - Department of Industrial and Systems Engineering, University of Miami, Coral Gables, FL 33146, USA; (C.-B.C.); (Y.W.)
- Oleksander N. Kryvenko
  - Department of Pathology, Miller School of Medicine, University of Miami, Miami, FL 33136, USA; (S.M.); (Y.M.); (O.N.K.)
- Sanoj Punnen
  - Desai & Sethi Institute of Urology, Miller School of Medicine, University of Miami, Miami, FL 33136, USA
- Himanshu Arora
  - John P Hussman Institute for Human Genomics, Miller School of Medicine, University of Miami, Miami, FL 33136, USA
  - Department of Pathology, University of Debrecen in Hungary, 4032 Debrecen, Hungary
  - Desai & Sethi Institute of Urology, Miller School of Medicine, University of Miami, Miami, FL 33136, USA
  - The Interdisciplinary Stem Cell Institute, Miller School of Medicine, University of Miami, Miami, FL 33136, USA

2
Deek RA, Ma S, Lewis J, Li H. Statistical and computational methods for integrating microbiome, host genomics, and metabolomics data. eLife 2024; 13:e88956. [PMID: 38832759 PMCID: PMC11149933 DOI: 10.7554/elife.88956] [Received: 04/26/2023] [Accepted: 05/10/2024] [Indexed: 06/05/2024]
Abstract
Large-scale microbiome studies are progressively utilizing multiomics designs, which include the collection of microbiome samples together with host genomics and metabolomics data. Despite the increasing number of data sources, there remains a bottleneck in understanding the relationships between different data modalities due to the limited number of statistical and computational methods for analyzing such data. Furthermore, little is known about the portability of general methods to the metagenomic setting and few specialized techniques have been developed. In this review, we summarize and implement some of the commonly used methods. We apply these methods to real data sets where shotgun metagenomic sequencing and metabolomics data are available for microbiome multiomics data integration analysis. We compare results across methods, highlight strengths and limitations of each, and discuss areas where statistical and computational innovation is needed.
Affiliation(s)
- Rebecca A Deek
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, United States
- Siyuan Ma
  - Department of Biostatistics, Vanderbilt School of Medicine, Nashville, United States
- James Lewis
  - Division of Gastroenterology and Hepatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
- Hongzhe Li
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States

3
Mishra AK, Mahmud I, Lorenzi PL, Jenq RR, Wargo JA, Ajami NJ, Peterson CB. TARO: tree-aggregated factor regression for microbiome data integration. Bioinformatics 2024; 40:btae321. [PMID: 38788190 PMCID: PMC11193058 DOI: 10.1093/bioinformatics/btae321] [Received: 09/21/2023] [Revised: 03/16/2024] [Accepted: 05/15/2024] [Indexed: 05/26/2024]
Abstract
MOTIVATION Although the human microbiome plays a key role in health and disease, the biological mechanisms underlying the interaction between the microbiome and its host are incompletely understood. Integration with other molecular profiling data offers an opportunity to characterize the role of the microbiome and elucidate therapeutic targets. However, this remains challenging due to the high dimensionality, compositionality, and rare features found in microbiome profiling data. These challenges necessitate the use of methods that can achieve structured sparsity in learning cross-platform association patterns. RESULTS We propose Tree-Aggregated factor RegressiOn (TARO) for the integration of microbiome and metabolomic data. We leverage information on the taxonomic tree structure to flexibly aggregate rare features. We demonstrate through simulation studies that TARO accurately recovers a low-rank coefficient matrix and identifies relevant features. We applied TARO to microbiome and metabolomic profiles gathered from subjects being screened for colorectal cancer to understand how gut microorganisms shape intestinal metabolite abundances. AVAILABILITY AND IMPLEMENTATION The R package TARO implementing the proposed methods is available online at https://github.com/amishra-stats/taro-package.
Affiliation(s)
- Aditya K Mishra
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Platform for Innovative Microbiome and Translational Research (PRIME-TR), The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Iqbal Mahmud
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Philip L Lorenzi
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Robert R Jenq
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Platform for Innovative Microbiome and Translational Research (PRIME-TR), The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Jennifer A Wargo
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Platform for Innovative Microbiome and Translational Research (PRIME-TR), The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Department of Surgical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Nadim J Ajami
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Platform for Innovative Microbiome and Translational Research (PRIME-TR), The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Christine B Peterson
  - Platform for Innovative Microbiome and Translational Research (PRIME-TR), The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States

4
Chi J, Ye J, Zhou Y. A GLM-based zero-inflated generalized Poisson factor model for analyzing microbiome data. Front Microbiol 2024; 15:1394204. [PMID: 38873138 PMCID: PMC11173601 DOI: 10.3389/fmicb.2024.1394204] [Received: 03/01/2024] [Accepted: 05/20/2024] [Indexed: 06/15/2024]
Abstract
Motivation High-throughput sequencing technology facilitates the quantitative analysis of microbial communities, improving the capacity to investigate the associations between the human microbiome and diseases. Our primary motivating application is to explore the association between gut microbes and obesity. The complex characteristics of microbiome data, including high dimensionality, zero inflation, and over-dispersion, pose new statistical challenges for downstream analysis. Results We propose a GLM-based zero-inflated generalized Poisson factor analysis (GZIGPFA) model to analyze microbiome data with complex characteristics. The GZIGPFA model is based on a zero-inflated generalized Poisson (ZIGP) distribution for modeling microbiome count data. A link function between the generalized Poisson rate and the probability of excess zeros is established within the generalized linear model (GLM) framework. The latent parameters of the GZIGPFA model constitute a low-rank matrix comprising a low-dimensional score matrix and a loading matrix. An alternating maximum likelihood algorithm is employed to estimate the unknown parameters, and cross-validation is utilized to determine the rank of the model in this study. The proposed GZIGPFA model demonstrates superior performance and advantages through comprehensive simulation studies and real data applications.
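The zero-inflated generalized Poisson distribution named in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Consul's parameterization of the generalized Poisson (rate `theta > 0`, dispersion `0 <= lam < 1`) and a fixed mixing probability `pi0` for excess zeros. The paper's own parameterization and its GLM link between the rate and the zero probability may differ, and the function names are hypothetical.

```python
import math

def gp_logpmf(y, theta, lam):
    """Log-pmf of the generalized Poisson (Consul's form, assumed here)."""
    return (math.log(theta) + (y - 1) * math.log(theta + lam * y)
            - theta - lam * y - math.lgamma(y + 1))

def zigp_pmf(y, theta, lam, pi0):
    """Mixture of a point mass at zero (weight pi0) and a generalized Poisson."""
    gp = math.exp(gp_logpmf(y, theta, lam))
    if y == 0:
        return pi0 + (1.0 - pi0) * gp   # structural zeros plus sampling zeros
    return (1.0 - pi0) * gp

# Hypothetical parameter values, chosen only for illustration.
theta, lam, pi0 = 2.0, 0.3, 0.4
print("P(Y = 0) without inflation:", math.exp(gp_logpmf(0, theta, lam)))
print("P(Y = 0) with inflation:   ", zigp_pmf(0, theta, lam, pi0))
```

Setting `lam = 0` recovers the ordinary zero-inflated Poisson, while `lam > 0` adds over-dispersion, which is why the family suits sparse, over-dispersed count data.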
Affiliation(s)
- Jinling Chi
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Jimin Ye
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Ying Zhou
  - School of Mathematical Sciences, Heilongjiang University, Harbin, China

5
Ozminkowski S, Solís‐Lemus C. Identifying microbial drivers in biological phenotypes with a Bayesian network regression model. Ecol Evol 2024; 14:e11039. [PMID: 38774136 PMCID: PMC11106058 DOI: 10.1002/ece3.11039] [Received: 09/04/2023] [Revised: 01/29/2024] [Accepted: 02/03/2024] [Indexed: 05/24/2024]
Abstract
In Bayesian Network Regression models, networks are considered the predictors of continuous responses. These models have been successfully used in brain research to identify regions in the brain that are associated with specific human traits, yet their potential to elucidate microbial drivers in biological phenotypes for microbiome research remains unknown. In particular, microbial networks are challenging due to their high dimension and high sparsity compared to brain networks. Furthermore, unlike in brain connectome research, in microbiome research, it is usually expected that the presence of microbes has an effect on the response (main effects), not just the interactions. Here, we develop the first thorough investigation of whether Bayesian Network Regression models are suitable for microbial datasets on a variety of synthetic and real data under diverse biological scenarios. We test whether the Bayesian Network Regression model that accounts only for interaction effects (edges in the network) is able to identify key drivers (microbes) in phenotypic variability. We show that this model is indeed able to identify influential nodes and edges in the microbial networks that drive changes in the phenotype for most biological settings, but we also identify scenarios where this method performs poorly, which allows us to provide practical advice for domain scientists aiming to apply these tools to their datasets. BNR models provide a framework for microbiome researchers to identify connections between microbes and measured phenotypes. We allow the use of this statistical model by providing an easy-to-use implementation, which is a publicly available Julia package at https://github.com/solislemuslab/BayesianNetworkRegression.jl.
Affiliation(s)
- Samuel Ozminkowski
  - Department of Statistics and Wisconsin Institute for Discovery, University of Wisconsin‐Madison, Madison, Wisconsin, USA
- Claudia Solís‐Lemus
  - Department of Plant Pathology and Wisconsin Institute for Discovery, University of Wisconsin‐Madison, Madison, Wisconsin, USA

6
Zhang L, Zhang X, Leach JM, Rahman AF, Yi N. Bayesian compositional models for ordinal response. Stat Methods Med Res 2024:9622802241247730. [PMID: 38654396 DOI: 10.1177/09622802241247730] [Indexed: 04/25/2024]
Abstract
Ordinal response is commonly found in medicine, biology, and other fields. In many situations, the predictors for this ordinal response are compositional, which means that the sum of predictors for each sample is fixed. Examples of compositional data include the relative abundance of species in microbiome data and the relative frequency of nutrition concentrations. Moreover, the predictors that are strongly correlated tend to have similar influence on the response outcome. Conventional cumulative logistic regression models for ordinal responses ignore the fixed-sum constraint on predictors and their associated interrelationships, and thus are not appropriate for analyzing compositional predictors. To solve this problem, we proposed Bayesian Compositional Models for Ordinal Response to analyze the relationship between compositional data and an ordinal response with a structured regularized horseshoe prior for the compositional coefficients and a soft sum-to-zero restriction on coefficients through the prior distribution. The method was implemented with the R package rstan using an efficient Hamiltonian Monte Carlo algorithm. We performed simulations to compare the proposed approach and existing methods for ordinal responses. Results revealed that our proposed method outperformed the existing methods in terms of parameter estimation and prediction. We also applied the proposed method to a microbiome study, HMP2Data, to find microorganisms linked to ordinal inflammatory bowel disease levels. To make this work reproducible, the code and data used in this paper are available at https://github.com/Li-Zhang28/BCO.
Affiliation(s)
- Li Zhang
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Xinyan Zhang
  - School of Data Science and Analytics, Kennesaw State University, Kennesaw, GA, USA
- Justin M Leach
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Akm F Rahman
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Nengjun Yi
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA

7
Abstract
The microbiome represents a hidden world of tiny organisms populating not only our surroundings but also our own bodies. By enabling comprehensive profiling of these invisible creatures, modern genomic sequencing tools have given us an unprecedented ability to characterize these populations and uncover their outsize impact on our environment and health. Statistical analysis of microbiome data is critical to infer patterns from the observed abundances. The application and development of analytical methods in this area require careful consideration of the unique aspects of microbiome profiles. We begin this review with a brief overview of microbiome data collection and processing and describe the resulting data structure. We then provide an overview of statistical methods for key tasks in microbiome data analysis, including data visualization, comparison of microbial abundance across groups, regression modeling, and network inference. We conclude with a discussion and highlight interesting future directions.
Affiliation(s)
- Christine B Peterson
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Satabdi Saha
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Kim-Anh Do
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA

8
Zhang L, Zhang X, Yi N. Bayesian compositional generalized linear models for analyzing microbiome data. Stat Med 2024; 43:141-155. [PMID: 37985956 DOI: 10.1002/sim.9946] [Received: 06/05/2023] [Revised: 10/07/2023] [Accepted: 10/12/2023] [Indexed: 11/22/2023]
Abstract
The crucial impact of the microbiome on human health and disease has gained significant scientific attention. Researchers seek to connect microbiome features with health conditions, aiming to predict diseases and develop personalized medicine strategies. However, the practicality of conventional models is restricted due to important aspects of microbiome data. Specifically, the data observed is compositional, as the counts within each sample are bound by a fixed-sum constraint. Moreover, microbiome data often exhibits high dimensionality, wherein the number of variables surpasses the available samples. In addition, microbiome features exhibiting phenotypical similarity usually have similar influence on the response variable. To address the challenges posed by these aspects of the data structure, we proposed Bayesian compositional generalized linear models for analyzing microbiome data (BCGLM) with a structured regularized horseshoe prior for the compositional coefficients and a soft sum-to-zero restriction on coefficients through the prior distribution. We fitted the proposed models using Markov Chain Monte Carlo (MCMC) algorithms with the R package rstan. The performance of the proposed method was assessed by extensive simulation studies. The simulation results show that our approach outperforms existing methods with higher accuracy of coefficient estimates and lower prediction error. We also applied the proposed method to a microbiome study to find microorganisms linked to inflammatory bowel disease (IBD). To make this work reproducible, the code and data used in this article are available at https://github.com/Li-Zhang28/BCGLM.
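Why a sum-to-zero restriction on the coefficients suits compositional data can be seen in a short sketch. This is not the BCGLM implementation (which, per the abstract, is fitted with rstan and imposes the restriction softly through the prior). It uses hard centering and made-up counts and coefficients to illustrate one fact: a linear predictor on log abundances is unchanged by per-sample rescaling when the coefficients sum to zero.

```python
import math

def log_contrast(counts, beta, intercept=0.0):
    """Linear predictor on log abundances; all counts must be positive."""
    return intercept + sum(b * math.log(c) for b, c in zip(beta, counts))

def center(beta):
    """Project coefficients onto the sum-to-zero subspace (hard constraint)."""
    mean = sum(beta) / len(beta)
    return [b - mean for b in beta]

counts = [120.0, 30.0, 50.0]      # hypothetical taxon counts for one sample
beta = center([0.8, -0.1, 0.2])   # hypothetical coefficients, now summing to zero

eta_raw = log_contrast(counts, beta)
eta_scaled = log_contrast([10 * c for c in counts], beta)    # 10x sequencing depth
total = sum(counts)
eta_prop = log_contrast([c / total for c in counts], beta)   # relative abundances

print(eta_raw, eta_scaled, eta_prop)  # the three values agree up to rounding
```

Because scaling every count by a constant `k` adds `log(k) * sum(beta)` to the predictor, the constraint `sum(beta) = 0` removes any dependence on sequencing depth, so only relative abundances matter.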
Affiliation(s)
- Li Zhang
  - Department of Biostatistics, University of Alabama at Birmingham, Alabama, USA
- Xinyan Zhang
  - School of Data Science and Analytics, Kennesaw State University, Kennesaw, Georgia, USA
- Nengjun Yi
  - Department of Biostatistics, University of Alabama at Birmingham, Alabama, USA

9
Fei T, Funnell T, Waters NR, Raj SS, Sadeghi K, Dai A, Miltiadous O, Shouval R, Lv M, Peled JU, Ponce DM, Perales MA, Gönen M, van den Brink MRM. Enhanced Feature Selection for Microbiome Data using FLORAL: Scalable Log-ratio Lasso Regression. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.02.538599. [PMID: 37205350 PMCID: PMC10187229 DOI: 10.1101/2023.05.02.538599] [Indexed: 05/21/2023]
Abstract
Identifying predictive biomarkers of patient outcomes from high-throughput microbiome data is of high interest, while existing computational methods do not satisfactorily account for complex survival endpoints, longitudinal samples, and taxa-specific sequencing biases. We present FLORAL (https://vdblab.github.io/FLORAL/), an open-source computational tool to perform scalable log-ratio lasso regression and microbial feature selection for continuous, binary, time-to-event, and competing risk outcomes, with compatibility of longitudinal microbiome data as time-dependent covariates. The proposed method adapts the augmented Lagrangian algorithm for a zero-sum constraint optimization problem while enabling a two-stage screening process for extended false-positive control. In extensive simulation and real-data analyses, FLORAL achieved consistently better false-positive control compared to other lasso-based approaches, and better sensitivity over popular differential abundance testing methods for datasets with smaller sample size. In a survival analysis in allogeneic hematopoietic-cell transplant, we further demonstrated considerable improvement by FLORAL in microbial feature selection by utilizing longitudinal microbiome data over only using baseline microbiome data.
Affiliation(s)
- Teng Fei
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center
- Tyler Funnell
  - Department of Immunology, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center
- Nicholas R. Waters
  - Department of Immunology, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center
- Sandeep S. Raj
  - Department of Medicine, Memorial Sloan Kettering Cancer Center
- Keimya Sadeghi
  - Department of Immunology, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center
- Anqi Dai
  - Department of Immunology, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center
- Roni Shouval
  - Adult Bone Marrow Transplantation Service, Department of Medicine, Memorial Sloan Kettering Cancer Center
  - Department of Medicine, Weill Cornell Medical College
- Meng Lv
  - Institute of Hematology, Peking University People’s Hospital
- Jonathan U. Peled
  - Adult Bone Marrow Transplantation Service, Department of Medicine, Memorial Sloan Kettering Cancer Center
  - Department of Medicine, Weill Cornell Medical College
- Doris M. Ponce
  - Adult Bone Marrow Transplantation Service, Department of Medicine, Memorial Sloan Kettering Cancer Center
  - Department of Medicine, Weill Cornell Medical College
- Miguel-Angel Perales
  - Adult Bone Marrow Transplantation Service, Department of Medicine, Memorial Sloan Kettering Cancer Center
  - Department of Medicine, Weill Cornell Medical College
- Mithat Gönen
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center

10
Mishra AK, Mahmud I, Lorenzi PL, Jenq RR, Wargo JA, Ajami NJ, Peterson CB. TARO: tree-aggregated factor regression for microbiome data integration. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.17.562792. [PMID: 37904958 PMCID: PMC10614880 DOI: 10.1101/2023.10.17.562792] [Indexed: 11/01/2023]
Abstract
Motivation Although the human microbiome plays a key role in health and disease, the biological mechanisms underlying the interaction between the microbiome and its host are incompletely understood. Integration with other molecular profiling data offers an opportunity to characterize the role of the microbiome and elucidate therapeutic targets. However, this remains challenging due to the high dimensionality, compositionality, and rare features found in microbiome profiling data. These challenges necessitate the use of methods that can achieve structured sparsity in learning cross-platform association patterns. Results We propose Tree-Aggregated factor RegressiOn (TARO) for the integration of microbiome and metabolomic data. We leverage information on the phylogenetic tree structure to flexibly aggregate rare features. We demonstrate through simulation studies that TARO accurately recovers a low-rank coefficient matrix and identifies relevant features. We applied TARO to microbiome and metabolomic profiles gathered from subjects being screened for colorectal cancer to understand how gut microorganisms shape intestinal metabolite abundances. Availability and implementation The R package TARO implementing the proposed methods is available online at https://github.com/amishra-stats/taro-package.

11
Li G, Li Y, Chen K. It's all relative: Regression analysis with compositional predictors. Biometrics 2023; 79:1318-1329. [PMID: 35616500 PMCID: PMC9767704 DOI: 10.1111/biom.13703] [Received: 04/16/2021] [Accepted: 05/18/2022] [Indexed: 01/05/2023]
Abstract
Compositional data reside in a simplex and measure fractions or proportions of parts to a whole. Most existing regression methods for such data rely on log-ratio transformations that are inadequate or inappropriate in modeling high-dimensional data with excessive zeros and hierarchical structures. Moreover, such models usually lack a straightforward interpretation due to the interrelation between parts of a composition. We develop a novel relative-shift regression framework that directly uses proportions as predictors. The new framework provides a paradigm shift for regression analysis with compositional predictors and offers a superior interpretation of how shifting concentration between parts affects the response. New equi-sparsity and tree-guided regularization methods and an efficient smoothing proximal gradient algorithm are developed to facilitate feature aggregation and dimension reduction in regression. A unified finite-sample prediction error bound is derived for the proposed regularized estimators. We demonstrate the efficacy of the proposed methods in extensive simulation studies and a real gut microbiome study. Guided by the taxonomy of the microbiome data, the framework identifies important taxa at different taxonomic levels associated with the neurodevelopment of preterm infants.
Affiliation(s)
- Gen Li
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
- Yan Li
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
- Kun Chen
  - Department of Statistics, University of Connecticut, Connecticut, USA

12
Huang S, Ailer E, Kilbertus N, Pfister N. Supervised learning and model analysis with compositional data. PLoS Comput Biol 2023; 19:e1011240. [PMID: 37390111 PMCID: PMC10343141 DOI: 10.1371/journal.pcbi.1011240] [Received: 01/19/2023] [Revised: 07/13/2023] [Accepted: 06/03/2023] [Indexed: 07/02/2023]
Abstract
Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, due to the compositionality and sparsity, existing techniques are often inadequate. Either they rely on extensions of the linear log-contrast model (which adjust for compositionality but cannot account for complex signals or sparsity) or they are based on black-box machine learning methods (which may capture useful signals, but lack interpretability due to the compositionality). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on par or improved predictive performance compared with state-of-the-art machine learning methods on 33 publicly available microbiome datasets. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast coefficients to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. KernelBiome is available as an open-source Python package on PyPI and at https://github.com/shimenghuang/KernelBiome.
Affiliation(s)
- Shimeng Huang
  - Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
- Niki Kilbertus
  - Helmholtz Munich, Munich, Germany
  - Technical University of Munich, Munich, Germany
- Niklas Pfister
  - Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark

13
Scott DAV, Benavente E, Libiseller-Egger J, Fedorov D, Phelan J, Ilina E, Tikhonova P, Kudryavstev A, Galeeva J, Clark T, Lewin A. Bayesian compositional regression with microbiome features via variational inference. BMC Bioinformatics 2023; 24:210. [PMID: 37217852 PMCID: PMC10201722 DOI: 10.1186/s12859-023-05219-x] [Received: 12/08/2022] [Accepted: 03/02/2023] [Indexed: 05/24/2023]
Abstract
The microbiome plays a key role in the health of the human body. Interest often lies in finding features of the microbiome, alongside other covariates, which are associated with a phenotype of interest. One important property of microbiome data, which is often overlooked, is its compositionality as it can only provide information about the relative abundance of its constituting components. Typically, these proportions vary by several orders of magnitude in datasets of high dimensions. To address these challenges we develop a Bayesian hierarchical linear log-contrast model which is estimated by mean field Monte-Carlo co-ordinate ascent variational inference (CAVI-MC) and easily scales to high dimensional data. We use novel priors which account for the large differences in scale and constrained parameter space associated with the compositional covariates. A reversible jump Monte Carlo Markov chain guided by the data through univariate approximations of the variational posterior probability of inclusion, with proposal parameters informed by approximating variational densities via auxiliary parameters, is used to estimate intractable marginal expectations. We demonstrate that our proposed Bayesian method performs favourably against existing frequentist state of the art compositional data analysis methods. We then apply the CAVI-MC to the analysis of real data exploring the relationship of the gut microbiome to body mass index.
Affiliation(s)
- Darren A. V. Scott
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, United Kingdom
- Ernest Benavente
  - Laboratory of Experimental Cardiology, University Medical Center Utrecht, Utrecht University, Utrecht, Netherlands
- Julian Libiseller-Egger
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, United Kingdom
- Dmitry Fedorov
  - Federal Research and Clinical Center of Physical-Chemical Medicine, Moscow, Russia
- Jody Phelan
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, United Kingdom
- Elena Ilina
  - Federal Research and Clinical Center of Physical-Chemical Medicine, Moscow, Russia
- Polina Tikhonova
  - Federal Research and Clinical Center of Physical-Chemical Medicine, Moscow, Russia
  - Bioinformatics and Genomics Intercollege Graduate Program, Huck Institutes of Life Sciences, Pennsylvania State University, Pennsylvania, USA
- Julia Galeeva
  - Federal Research and Clinical Center of Physical-Chemical Medicine, Moscow, Russia
- Taane Clark
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, United Kingdom
- Alex Lewin
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, United Kingdom
14
Fu J, Koslovsky MD, Neophytou AM, Vannucci M. A Bayesian joint model for compositional mediation effect selection in microbiome data. Stat Med 2023. [PMID: 37173609 DOI: 10.1002/sim.9764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2022] [Revised: 04/17/2023] [Accepted: 04/26/2023] [Indexed: 05/15/2023]
Abstract
Analyzing multivariate count data generated by high-throughput sequencing technology in microbiome research studies is challenging due to the high-dimensional and compositional structure of the data and overdispersion. In practice, researchers are often interested in investigating how the microbiome may mediate the relation between an assigned treatment and an observed phenotypic response. Existing approaches designed for compositional mediation analysis are unable to simultaneously determine the presence of direct effects, relative indirect effects, and overall indirect effects, while quantifying their uncertainty. We propose a formulation of a Bayesian joint model for compositional data that allows for the identification, estimation, and uncertainty quantification of various causal estimands in high-dimensional mediation analysis. We conduct simulation studies and compare our method's mediation effects selection performance with existing methods. Finally, we apply our method to a benchmark data set investigating the sub-therapeutic antibiotic treatment effect on body weight in early-life mice.
Affiliation(s)
- Jingyan Fu
  - Department of Statistics, Rice University, Houston, Texas, USA
- Matthew D Koslovsky
  - Department of Statistics, Colorado State University, Fort Collins, Colorado, USA
- Andreas M Neophytou
  - Department of Environmental & Radiological Health Sciences, Colorado State University, Fort Collins, Colorado, USA
- Marina Vannucci
  - Department of Statistics, Rice University, Houston, Texas, USA
15
Hong Q, Chen G, Tang ZZ. PhyloMed: a phylogeny-based test of mediation effect in microbiome. Genome Biol 2023; 24:72. [PMID: 37041566 PMCID: PMC10088256 DOI: 10.1186/s13059-023-02902-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2022] [Accepted: 03/15/2023] [Indexed: 04/13/2023] Open
Abstract
Microbiome data from sequencing experiments contain the relative abundance of a large number of microbial taxa with their evolutionary relationships represented by a phylogenetic tree. The compositional and high-dimensional nature of the microbiome mediator challenges the validity of standard mediation analyses. We propose a phylogeny-based mediation analysis method called PhyloMed to address this challenge. Unlike existing methods that directly identify individual mediating taxa, PhyloMed discovers mediation signals by analyzing subcompositions defined on the phylogenetic tree. PhyloMed produces well-calibrated mediation test p-values and yields substantially higher discovery power than existing methods.
Affiliation(s)
- Qilin Hong
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715 USA
- Guanhua Chen
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715 USA
- Zheng-Zheng Tang
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715 USA
16
Pedone M, Amedei A, Stingo FC. Subject-specific Dirichlet-multinomial regression for multi-district microbiota data analysis. Ann Appl Stat 2023. [DOI: 10.1214/22-aoas1641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
Affiliation(s)
- Matteo Pedone
  - Department of Statistics, Computer Science, Applications, University of Florence
- Amedeo Amedei
  - Department of Clinical and Experimental Medicine, University of Florence
- Francesco C. Stingo
  - Department of Statistics, Computer Science, Applications, University of Florence
17
Busato S, Gordon M, Chaudhari M, Jensen I, Akyol T, Andersen S, Williams C. Compositionality, sparsity, spurious heterogeneity, and other data-driven challenges for machine learning algorithms within plant microbiome studies. CURRENT OPINION IN PLANT BIOLOGY 2023; 71:102326. [PMID: 36538837 PMCID: PMC9925409 DOI: 10.1016/j.pbi.2022.102326] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/31/2022] [Revised: 11/08/2022] [Accepted: 11/21/2022] [Indexed: 06/17/2023]
Abstract
The plant-associated microbiome is a key component of plant systems, contributing to their health, growth, and productivity. The application of machine learning (ML) in this field promises to help untangle the relationships involved. However, measurements of microbial communities by high-throughput sequencing pose challenges for ML. Noise from low sample sizes, soil heterogeneity, and technical factors can impact the performance of ML. Additionally, the compositional and sparse nature of these datasets can impact the predictive accuracy of ML. We review recent literature from plant studies to illustrate that these properties often go unmentioned. We expand our analysis to other fields to quantify the degree to which mitigation approaches improve the performance of ML and describe the mathematical basis for this. With the advent of accessible analytical packages for microbiome data including learning models, researchers must be familiar with the nature of their datasets.
Affiliation(s)
- Sebastiano Busato
  - Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA
- Max Gordon
  - Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA
- Meenal Chaudhari
  - Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA
- Ib Jensen
  - Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark
- Turgut Akyol
  - Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark
- Stig Andersen
  - Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark
- Cranos Williams
  - Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA; Department of Plant and Microbial Biology, North Carolina State University, Raleigh, USA
18
Lee S, Jung S, Lourenco J, Pringle D, Ahn J. Resampling-based inferences for compositional regression with application to beef cattle microbiomes. Stat Methods Med Res 2023; 32:151-164. [PMID: 36267026 DOI: 10.1177/09622802221133550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Abstract
Gut microbiomes are increasingly found to be associated with many health-related characteristics of humans as well as animals. Regression with compositional microbiomes covariates is commonly used to identify important bacterial taxa that are related to various phenotype responses. Often the dimension of microbiome taxa easily exceeds the number of available samples, which creates a serious challenge in the estimation and inference of the model. The sparse log-contrast regression method is useful for such cases as it can yield a model estimate that depends on only a small number of taxa. However, a formal statistical inference procedure for individual regression coefficients has not been properly established yet. We propose a new estimation and inference procedure for linear regression models with extremely low-sample-sized compositional predictors. Under the compositional log-contrast regression framework, the proposed approach consists of two steps. The first step is to screen relevant predictors by fitting a log-contrast model with a sparse penalty. The screened-in variables are used as predictors in the non-sparse log-contrast model in the second step, where each of the regression coefficients is tested using nonparametric, resampling-based methods such as permutation and bootstrap. The performances of the proposed methods are evaluated by a simulation study, which shows they outperform traditional approaches based on normal assumptions or large sample asymptotics. Application to steer microbiome data successfully identifies key bacterial taxa that are related to important cattle quality measures.
Affiliation(s)
- Sujin Lee
  - Department of Statistics, Seoul National University, Seoul, Republic of Korea
- Sungkyu Jung
  - Department of Statistics, Seoul National University, Seoul, Republic of Korea
- Jeferson Lourenco
  - Department of Animal and Dairy Science, University of Georgia, Athens, GA, USA
- Dean Pringle
  - Department of Animal and Dairy Science, University of Georgia, Athens, GA, USA
- Jeongyoun Ahn
  - Department of Industrial and Systems Engineering, KAIST, Daejeon, Republic of Korea
19
Fan Z, Kernan KF, Sriram A, Benos PV, Canna SW, Carcillo JA, Kim S, Park HJ. Deep neural networks with knockoff features identify nonlinear causal relations and estimate effect sizes in complex biological systems. Gigascience 2022; 12:giad044. [PMID: 37395630 PMCID: PMC10316696 DOI: 10.1093/gigascience/giad044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2022] [Revised: 01/31/2023] [Accepted: 05/29/2023] [Indexed: 07/04/2023] Open
Abstract
BACKGROUND Learning the causal structure helps identify risk factors, disease mechanisms, and candidate therapeutics for complex diseases. However, although complex biological systems are characterized by nonlinear associations, existing bioinformatic methods of causal inference cannot identify the nonlinear relationships and estimate their effect size. RESULTS To overcome these limitations, we developed the first computational method that explicitly learns nonlinear causal relations and estimates the effect size using a deep neural network approach coupled with the knockoff framework, named causal directed acyclic graphs using deep learning variable selection (DAG-deepVASE). Using simulation data of diverse scenarios and identifying known and novel causal relations in molecular and clinical data of various diseases, we demonstrated that DAG-deepVASE consistently outperforms existing methods in identifying true and known causal relations. In the analyses, we also illustrate how identifying nonlinear causal relations and estimating their effect size help understand the complex disease pathobiology, which is not possible using other methods. CONCLUSIONS With these advantages, the application of DAG-deepVASE can help identify driver genes and therapeutic agents in biomedical studies and clinical trials.
Affiliation(s)
- Zhenjiang Fan
  - Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Kate F Kernan
  - Division of Pediatric Critical Care Medicine, Department of Critical Care Medicine, Children's Hospital of Pittsburgh, Center for Critical Care Nephrology and Clinical Research Investigation and Systems Modeling of Acute Illness Center, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Aditya Sriram
  - Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Panayiotis V Benos
  - Department of Epidemiology, University of Florida, Gainesville, FL 32610, USA
- Scott W Canna
  - Pediatric Rheumatology, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Joseph A Carcillo
  - Division of Pediatric Critical Care Medicine, Department of Critical Care Medicine, Children's Hospital of Pittsburgh, Center for Critical Care Nephrology and Clinical Research Investigation and Systems Modeling of Acute Illness Center, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Soyeon Kim
  - Division of Pediatric Pulmonary Medicine, Children's Hospital of Pittsburgh, Pittsburgh, PA 15224, USA
  - Department of Pediatrics, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15224, USA
- Hyun Jung Park
  - Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA 15213, USA
20
Okazaki A, Kawano S. Multi-Task Learning for Compositional Data via Sparse Network Lasso. ENTROPY (BASEL, SWITZERLAND) 2022; 24:1839. [PMID: 36554244 PMCID: PMC9777680 DOI: 10.3390/e24121839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/31/2022] [Revised: 12/13/2022] [Accepted: 12/15/2022] [Indexed: 06/17/2023]
Abstract
Multi-task learning is a statistical methodology that aims to improve the generalization performances of estimation and prediction tasks by sharing common information among multiple tasks. On the other hand, compositional data consist of proportions as components summing to one. Because components of compositional data depend on each other, existing methods for multi-task learning cannot be directly applied to them. In the framework of multi-task learning, a network lasso regularization enables us to consider each sample as a single task and construct different models for each one. In this paper, we propose a multi-task learning method for compositional data using a sparse network lasso. We focus on a symmetric form of the log-contrast model, which is a regression model with compositional covariates. Our proposed method enables us to extract latent clusters and relevant variables for compositional data by considering relationships among samples. The effectiveness of the proposed method is evaluated through simulation studies and application to gut microbiome data. Both results show that the prediction accuracy of our proposed method is better than existing methods when information about relationships among samples is appropriately obtained.
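The reason compositional covariates call for a log-contrast model can be illustrated with a minimal sketch. It is not the authors' sparse network lasso method: it only shows the plain linear log-contrast predictor with a zero-sum constraint on the coefficients, and the function names, counts, and coefficient values are hypothetical. Because the coefficients sum to zero, the predictor is the same whether it is fed raw positive counts or the proportions obtained by closing them to sum to one.

```python
import math


def closure(counts):
    """Rescale strictly positive counts to proportions that sum to one."""
    total = sum(counts)
    return [c / total for c in counts]


def log_contrast(x, beta, intercept=0.0):
    """Linear log-contrast predictor: intercept + sum_j beta_j * log(x_j).

    The coefficients must sum to zero; this constraint makes the
    predictor invariant to rescaling x by any positive constant.
    """
    if abs(sum(beta)) > 1e-12:
        raise ValueError("log-contrast coefficients must sum to zero")
    return intercept + sum(b * math.log(v) for b, v in zip(beta, x))


# Hypothetical data: strictly positive counts for three components
# and illustrative coefficients that sum to zero.
counts = [120.0, 30.0, 50.0]
beta = [0.5, -0.2, -0.3]

raw = log_contrast(counts, beta)            # predictor on raw counts
rel = log_contrast(closure(counts), beta)   # predictor on proportions
print(abs(raw - rel) < 1e-9)
```

Zero counts would make the logarithm undefined, so sparse data need a zero-handling step that this sketch deliberately leaves out.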
Affiliation(s)
- Akira Okazaki
  - Graduate School of Informatics and Engineering, The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu 182-8585, Tokyo, Japan
- Shuichi Kawano
  - Faculty of Mathematics, Kyushu University, 744 Motooka, Nishi-ku 819-0395, Fukuoka, Japan
21
Carson TL, Buro AW, Miller D, Peña A, Ard JD, Lampe JW, Yi N, Lefkowitz E, William VDP, Morrow C, Wilson L, Barnes S, Demark-Wahnefried W. Rationale and study protocol for a randomized controlled feeding study to determine the structural- and functional-level effects of diet-specific interventions on the gut microbiota of non-Hispanic black and white adults. Contemp Clin Trials 2022; 123:106968. [PMID: 36265810 PMCID: PMC10095329 DOI: 10.1016/j.cct.2022.106968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2022] [Revised: 10/12/2022] [Accepted: 10/13/2022] [Indexed: 01/27/2023]
Abstract
BACKGROUND Colorectal cancer (CRC), the third leading cause of cancer-related deaths in the US, has been associated with an overrepresentation or paucity of several microbial taxa in the gut microbiota, but causality has not been established. Black men and women have among the highest CRC incidence and mortality rates of any racial/ethnic group. This study will examine the impact of the Dietary Approaches to Stop Hypertension (DASH) diet on gut microbiota and fecal metabolites associated with CRC risk. METHODS A generally healthy sample of non-Hispanic Black and white adults (n = 112) is being recruited to participate in a parallel-arm randomized controlled feeding study. Participants are randomized to receive the DASH diet or a standard American diet for a 28-day period. Fecal samples are collected weekly throughout the study to analyze changes in the gut microbiota using 16S rRNA and selected metagenomics. Differences in bacterial alpha and beta diversity and taxa that have been associated with CRC (Bacteroides, Fusobacterium, Clostridium, Lactobacillus, Bifidobacterium, Ruminococcus, Porphyromonas, Succinivibrio) are being evaluated. Covariate measures include body mass index, comorbidities, medication history, physical activity, stress, and demographic characteristics. CONCLUSION Our findings will provide preliminary evidence for the DASH diet as an approach for cultivating a healthier gut microbiota across non-Hispanic Black and non-Hispanic White adults. These results can impact clinical, translational, and population-level approaches for modification of the gut microbiota to reduce risk of chronic diseases including CRC. TRIAL REGISTRATION This study was registered on ClinicalTrials.gov, identifier NCT04538482, on September 4, 2020 (https://clinicaltrials.gov/ct2/show/NCT04538482).
Affiliation(s)
- Tiffany L Carson
  - Department of Health Outcomes and Behavior, Moffitt Cancer Center, Tampa, FL, United States of America
- Acadia W Buro
  - Department of Health Outcomes and Behavior, Moffitt Cancer Center, Tampa, FL, United States of America
- Darci Miller
  - Department of Health Outcomes and Behavior, Moffitt Cancer Center, Tampa, FL, United States of America
- Alissa Peña
  - Department of Health Outcomes and Behavior, Moffitt Cancer Center, Tampa, FL, United States of America
- Jamy D Ard
  - Department of Epidemiology and Prevention, Wake Forest School of Medicine, Winston-Salem, NC, United States of America
- Johanna W Lampe
  - Public Health Science Division, Fred Hutchinson Cancer Center, Seattle, WA, United States of America
- Nengjun Yi
  - Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Elliot Lefkowitz
  - Center for Clinical and Translational Sciences, University of Alabama at Birmingham, Birmingham, AL, United States of America; Department of Microbiology, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Van Der Pol William
  - Center for Clinical and Translational Sciences, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Casey Morrow
  - Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Landon Wilson
  - Department of Pharmacology and Toxicology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States of America; Targeted Metabolomics and Proteomics Laboratory, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Stephen Barnes
  - Department of Pharmacology and Toxicology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States of America; Targeted Metabolomics and Proteomics Laboratory, University of Alabama at Birmingham, Birmingham, AL, United States of America
- Wendy Demark-Wahnefried
  - O'Neal Comprehensive Cancer Center, University of Alabama at Birmingham, Birmingham, AL, United States of America
22
Rashidi A, Peled JU, Ebadi M, Rehman TU, Elhusseini H, Marcello LT, Halaweish H, Kaiser T, Holtan SG, Khoruts A, Weisdorf DJ, Staley C. Protective Effect of Intestinal Blautia Against Neutropenic Fever in Allogeneic Transplant Recipients. Clin Infect Dis 2022; 75:1912-1920. [PMID: 35435976 DOI: 10.1093/cid/ciac299] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2022] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Neutropenic fever (NF) occurs in >70% of hematopoietic cell transplant (HCT) recipients, without a documented cause in most cases. Antibiotics used to prevent and treat NF disrupt the gut microbiota; these disruptions predict a higher posttransplantation mortality rate. We hypothesized that specific features in the gut microbial community may mediate the risk of NF. METHODS We searched a large gut microbiota database in allogeneic HCT recipients (12 546 stool samples; 1278 patients) to find pairs with NF (cases) versus without NF (controls) on the same day relative to transplantation and with a stool sample on the previous day. A total of 179 such pairs were matched as to the underlying disease and graft source. Several other important clinical variables were similar between the groups. RESULTS The gut microbiota of cases on the day before NF occurrence had a lower abundance of Blautia than their matched controls on the same day after transplantation, suggesting a protective role for Blautia. Microbiota network analysis did not find any differences in community structure between the groups, suggesting a single-taxon effect. To identify putative mechanisms, we searched a gut microbiome and serum metabolome database of patients with acute leukemia receiving chemotherapy and identified 139 serum samples collected within 24 hours after a stool sample from the same patient. Greater Blautia abundances predicted higher levels of next-day citrulline, a biomarker of total enterocyte mass. CONCLUSIONS These findings support a model in which Blautia protects against NF by improving intestinal health. Therapeutic restoration of Blautia may help prevent NF, thus reducing antibiotic exposures and transplantation-related deaths.
Affiliation(s)
- Armin Rashidi
  - Division of Hematology, Oncology, and Transplantation, Department of Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Jonathan U Peled
  - Adult Bone Marrow Transplantation Service, Memorial Sloan Kettering Cancer Center and Weill Cornell Medical College, New York, New York, USA
- Maryam Ebadi
  - Division of Hematology, Oncology, and Transplantation, Department of Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Tauseef Ur Rehman
  - Division of Hematology, Oncology, and Transplantation, Department of Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Heba Elhusseini
  - Division of Hematology, Oncology, and Transplantation, Department of Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- LeeAnn T Marcello
  - Adult Bone Marrow Transplantation Service, Memorial Sloan Kettering Cancer Center and Weill Cornell Medical College, New York, New York, USA
- Hossam Halaweish
  - Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Thomas Kaiser
  - Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Shernan G Holtan
  - Division of Hematology, Oncology, and Transplantation, Department of Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Alexander Khoruts
  - Division of Gastroenterology, Hepatology, and Nutrition, Department of Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Daniel J Weisdorf
  - Division of Hematology, Oncology, and Transplantation, Department of Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Christopher Staley
  - Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA; BioTechnology Institute, University of Minnesota, St Paul, Minnesota, USA
23
Li D, Srinivasan A, Chen Q, Xue L. Robust Covariance Matrix Estimation for High-Dimensional Compositional Data with Application to Sales Data Analysis. JOURNAL OF BUSINESS & ECONOMIC STATISTICS : A PUBLICATION OF THE AMERICAN STATISTICAL ASSOCIATION 2022; 41:1090-1100. [PMID: 38125739 PMCID: PMC10730115 DOI: 10.1080/07350015.2022.2106990] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/23/2023]
Abstract
Compositional data arises in a wide variety of research areas when some form of standardization and composition is necessary. Estimating covariance matrices is of fundamental importance for high-dimensional compositional data analysis. However, existing methods require the restrictive Gaussian or sub-Gaussian assumption, which may not hold in practice. We propose a robust composition adjusted thresholding covariance procedure based on Huber-type M-estimation to estimate the sparse covariance structure of high-dimensional compositional data. We introduce a cross-validation procedure to choose the tuning parameters of the proposed method. Theoretically, by assuming a bounded fourth moment condition, we obtain the rates of convergence and signal recovery property for the proposed method and provide the theoretical guarantees for the cross-validation procedure under the high-dimensional setting. Numerically, we demonstrate the effectiveness of the proposed method in simulation studies and also a real application to sales data analysis.
Affiliation(s)
- Danning Li
  - School of Mathematics and Statistics and KLAS, Northeast Normal University
- Qian Chen
  - Department of Supply Chain and Information Systems, Penn State University
24
Identification of microbial features in multivariate regression under false discovery rate control. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
25
Jiang L, Haiminen N, Carrieri A, Huang S, Vázquez‐Baeza Y, Parida L, Kim H, Swafford AD, Knight R, Natarajan L. Utilizing stability criteria in choosing feature selection methods yields reproducible results in microbiome data. Biometrics 2022; 78:1155-1167. [PMID: 33914902 PMCID: PMC9787628 DOI: 10.1111/biom.13481] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2020] [Revised: 02/25/2021] [Accepted: 04/14/2021] [Indexed: 12/31/2022]
Abstract
Feature selection is indispensable in microbiome data analysis, but it can be particularly challenging as microbiome data sets are high dimensional, underdetermined, sparse and compositional. Great efforts have recently been made on developing new methods for feature selection that handle the above data characteristics, but almost all methods were evaluated based on performance of model predictions. However, little attention has been paid to address a fundamental question: how appropriate are those evaluation criteria? Most feature selection methods often control the model fit, but the ability to identify meaningful subsets of features cannot be evaluated simply based on the prediction accuracy. If tiny changes to the data would lead to large changes in the chosen feature subset, then many selected features are likely to be a data artifact rather than real biological signal. This crucial need of identifying relevant and reproducible features motivated the reproducibility evaluation criterion such as Stability, which quantifies how robust a method is to perturbations in the data. In our paper, we compare the performance of popular model prediction metrics (MSE or AUC) with proposed reproducibility criterion Stability in evaluating four widely used feature selection methods in both simulations and experimental microbiome applications with continuous or binary outcomes. We conclude that Stability is a preferred feature selection criterion over model prediction metrics because it better quantifies the reproducibility of the feature selection method.
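The idea that a good feature selection method should pick similar features under small perturbations of the data can be sketched as follows. This is an illustrative stand-in, not the Stability criterion used in the paper: it scores a method by the mean pairwise Jaccard similarity of the feature sets it selects on resampled data, and the function names and taxon labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(selected_sets):
    """Mean pairwise Jaccard similarity across repeated selections.

    Each element of selected_sets is the set of features chosen on one
    perturbed copy of the data (for example, one bootstrap resample).
    """
    pairs = list(combinations(selected_sets, 2))
    if not pairs:
        raise ValueError("need at least two selected feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three resamples of the same data set.
stable = [{"taxon_1", "taxon_2"}, {"taxon_1", "taxon_2"}, {"taxon_1", "taxon_2"}]
unstable = [{"taxon_1", "taxon_2"}, {"taxon_3", "taxon_4"}, {"taxon_5", "taxon_6"}]

print(selection_stability(stable), selection_stability(unstable))
```

A method can score well on MSE or AUC and still behave like the `unstable` case, which is the gap between prediction metrics and reproducibility that the abstract points to.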
Collapse
Affiliation(s)
- Lingjing Jiang
- Division of Biostatistics, University of California San Diego, La Jolla, California, USA
| | - Niina Haiminen
- IBM T. J. Watson Research Center, Yorktown Heights, New York, USA
| | | | - Shi Huang
- Center for Microbiome Innovation, Jacobs School of Engineering, UC San Diego, La Jolla, California, USA; Department of Pediatrics, University of California San Diego, La Jolla, California, USA
| | - Yoshiki Vázquez‐Baeza
- Center for Microbiome Innovation, Jacobs School of Engineering, UC San Diego, La Jolla, California, USA; Department of Pediatrics, University of California San Diego, La Jolla, California, USA
| | - Laxmi Parida
- IBM T. J. Watson Research Center, Yorktown Heights, New York, USA
| | - Ho‐Cheol Kim
- Scalable Knowledge Intelligence, IBM Research‐Almaden, San Jose, California, USA
| | - Austin D. Swafford
- Center for Microbiome Innovation, Jacobs School of Engineering, UC San Diego, La Jolla, California, USA
| | - Rob Knight
- Center for Microbiome Innovation, Jacobs School of Engineering, UC San Diego, La Jolla, California, USA; Department of Pediatrics, University of California San Diego, La Jolla, California, USA; Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA; Department of Bioengineering, University of California San Diego, La Jolla, California, USA
| | - Loki Natarajan
- Division of Biostatistics, University of California San Diego, La Jolla, California, USA
| |
|
26
|
Bucket Fuser: Statistical Signal Extraction for 1D 1H NMR Metabolomic Data. Metabolites 2022; 12:metabo12090812. [PMID: 36144216 PMCID: PMC9501206 DOI: 10.3390/metabo12090812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 08/20/2022] [Accepted: 08/25/2022] [Indexed: 11/17/2022] Open
Abstract
Untargeted metabolomics is a promising tool for identifying novel disease biomarkers and unraveling underlying pathomechanisms. Nuclear magnetic resonance (NMR) spectroscopy is particularly suited for large-scale untargeted metabolomics studies due to its high reproducibility and cost effectiveness. Here, one-dimensional (1D) 1H NMR experiments offer good sensitivity at reasonable measurement times. Their subsequent data analysis requires sophisticated data preprocessing steps, including the extraction of NMR features corresponding to specific metabolites. We developed a novel 1D NMR feature extraction procedure, called Bucket Fuser (BF), which is based on a regularized regression framework with fused group LASSO terms. The performance of the BF procedure was demonstrated using three independent NMR datasets and was benchmarked against existing state-of-the-art NMR feature extraction methods. BF dynamically constructs NMR metabolite features, the widths of which can be adjusted via a regularization parameter. BF consistently improved metabolite signal extraction, as demonstrated by our correlation analyses with absolutely quantified metabolites. It also yielded a higher proportion of statistically significant metabolite features in our differential metabolite analyses. The BF algorithm is computationally efficient and can deal with small sample sizes. In summary, the Bucket Fuser algorithm, which is available as supplementary Python code, facilitates the fast and dynamic extraction of 1D NMR signals for the improved detection of metabolic biomarkers.
|
27
|
Monti GS, Filzmoser P. A robust knockoff filter for sparse regression analysis of microbiome compositional data. Comput Stat 2022. [DOI: 10.1007/s00180-022-01268-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
Microbiome data analysis often relies on the identification of a subset of potential biomarkers associated with a clinical outcome of interest. Robust ZeroSum regression, an elastic-net penalized compositional regression built on the least trimmed squares estimator, is a variable selection procedure capable of coping with the high dimensionality of these data and their compositional nature while guaranteeing robustness against the presence of outliers. The need to discover “true” effects and to improve clinical research quality and reproducibility has motivated us to propose a two-step robust compositional knockoff filter procedure, which selects the set of relevant biomarkers, among the many measured features, that have a nonzero effect on the response, while controlling the expected fraction of false positives. We demonstrate the effectiveness of our proposal in an extensive simulation study, and illustrate its usefulness in an application to intestinal microbiome analysis.
|
28
|
Coenders G, Greenacre M. Three approaches to supervised learning for compositional data with pairwise logratios. J Appl Stat 2022; 50:3272-3293. [PMID: 37969895 PMCID: PMC10637191 DOI: 10.1080/02664763.2022.2108007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2022] [Accepted: 07/25/2022] [Indexed: 10/15/2022]
Abstract
Logratios between pairs of compositional parts (pairwise logratios) are the easiest to interpret in compositional data analysis, and include the well-known additive logratios as particular cases. When the number of parts is large (sometimes even larger than the number of cases), some form of logratio selection is needed. In this article, we present three alternative stepwise supervised learning methods to select the pairwise logratios that best explain a dependent variable in a generalized linear model, each geared for a specific problem. The first method features unrestricted search, where any pairwise logratio can be selected. This method has a complex interpretation if some pairs of parts in the logratios overlap, but it leads to the most accurate predictions. The second method restricts parts to occur only once, which makes the corresponding logratios intuitively interpretable. The third method uses additive logratios, so that K-1 selected logratios involve a K-part subcomposition. Our approach allows logratios or non-compositional covariates to be forced into the models based on theoretical knowledge, and various stopping criteria are available based on information measures or statistical significance with the Bonferroni correction. We present an application on a dataset from a study predicting Crohn's disease.
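As a small illustration of the quantities being selected, the sketch below computes every pairwise logratio of a single composition and then picks out the additive logratios as the special case with a fixed reference part. It covers only the logratio construction, not the stepwise selection procedures of the article; the function name and the toy composition are assumptions.

```python
import numpy as np


def pairwise_logratios(x):
    """Return a dict mapping (i, j), i < j, to log(x[i] / x[j])."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("all parts must be strictly positive")
    logs = np.log(x)
    return {(i, j): logs[i] - logs[j]
            for i in range(len(x)) for j in range(i + 1, len(x))}


composition = np.array([0.2, 0.3, 0.5])   # a 3-part composition
ratios = pairwise_logratios(composition)

# Additive logratios are the special case with a fixed reference part,
# here the last part (index 2).
alr = [ratios[(i, 2)] for i in range(2)]
```

Because each value is a log of a ratio, rescaling the whole composition (for example, from proportions to counts) leaves every logratio unchanged.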
Affiliation(s)
- Germà Coenders
- Department of Economics, Universitat de Girona, Girona, Spain
| | - Michael Greenacre
- Department of Economics and Business and Barcelona School of Management, Universitat Pompeu Fabra, Barcelona, Spain
| |
|
29
|
Zhou F, He K, Li Q, Chapkin RS, Ni Y. Bayesian biclustering for microbial metagenomic sequencing data via multinomial matrix factorization. Biostatistics 2022; 23:891-909. [PMID: 33634824 PMCID: PMC9291645 DOI: 10.1093/biostatistics/kxab002] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 05/17/2020] [Revised: 10/08/2020] [Accepted: 01/10/2021] [Indexed: 12/26/2022] Open
Abstract
High-throughput sequencing technology provides unprecedented opportunities to quantitatively explore the human gut microbiome and its relation to diseases. Microbiome data are compositional, sparse, noisy, and heterogeneous, which poses serious challenges for statistical modeling. We propose an identifiable Bayesian multinomial matrix factorization model to infer overlapping clusters on both microbes and hosts. The proposed method represents the observed over-dispersed zero-inflated count matrix as Dirichlet-multinomial mixtures on which latent cluster structures are built hierarchically. Under the Bayesian framework, the number of clusters is automatically determined and available information from a taxonomic rank tree of microbes is naturally incorporated, which greatly improves the interpretability of our findings. We demonstrate the utility of the proposed approach by comparing it to alternative methods in simulations. An application to a human gut microbiome data set involving patients with inflammatory bowel disease reveals interesting clusters, which contain the bacterial families Bacteroidaceae, Bifidobacteriaceae, Enterobacteriaceae, Fusobacteriaceae, Lachnospiraceae, Ruminococcaceae, Pasteurellaceae, and Porphyromonadaceae that are known to be related to inflammatory bowel disease and its subtypes according to the biological literature. Our findings can help generate potential hypotheses for future investigation of the heterogeneity of the human gut microbiome.
Affiliation(s)
- Fangting Zhou
- Department of Statistics, Texas A&M University, College Station, TX, USA and Institute of Statistics and Big Data, Renmin University of China, Beijing, China
| | - Kejun He
- Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China, Beijing, China
| | - Qiwei Li
- Department of Mathematical Sciences, The University of Texas at Dallas, Dallas, TX, USA
| | - Robert S Chapkin
- Department of Nutrition and Food Science, Texas A&M University, College Station, TX, USA
| | - Yang Ni
- Department of Statistics, Texas A&M University, College Station, TX, USA
| |
|
30
|
Verster A, Petronella N, Green J, Matias F, Brooks SPJ. A Bayesian method for identifying associations between response variables and bacterial community composition. PLoS Comput Biol 2022; 18:e1010108. [PMID: 35793382 PMCID: PMC9307184 DOI: 10.1371/journal.pcbi.1010108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2021] [Revised: 07/22/2022] [Accepted: 04/14/2022] [Indexed: 11/18/2022] Open
Abstract
Determining associations between intestinal bacteria and continuously measured physiological outcomes is important for understanding the bacteria-host relationship but is not straightforward since abundance data (compositional data) are not normally distributed. To address this issue, we developed a fully Bayesian linear regression model (BRACoD; Bayesian Regression Analysis of Compositional Data) with physiological measurements (continuous data) as a function of a matrix of relative bacterial abundances. Bacteria can be classified as operational taxonomic units or by taxonomy (genus, family, etc.). Bacteria associated with the physiological measurement were identified using a Bayesian variable selection method: Stochastic Search Variable Selection. The output is a list of inclusion probabilities ([Formula: see text]) and coefficients that indicate the strength of the association ([Formula: see text]) for each bacterial taxon. Tests with simulated communities showed that adopting a cut point value of [Formula: see text] ≥ 0.3 for identifying included bacteria optimized the true positive rate (TPR) while maintaining a false positive rate (FPR) of ≤ 5%. At this point, the chances of identifying non-contributing bacteria were low and all well-established contributors were included. Comparison with other methods showed that BRACoD (at [Formula: see text] ≥ 0.3) had higher precision and a higher TPR than a commonly used center log transformed LASSO procedure (clr-LASSO) as well as a higher TPR than an off-the-shelf Spike and Slab method after center log transformation (clr-SS). BRACoD was also less likely to include non-contributing bacteria that merely correlate with contributing bacteria. Analysis of a rat microbiome experiment identified 47 operational taxonomic units that contributed to fecal butyrate levels. Of these, 31 were positively and 16 negatively associated with butyrate. Consistent with their known role in butyrate metabolism, most of these fell within the Lachnospiraceae and Ruminococcaceae. We conclude that BRACoD provides a more precise and accurate method for determining bacteria associated with a continuous physiological outcome compared to clr-LASSO. It is more sensitive than a generalized clr-SS algorithm, although it has a higher FPR. Its ability to distinguish genuine contributors from correlated bacteria makes it better suited to discriminating bacteria that directly contribute to an outcome. The algorithm corrects for the distortions arising from compositional data, making it appropriate for analysis of microbiome data.
Affiliation(s)
- Adrian Verster
- Bureau of Food Surveillance and Science Integration, Food Directorate, Health Products and Food Branch, Health Canada, Ottawa, Canada
| | - Nicholas Petronella
- Bureau of Food Surveillance and Science Integration, Food Directorate, Health Products and Food Branch, Health Canada, Ottawa, Canada
| | - Judy Green
- Bureau of Nutritional Sciences, Food Directorate, Health Products and Food Branch, Health Canada, Ottawa, Canada
| | - Fernando Matias
- Bureau of Nutritional Sciences, Food Directorate, Health Products and Food Branch, Health Canada, Ottawa, Canada
| | - Stephen P. J. Brooks
- Bureau of Nutritional Sciences, Food Directorate, Health Products and Food Branch, Health Canada, Ottawa, Canada
| |
|
31
|
Principal Amalgamation Analysis for Microbiome Data. Genes (Basel) 2022; 13:genes13071139. [PMID: 35885922 PMCID: PMC9318429 DOI: 10.3390/genes13071139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2022] [Revised: 06/14/2022] [Accepted: 06/21/2022] [Indexed: 12/02/2022] Open
Abstract
In recent years microbiome studies have become increasingly prevalent and large-scale. Through high-throughput sequencing technologies and well-established analytical pipelines, relative abundance data of operational taxonomic units and their associated taxonomic structures are routinely produced. Since such data can be extremely sparse and high dimensional, there is often a genuine need for dimension reduction to facilitate data visualization and downstream statistical analysis. We propose Principal Amalgamation Analysis (PAA), a novel amalgamation-based and taxonomy-guided dimension reduction paradigm for microbiome data. Our approach aims to aggregate the compositions into a smaller number of principal compositions, guided by the available taxonomic structure, by minimizing a properly measured loss of information. The choice of the loss function is flexible and can be based on familiar diversity indices for preserving either within-sample or between-sample diversity in the data. To enable scalable computation, we develop a hierarchical PAA algorithm to trace the entire trajectory of successive simple amalgamations. Visualization tools including dendrogram, scree plot, and ordination plot are developed. The effectiveness of PAA is demonstrated using gut microbiome data from a preterm infant study and an HIV infection study.
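The basic operation behind an amalgamation can be sketched as summing the parts of a composition that share a taxonomic label. The example below shows only that single step under assumed toy data; it is not the PAA algorithm, which additionally chooses which amalgamations to perform by minimizing a loss. The function name, abundances, and genus labels are hypothetical.

```python
import numpy as np


def amalgamate(composition, groups):
    """Sum the parts of a composition that share a group label.

    Returns the sorted unique labels and the amalgamated composition.
    """
    composition = np.asarray(composition, dtype=float)
    labels = sorted(set(groups))
    index = {label: k for k, label in enumerate(labels)}
    out = np.zeros(len(labels))
    for value, label in zip(composition, groups):
        out[index[label]] += value
    return labels, out


# Toy relative abundances of four OTUs and their (hypothetical) genera.
otu_abundance = [0.10, 0.25, 0.05, 0.60]
otu_genus = ["Bacteroides", "Bacteroides", "Prevotella", "Roseburia"]
genera, genus_abundance = amalgamate(otu_abundance, otu_genus)
```

Summation keeps the result a valid composition: the amalgamated parts still add up to the same total as the original parts.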
|
32
|
Loss of microbiota-derived protective metabolites after neutropenic fever. Sci Rep 2022; 12:6244. [PMID: 35428797 PMCID: PMC9012881 DOI: 10.1038/s41598-022-10282-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2021] [Accepted: 04/05/2022] [Indexed: 11/08/2022] Open
Abstract
Neutropenic fever (NF) is a common complication of chemotherapy in patients with cancer which often prolongs hospitalization and worsens the quality of life. Although an empiric antimicrobial approach is used to prevent and treat NF, a clear etiology cannot be found in most cases. Emerging data suggest an altered microbiota-host crosstalk leading to NF. We profiled the serum metabolome and gut microbiome in longitudinal samples before and after NF in patients with acute myeloid leukemia, a prototype setting with a high incidence of NF. We identified a circulating metabolomic shift after NF, with a minimal signature containing 18 metabolites, 13 of which were associated with the gut microbiota. Among these metabolites were markers of intestinal epithelial health and bacterial metabolites of dietary tryptophan with known anti-inflammatory and gut-protective effects. The level of these metabolites decreased after NF, in parallel with biologically consistent changes in the abundance of mucolytic and butyrogenic bacteria with known effects on the intestinal epithelium. Together, our findings indicate a metabolomic shift with NF which is primarily characterized by a loss of microbiota-derived protective metabolites rather than an increase in detrimental metabolites. This analysis suggests that the current antimicrobial approach to NF may need a revision to protect the commensal microbiota.
|
33
|
Wang B, Caffo BS, Luo X, Liu C, Faria AV, Miller MI, Zhao Y. Regularized regression on compositional trees with application to MRI analysis. J R Stat Soc Ser C Appl Stat 2022; 71:541-561. [DOI: 10.1111/rssc.12545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022]
Affiliation(s)
- Bingkai Wang
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
| | - Brian S. Caffo
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
| | - Xi Luo
- Department of Biostatistics and Data Science, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Chin‐Fu Liu
- Center for Imaging Science, Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA
| | - Andreia V. Faria
- Department of Radiology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
| | - Michael I. Miller
- Center for Imaging Science, Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA
| | - Yi Zhao
- Department of Biostatistics, Indiana University School of Medicine and for the Alzheimer's Disease Neuroimaging Initiative, Indianapolis, Indiana, USA
| | | |
|
34
|
Greenbaum J, Lin X, Su KJ, Gong R, Shen H, Shen J, Xiao HM, Deng HW. Integration of the Human Gut Microbiome and Serum Metabolome Reveals Novel Biological Factors Involved in the Regulation of Bone Mineral Density. Front Cell Infect Microbiol 2022; 12:853499. [PMID: 35372129 PMCID: PMC8966780 DOI: 10.3389/fcimb.2022.853499] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/12/2022] [Accepted: 02/21/2022] [Indexed: 12/12/2022] Open
Abstract
While the gut microbiome has been reported to play a role in bone metabolism, the individual species and underlying functional mechanisms have not yet been characterized. We conducted a systematic multi-omics analysis using paired metagenomic and untargeted serum metabolomic profiles from a large sample of 499 peri- and early post-menopausal women to identify the potential crosstalk between these biological factors which may be involved in the regulation of bone mineral density (BMD). Single omics association analyses identified 22 bacterial species and 17 serum metabolites putatively associated with BMD. Among the identified bacteria, Bacteroidetes and Fusobacteria were negatively associated, while Firmicutes were positively associated. Several of the identified serum metabolites, including 3-phenylpropanoic acid, mainly derived from dietary polyphenols, and glycolithocholic acid, a secondary bile acid, are metabolic byproducts of the microbiota. We further conducted a supervised integrative feature selection with respect to BMD and constructed the inter-omics partial correlation network. Although still requiring replication and validation in future studies, the findings from this exploratory analysis provide novel insights into the interrelationships between the gut microbiome and serum metabolome that may play a role in skeletal remodeling processes.
Affiliation(s)
- Jonathan Greenbaum
- Tulane Center of Biomedical Informatics and Genomics, Deming Department of Medicine, Tulane University School of Medicine, Tulane University, New Orleans, LA, United States
| | - Xu Lin
- Department of Endocrinology and Metabolism, The Third Affiliated Hospital of Southern Medical University, Guangzhou, China
| | - Kuan-Jui Su
- Tulane Center of Biomedical Informatics and Genomics, Deming Department of Medicine, Tulane University School of Medicine, Tulane University, New Orleans, LA, United States
| | - Rui Gong
- Department of Endocrinology and Metabolism, The Third Affiliated Hospital of Southern Medical University, Guangzhou, China
| | - Hui Shen
- Tulane Center of Biomedical Informatics and Genomics, Deming Department of Medicine, Tulane University School of Medicine, Tulane University, New Orleans, LA, United States
| | - Jie Shen
- Department of Endocrinology and Metabolism, The Third Affiliated Hospital of Southern Medical University, Guangzhou, China
| | - Hong-Mei Xiao
- Center of Systems Biology, Data Information and Reproductive Health, School of Basic Medical Science, Central South University, Changsha, China
| | - Hong-Wen Deng
- Tulane Center of Biomedical Informatics and Genomics, Deming Department of Medicine, Tulane University School of Medicine, Tulane University, New Orleans, LA, United States
| |
|
35
|
Banerjee K, Chen J, Zhan X. Adaptive and powerful microbiome multivariate association analysis via feature selection. NAR Genom Bioinform 2022; 4:lqab120. [PMID: 35047812 PMCID: PMC8759573 DOI: 10.1093/nargab/lqab120] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/30/2021] [Revised: 11/13/2021] [Accepted: 12/24/2021] [Indexed: 02/06/2023] Open
Abstract
The important role of the human microbiome in health and disease is being increasingly recognized. Since microbiome data are typically high dimensional, one popular mode of statistical association analysis for microbiome data is to pool individual microbial features into a group, and then conduct group-based multivariate association analysis. A corresponding challenge within this approach is to achieve adequate power to detect an association signal between a group of microbial features and the outcome of interest across a wide range of scenarios. Recognizing some existing methods' susceptibility to the adverse effects of noise accumulation, we introduce the Adaptive Microbiome Association Test (AMAT), a novel and powerful tool for multivariate microbiome association analysis, which combines the benefits of feature selection in high-dimensional inference with the robustness of adaptive statistical association testing. AMAT first alleviates the burden of noise accumulation via distance correlation learning, and then conducts a data-adaptive association test under the flexible generalized linear model framework. Extensive simulation studies and real data applications demonstrate that AMAT is highly robust and often more powerful than several existing methods, while preserving the correct type I error rate. A free implementation of AMAT in the R computing environment is available at https://github.com/kzb193/AMAT.
Affiliation(s)
| | | | - Xiang Zhan
- To whom correspondence should be addressed. Tel: +86 10 62744132; Fax: +86 10 62744134;
| |
|
36
|
Rosas-Salazar C, Tang ZZ, Shilts MH, Turi KN, Hong Q, Wiggins DA, Lynch CE, Gebretsadik T, Chappell JD, Peebles RS, Anderson LJ, Das SR, Hartert TV. Upper respiratory tract bacterial-immune interactions during respiratory syncytial virus infection in infancy. J Allergy Clin Immunol 2022; 149:966-976. [PMID: 34534566 PMCID: PMC9036861 DOI: 10.1016/j.jaci.2021.08.022] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 10/23/2020] [Revised: 07/23/2021] [Accepted: 08/26/2021] [Indexed: 01/04/2023]
Abstract
BACKGROUND The risk factors determining short- and long-term morbidity following acute respiratory infection (ARI) due to respiratory syncytial virus (RSV) in infancy remain poorly understood. OBJECTIVES Our aim was to examine the associations of the upper respiratory tract (URT) microbiome during RSV ARI in infancy with the acute local immune response and short- and long-term clinical outcomes. METHODS We characterized the URT microbiome by 16S ribosomal RNA sequencing and assessed the acute local immune response by measuring 53 immune mediators with high-throughput immunoassays in 357 RSV-infected infants. Our short- and long-term clinical outcomes included several markers of disease severity and the number of wheezing episodes in the fourth year of life, respectively. RESULTS We found several specific URT bacterial-immune mediator associations. In addition, the Shannon α-diversity index of the URT microbiome was associated with a higher respiratory severity score (β = 0.50 [95% CI = 0.13-0.86]), greater odds of a lower ARI (odds ratio = 1.63 [95% CI = 1.10-2.43]), and higher number of wheezing episodes in the fourth year of life (β = 0.89 [95% CI = 0.37-1.40]). The Jaccard β-diversity index of the URT microbiome differed by level of care required (P = .04). Furthermore, we found an interaction between the Shannon α-diversity index of the URT microbiome and the first principal component of the acute local immune response on the respiratory severity score (P = .048). CONCLUSIONS The URT microbiome during RSV ARI in infancy is associated with the acute local immune response, disease severity, and number of wheezing episodes in the fourth year of life. Our results also suggest complex URT bacterial-immune interactions that can affect the severity of the RSV ARI.
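For readers unfamiliar with the two diversity indices named in the results, the sketch below computes a Shannon index for one sample and a Jaccard distance between two samples. It assumes the natural logarithm for the Shannon index and presence/absence data for the Jaccard distance; the study's exact conventions are not stated in the abstract, and the function names and toy counts are hypothetical.

```python
import numpy as np


def shannon_index(counts):
    """Within-sample (alpha) diversity: -sum(p * ln p) over observed taxa."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))


def jaccard_distance(counts_a, counts_b):
    """Between-sample (beta) diversity on presence/absence of taxa."""
    a = np.asarray(counts_a) > 0
    b = np.asarray(counts_b) > 0
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(1.0 - np.logical_and(a, b).sum() / union)


# Toy counts for four taxa in two samples.
sample_a = [10, 10, 10, 10]   # even community
sample_b = [40, 0, 0, 0]      # dominated by one taxon
alpha_a = shannon_index(sample_a)
alpha_b = shannon_index(sample_b)
beta_ab = jaccard_distance(sample_a, sample_b)
```

The even community reaches the maximum Shannon value for four taxa, while the single-taxon community scores zero; the two samples share one of four observed taxa.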
Affiliation(s)
- Christian Rosas-Salazar
- Division of Allergy, Immunology, and Pulmonary Medicine, Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN
| | - Zheng-Zheng Tang
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
| | - Meghan H. Shilts
- Division of Infectious Diseases, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
| | - Kedir N. Turi
- Division of Allergy, Pulmonary, and Critical Care Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
| | - Qilin Hong
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
| | - Derek A Wiggins
- Division of Allergy, Pulmonary, and Critical Care Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
| | - Christian E. Lynch
- Division of Allergy, Pulmonary, and Critical Care Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
| | - Tebeb Gebretsadik
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN
| | - James D. Chappell
- Division of Infectious Diseases, Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN
| | - R. Stokes Peebles
- Division of Allergy, Pulmonary, and Critical Care Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
| | - Larry J. Anderson
- Division of Infectious Diseases, Department of Pediatrics, Emory University and Children’s Healthcare of Atlanta, Atlanta, GA
| | - Suman R. Das
- Division of Infectious Diseases, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN; Department of Otolaryngology-Head and Neck Surgery, Vanderbilt University Medical Center, Nashville, TN. Corresponding Authors: Suman R. Das, PhD, Division of Infectious Diseases, Department of Medicine, Vanderbilt University Medical Center, 1161 21st Avenue South, Medical Center North, Suite A2200, Nashville, TN 37232, Phone: (615) 322-0322, Fax: (615) 343-6160; Tina V. Hartert, MD, MPH, Division of Allergy, Pulmonary, and Critical Care Medicine, Department of Medicine, Vanderbilt University Medical Center, 2525 West End Avenue, Suite 450, Nashville, TN 37232, Phone: (615) 936-3597, Fax: (615) 936-1269
| | - Tina V. Hartert
- Division of Allergy, Pulmonary, and Critical Care Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
| |
|
37
|
|
38
|
Sohn MB, Lu J, Li H. A compositional mediation model for a binary outcome: Application to microbiome studies. Bioinformatics 2021; 38:16-21. [PMID: 34415327 DOI: 10.1093/bioinformatics/btab605] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/11/2021] [Revised: 07/06/2021] [Accepted: 08/18/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION The delicate balance of the microbiome is implicated in our health and is shaped by external factors, such as diet and xenobiotics. Therefore, understanding the role of the microbiome in linking external factors and our health conditions is crucial to translate microbiome research into therapeutic and preventative applications. RESULTS We introduced a sparse compositional mediation model for binary outcomes to estimate and test the mediation effects of the microbiome, utilizing the compositional algebra defined in the simplex space and a linear zero-sum constraint on probit regression coefficients. For this model with the standard causal assumptions, we showed that both the causal direct and indirect effects are identifiable. We further developed a method for sensitivity analysis of the assumption of no unmeasured confounding between the mediator and the outcome. We conducted extensive simulation studies to assess the performance of the proposed method and applied it to real microbiome data to study the mediation effects of the microbiome in linking fat intake to overweight/obesity. AVAILABILITY AND IMPLEMENTATION An R package can be downloaded from https://github.com/mbsohn/cmmb. SUPPLEMENTARY INFORMATION Supplementary files are available at Bioinformatics online.
Affiliation(s)
- Michael B Sohn
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642, USA
| | - Jiarui Lu
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Hongzhe Li
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
|
39
|
Alenazi A. A review of compositional data analysis and recent advances. COMMUN STAT-THEOR M 2021. [DOI: 10.1080/03610926.2021.2014890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Abdulaziz Alenazi
- Department of Mathematics, College of Science, Northern Border University, Arar, Saudi Arabia
| |
|
40
|
Liu X, Cong X, Li G, Maas K, Chen K. Multivariate log-contrast regression with sub-compositional predictors: Testing the association between preterm infants' gut microbiome and neurobehavioral outcomes. Stat Med 2021; 41:580-594. [PMID: 34897772 DOI: 10.1002/sim.9273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2020] [Revised: 09/25/2021] [Accepted: 11/15/2021] [Indexed: 11/10/2022]
Abstract
To link a clinical outcome with compositional predictors in microbiome analysis, the linear log-contrast model is a popular choice, and the inference procedure for assessing the significance of each covariate is also available. However, with the existence of multiple potentially interrelated outcomes and the information of the taxonomic hierarchy of bacteria, a multivariate analysis method that considers the group structure of compositional covariates and an accompanying group inference method are still lacking. Motivated by a study for identifying the microbes in the gut microbiome of preterm infants that impact their later neurobehavioral outcomes, we formulate a constrained integrative multi-view regression. The neurobehavioral scores form multivariate responses, the log-transformed sub-compositional microbiome data form multi-view feature matrices, and a set of linear constraints on their corresponding sub-coefficient matrices ensures the sub-compositional nature. We assume that all the sub-coefficient matrices may be of low rank to enable joint selection and inference of sub-compositions/views. We propose a scaled composite nuclear norm penalization approach for model estimation and develop a hypothesis testing procedure through de-biasing to assess the significance of different views. Simulation studies confirm the effectiveness of the proposed procedure. We apply the method to the preterm infant study, and the identified microbes are mostly consistent with existing studies and biological understandings.
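The linear log-contrast model mentioned in this abstract can be illustrated with a minimal sketch. It covers only the basic single-response case (least squares on log-transformed proportions, with coefficients constrained to sum to zero), not the multi-view, nuclear-norm-penalized estimator the paper proposes. The function name `fit_log_contrast`, the simulated data, and the KKT-based solver are assumptions made for illustration.

```python
import numpy as np

def fit_log_contrast(X, y):
    """Least-squares fit of y on log(X) subject to sum(beta) == 0.

    Solves the KKT system of the equality-constrained problem:
        [Z'Z  1] [beta  ]   [Z'y]
        [1'   0] [lambda] = [ 0 ]
    """
    Z = np.log(X)
    p = Z.shape[1]
    ones = np.ones((p, 1))
    kkt = np.block([[Z.T @ Z, ones], [ones.T, np.zeros((1, 1))]])
    rhs = np.concatenate([Z.T @ y, [0.0]])
    solution = np.linalg.solve(kkt, rhs)
    return solution[:p]  # drop the Lagrange multiplier

# Simulated compositions (rows sum to 1) and a zero-sum coefficient vector.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=200)
beta_true = np.array([1.0, -1.0, 0.5, -0.5, 0.0])
y = np.log(X) @ beta_true + 0.01 * rng.normal(size=200)

beta_hat = fit_log_contrast(X, y)
print(np.round(beta_hat, 3), beta_hat.sum())
```

Because the fitted coefficients sum to zero, multiplying every composition by a common factor leaves the fit unchanged, which is the scale invariance that motivates the constraint.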
Affiliation(s)
- Xiaokang Liu
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Xiaomei Cong
  - School of Nursing, University of Connecticut, Storrs, Connecticut, USA
- Gen Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Kendra Maas
  - Microbial Analysis, Resources, and Services, University of Connecticut, Storrs, Connecticut, USA
- Kun Chen
  - Department of Statistics, University of Connecticut, Storrs, Connecticut, USA
|
41
|
A High Protein Diet Is More Effective in Improving Insulin Resistance and Glycemic Variability Compared to a Mediterranean Diet-A Cross-Over Controlled Inpatient Dietary Study. Nutrients 2021; 13:nu13124380. [PMID: 34959931 PMCID: PMC8707429 DOI: 10.3390/nu13124380] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Revised: 12/01/2021] [Accepted: 12/02/2021] [Indexed: 12/26/2022] Open
Abstract
The optimal dietary pattern to improve metabolic function remains elusive. In a 21-day randomized controlled inpatient crossover feeding trial of 20 insulin-resistant obese women, we assessed the extent to which two isocaloric dietary interventions, Mediterranean (M) and high protein (HP), improved metabolic parameters. Obese women were assigned to one of the following dietary sequences: M–HP or HP–M. Cardiometabolic parameters, body weight, glucose monitoring and gut microbiome composition were assessed. Sixteen women completed the study. Compared to the M diet, the HP diet was more effective in (i) reducing insulin resistance (insulin: Beta (95% CI) = −6.98 (−12.30, −1.65) µIU/mL, p = 0.01; HOMA-IR: −1.78 (95% CI: −3.03, −0.52), p = 9 × 10⁻³); and (ii) improving glycemic variability (−3.13 (−4.60, −1.67) mg/dL, p = 4 × 10⁻⁴), a risk factor for T2D development. We then identified a panel of 10 microbial genera predictive of the difference in glycemic variability between the two diets. These include the genera Coprococcus and Lachnoclostridium, previously associated with glucose homeostasis and insulin resistance. Our results suggest that morbidly obese women with insulin resistance can achieve better control of insulin resistance and glycemic variability on an HP diet compared to an M diet.
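For readers unfamiliar with the HOMA-IR index reported above, the sketch below uses the widely cited approximation HOMA-IR = fasting insulin (µIU/mL) × fasting glucose (mg/dL) / 405. The abstract does not say which variant the authors computed, so the choice of formula, the function name `homa_ir`, and the input values are illustrative assumptions rather than study data.

```python
def homa_ir(fasting_insulin_uiu_ml: float, fasting_glucose_mg_dl: float) -> float:
    """HOMA-IR approximation; 405 is the constant used when glucose is in mg/dL."""
    return fasting_insulin_uiu_ml * fasting_glucose_mg_dl / 405.0

# Hypothetical fasting values, not taken from the trial.
example = homa_ir(fasting_insulin_uiu_ml=10.0, fasting_glucose_mg_dl=81.0)
print(example)
```

Under this approximation, at a fixed fasting glucose the index scales linearly with fasting insulin, so a drop in insulin translates directly into a proportional drop in HOMA-IR.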
|
42
|
Huang L, Little P, Huyghe JR, Shi Q, Harrison TA, Yothers G, George TJ, Peters U, Chan AT, Newcomb PA, Sun W. A Statistical Method for Association Analysis of Cell Type Compositions. STATISTICS IN BIOSCIENCES 2021; 13:373-385. [PMID: 35003378 PMCID: PMC8735261 DOI: 10.1007/s12561-020-09293-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2019] [Revised: 03/14/2020] [Accepted: 08/28/2020] [Indexed: 12/14/2022]
Abstract
Gene expression data are often collected from tissue samples that are composed of multiple cell types. Studies of cell type composition based on gene expression data from tissue samples have recently attracted increasing research interest and led to new method development for cell type composition estimation. This new information on cell type composition can be associated with individual characteristics (e.g., genetic variants) or clinical outcomes (e.g., survival time). Such association analysis can be conducted for each cell type separately followed by multiple testing correction. An alternative approach is to evaluate this association using the composition of all the cell types, thus aggregating association signals across cell types. A key challenge of this approach is to account for the dependence across cell types. We propose a new method to quantify the distances between cell types while accounting for their dependencies, and use this information for association analysis. We demonstrate our method in two applied examples that assess the association of immune cell type composition in tumor samples of colorectal cancer patients with survival time and with SNP genotypes. We found that immune cell composition has prognostic value, and that our distance metric leads to more accurate survival time prediction than other distance metrics that ignore cell type dependencies. In addition, survival time-associated SNPs are enriched among the SNPs associated with immune cell composition.
Affiliation(s)
- Licai Huang
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA
- Paul Little
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA
- Jeroen R Huyghe
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA
- Qian Shi
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Tabitha A Harrison
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA
- Greg Yothers
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA
- Thomas J George
  - Department of Medicine, University of Florida Health Cancer Center, Gainesville, FL
- Ulrike Peters
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA
- Andrew T Chan
  - Massachusetts General Hospital and Harvard Medical School, Boston, MA
- Polly A Newcomb
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA
- Wei Sun
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA
|
43
|
Devlin SM, Martin A, Ostrovnaya I. Identifying prognostic pairwise relationships among bacterial species in microbiome studies. PLoS Comput Biol 2021; 17:e1009501. [PMID: 34752448 PMCID: PMC8631663 DOI: 10.1371/journal.pcbi.1009501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2020] [Revised: 11/30/2021] [Accepted: 09/28/2021] [Indexed: 11/18/2022] Open
Abstract
In recent literature, the human microbiome has been shown to have a major influence on human health. To investigate this impact, scientists study the composition and abundance of bacterial species, commonly using 16S rRNA gene sequencing, among patients with and without a disease or condition. Methods for such investigations to date have focused on the association between an individual bacterium and an outcome, and higher-order pairwise relationships or interactions among bacteria are often avoided due to the substantial increase in dimension and the potential for spurious correlations. However, overlooking such relationships ignores the environment of the microbiome, where there is dynamic cooperation and competition among bacteria. We present a method for identifying and ranking pairs of bacteria that have a differential dichotomized relationship across outcomes. Our approach, implemented in an R package PairSeek, uses the stability selection framework with data-driven dichotomized forms of the pairwise relationships. We illustrate the properties of the proposed method using a published oral cancer data set and a simulation study.
Within an ecological system, microbial communities represent complex relationships between bacteria, where they co-exist and interact with each other in multiple ways including cooperation and competition. Most existing statistical tools for examining the association between microbiota and a disease state, such as individuals with and without cancer, focus on individual bacteria in isolation, ignoring the dynamic environment in which they live. In this manuscript, we propose an algorithm for assessing the association between pairs of bacteria and a disease state. The approach provides a mechanism to rank pairs of bacteria, from the pairs with the most evidence of an association with the disease state to those with the least. This ranking helps generate hypotheses and prioritize bacteria for further investigation. We illustrate the algorithm using a publicly available data set of oral cancer patients.
Affiliation(s)
- Sean M Devlin
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, United States of America
- Axel Martin
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, United States of America
- Irina Ostrovnaya
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, United States of America
|
44
|
Tomassi D, Forzani L, Duarte S, Pfeiffer RM. Sufficient dimension reduction for compositional data. Biostatistics 2021; 22:687-705. [PMID: 31886477 DOI: 10.1093/biostatistics/kxz060] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2019] [Revised: 12/03/2019] [Accepted: 12/04/2019] [Indexed: 11/12/2022] Open
Abstract
Recent efforts to characterize the human microbiome and its relation to chronic diseases have led to a surge in statistical development for compositional data. We develop likelihood-based sufficient dimension reduction methods (SDR) to find linear combinations that contain all the information in the compositional data on an outcome variable, i.e., are sufficient for modeling and prediction of the outcome. We consider several models for the inverse regression of the compositional vector or transformations of it, as a function of outcome. They include normal, multinomial, and Poisson graphical models that allow for complex dependencies among observed counts. These methods yield efficient estimators of the reduction and can be applied to continuous or categorical outcomes. We incorporate variable selection into the estimation via penalties and address important invariance issues arising from the compositional nature of the data. We illustrate and compare our methods and some established methods for analyzing microbiome data in simulations and using data from the Human Microbiome Project. Displaying the data in the coordinate system of the SDR linear combinations allows visual inspection and facilitates comparisons across studies.
Affiliation(s)
- Diego Tomassi
  - CONICET and Facultad de Ingeniería Química, Universidad Nacional del Litoral, Santiago del Estero 2829, 3000 Santa Fe, Argentina and Institut Charles Delaunay/ROSAS Department, Systems Modelling and Dependability Team, Université de Technologie de Troyes, 12 rue Marie Curie, 10004 Troyes Cedex, France
- Liliana Forzani
  - CONICET and Facultad de Ingeniería Química, Universidad Nacional del Litoral, Santiago del Estero 2829, 3000 Santa Fe, Argentina
- Sabrina Duarte
  - Biostatistics Branch, National Cancer Institute, 9609 Medical Center Drive, Bethesda, MD 20892, USA
|
45
|
Monti GS, Filzmoser P. Robust logistic zero-sum regression for microbiome compositional data. ADV DATA ANAL CLASSI 2021. [DOI: 10.1007/s11634-021-00465-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
We introduce the Robust Logistic Zero-Sum Regression (RobLZS) estimator, which can be used for a two-class problem with high-dimensional compositional covariates. Since the log-contrast model is employed, the estimator is able to do feature selection among the compositional parts. The proposed method attains robustness by minimizing a trimmed sum of deviances. A comparison of the performance of the RobLZS estimator with a non-robust counterpart and with other sparse logistic regression estimators is conducted via Monte Carlo simulation studies. Two microbiome data applications are considered to investigate the stability of the estimators in the presence of outliers. Robust Logistic Zero-Sum Regression is available as an R package that can be downloaded at https://github.com/giannamonti/RobZS.
|
46
|
Srinivasan A, Xue L, Zhan X. Compositional knockoff filter for high-dimensional regression analysis of microbiome data. Biometrics 2021; 77:984-995. [PMID: 32683674 PMCID: PMC7831267 DOI: 10.1111/biom.13336] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2019] [Revised: 06/29/2020] [Accepted: 07/09/2020] [Indexed: 01/10/2023]
Abstract
A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter to provide the effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we propose a new compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high-dimensional microbial taxa as related to the response under a prespecified FDR threshold. We study the theoretical properties of the proposed two-step procedure, including both sure screening and effective false discovery control. We demonstrate these properties in numerical simulation studies to compare our methods to some existing ones and show power gain of the new method while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with application to an inflammatory bowel disease data set to identify microbial taxa that influence host gene expressions.
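The selection step of a knockoff filter can be sketched as follows. This shows only the generic knockoff+ thresholding rule from the knockoff literature, applied to feature statistics W that are assumed to be already computed; it does not reproduce the compositional screening step or the knockoff construction of the paper. The function name and the W values are hypothetical.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns inf when no candidate satisfies the bound (nothing is selected)."""
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        estimated_fdp = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if estimated_fdp <= q:
            return t
    return np.inf

# Hypothetical statistics: large positive values suggest a real signal,
# values scattered around zero behave like nulls.
W = np.array([5.0, 4.0, 3.5, 3.0, 2.5, 2.0, -0.5, 0.3, -0.2, 0.1])
t = knockoff_plus_threshold(W, q=0.2)
selected = np.flatnonzero(W >= t)
print(t, selected)
```

The count of large negative statistics acts as an estimate of how many null features sit among the large positive ones, which is what lets the rule target a prespecified FDR level q.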
Affiliation(s)
- Arun Srinivasan
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A
- Lingzhou Xue
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A
- Xiang Zhan
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033, U.S.A
|
47
|
Monti GS, Filzmoser P. Sparse least trimmed squares regression with compositional covariates for high dimensional data. Bioinformatics 2021; 37:3805-3814. [PMID: 34358286 DOI: 10.1093/bioinformatics/btab572] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2021] [Revised: 07/08/2021] [Accepted: 08/03/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION High-throughput sequencing technologies generate a huge amount of data, permitting the quantification of microbiome compositions. The obtained data are essentially sparse compositional data vectors, namely vectors of bacterial gene proportions which compose the microbiome. Subsequently, the need for statistical and computational methods that consider the special nature of microbiome data has increased. A critical aspect in microbiome research is to identify microbes associated with a clinical outcome. Another crucial aspect with high-dimensional data is the detection of outlying observations, whose presence seriously affects the prediction accuracy. RESULTS In this article we connect robustness and sparsity in the context of variable selection in regression with compositional covariates and a continuous response. The compositional character of the covariates is taken into account by a linear log-contrast model, and elastic-net regularization achieves sparsity in the regression coefficient estimates. Robustness is obtained by performing trimming in the objective function of the estimator. A reweighting step increases the efficiency of the estimator, and it also allows for diagnostics in terms of outlier identification. The numerical performance of the proposed method is evaluated via simulation studies, and its usefulness is illustrated by an application to a microbiome study with the aim of predicting caffeine intake based on the human gut microbiome composition. AVAILABILITY The R-package "RobZS" can be downloaded at https://github.com/giannamonti/RobZS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
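The idea of trimming in the objective function can be sketched in a few lines. The function below evaluates a trimmed squared-error loss with an elastic-net penalty for a given coefficient vector; it is not the RobZS implementation, and the scaling by 1/(2h), the penalty parameterization, and all names are assumptions for illustration. No optimization or reweighting step is performed.

```python
import numpy as np

def trimmed_enet_objective(Z, y, beta, h, lam=0.0, alpha=1.0):
    """Sum of the h smallest squared residuals (scaled) plus an elastic-net penalty."""
    squared_residuals = (y - Z @ beta) ** 2
    kept = np.sort(squared_residuals)[:h]          # trimming: drop the n - h largest
    penalty = lam * (alpha * np.abs(beta).sum()
                     + 0.5 * (1.0 - alpha) * np.square(beta).sum())
    return kept.sum() / (2.0 * h) + penalty

rng = np.random.default_rng(1)
n = 50
Z = np.log(rng.dirichlet(np.ones(4), size=n))      # log-transformed compositions
beta = np.array([1.0, -1.0, 0.5, -0.5])            # zero-sum log-contrast coefficients
y = Z @ beta
y[0] += 100.0                                      # one gross outlier in the response

full = trimmed_enet_objective(Z, y, beta, h=n)         # outlier included
trimmed = trimmed_enet_objective(Z, y, beta, h=n - 1)  # outlier trimmed away
print(full, trimmed)
```

With h = n the single contaminated observation dominates the loss at the true coefficients, whereas with h = n − 1 it is ignored, which is why a trimmed objective can resist outlying observations.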
Affiliation(s)
- Gianna Serafina Monti
  - Department of Economics, Management and Statistics, University of Milano-Bicocca, Italy
- Peter Filzmoser
  - Institute of Statistics & Mathematical Methods in Economics, Vienna University of Technology, Austria
|
48
|
Schultheiss UT, Kosch R, Kotsis F, Altenbuchinger M, Zacharias HU. Chronic Kidney Disease Cohort Studies: A Guide to Metabolome Analyses. Metabolites 2021; 11:460. [PMID: 34357354 PMCID: PMC8304377 DOI: 10.3390/metabo11070460] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2021] [Revised: 07/08/2021] [Accepted: 07/12/2021] [Indexed: 12/14/2022] Open
Abstract
Kidney diseases still pose one of the biggest challenges for global health, and their heterogeneity and often high comorbidity load seriously hinder the unraveling of their underlying pathomechanisms and the delivery of optimal patient care. Metabolomics, the quantitative study of small organic compounds, called metabolites, in a biological specimen, is gaining more and more importance in nephrology research. Conducting a metabolomics study in human kidney disease cohorts, however, requires thorough knowledge about the key workflow steps: study planning, sample collection, metabolomics data acquisition and preprocessing, statistical/bioinformatics data analysis, and results interpretation within a biomedical context. This review provides a guide for future metabolomics studies in human kidney disease cohorts. We will offer an overview of important a priori considerations for metabolomics cohort studies, available analytical as well as statistical/bioinformatics data analysis techniques, and subsequent interpretation of metabolic findings. We will further point out potential research questions for metabolomics studies in the context of kidney diseases and summarize the main results and data availability of important studies already conducted in this field.
Affiliation(s)
- Ulla T. Schultheiss
  - Institute of Genetic Epidemiology, Faculty of Medicine and Medical Center, University of Freiburg, 79106 Freiburg, Germany
  - Department of Medicine IV—Nephrology and Primary Care, Faculty of Medicine and Medical Center, University of Freiburg, 79106 Freiburg, Germany
- Robin Kosch
  - Computational Biology, University of Hohenheim, 70599 Stuttgart, Germany
- Fruzsina Kotsis
  - Institute of Genetic Epidemiology, Faculty of Medicine and Medical Center, University of Freiburg, 79106 Freiburg, Germany
  - Department of Medicine IV—Nephrology and Primary Care, Faculty of Medicine and Medical Center, University of Freiburg, 79106 Freiburg, Germany
- Michael Altenbuchinger
  - Institute of Medical Bioinformatics, University Medical Center Göttingen, 37077 Göttingen, Germany
- Helena U. Zacharias
  - Department of Internal Medicine I, University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
  - Institute of Clinical Molecular Biology, Kiel University and University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany
|
49
|
Mahmoudian M, Venäläinen MS, Klén R, Elo LL. Stable Iterative Variable Selection. Bioinformatics 2021; 37:4810-4817. [PMID: 34270690 PMCID: PMC8665768 DOI: 10.1093/bioinformatics/btab501] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2020] [Revised: 05/20/2021] [Accepted: 07/14/2021] [Indexed: 11/13/2022] Open
Abstract
Motivation The emergence of datasets with tens of thousands of features, such as high-throughput omics biomedical data, highlights the importance of reducing the feature space into a distilled subset that can truly capture the signal for research and industry by aiding in finding more effective biomarkers for the question at hand. A good feature set also facilitates building robust predictive models with improved interpretability and convergence of the applied method due to the smaller feature space. Results Here, we present a robust feature selection method named Stable Iterative Variable Selection (SIVS) and assess its performance over both omics and clinical data types. As a performance assessment metric, we compared the number and goodness of the features selected using SIVS to those selected by Least Absolute Shrinkage and Selection Operator regression. The results suggested that the feature space selected by SIVS was, on average, 41% smaller, without having a negative effect on the model performance. A similar result was observed for comparison with Boruta and caret RFE. Availability and implementation The method is implemented as an R package under GNU General Public License v3.0 and is accessible via Comprehensive R Archive Network (CRAN) via https://cran.r-project.org/package=sivs. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mehrad Mahmoudian
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
  - Department of Future Technologies, University of Turku, Turku, Finland
- Mikko S Venäläinen
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
- Riku Klén
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
- Laura L Elo
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
  - Institute of Biomedicine, University of Turku, Turku, Finland
|
50
|
Bien J, Yan X, Simpson L, Müller CL. Tree-aggregated predictive modeling of microbiome data. Sci Rep 2021; 11:14505. [PMID: 34267244 PMCID: PMC8282688 DOI: 10.1038/s41598-021-93645-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2020] [Accepted: 06/22/2021] [Indexed: 01/05/2023] Open
Abstract
Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
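One way to picture tree-guided aggregation inside a log-contrast model is sketched below. It assumes that aggregation is represented by tying the coefficients of all leaves under a tree node to a shared node-level coefficient; this is an illustrative reading of the abstract, not the API or the fitting procedure of the trac software. The toy tree, the matrix `A`, and the coefficient values are hypothetical.

```python
import numpy as np

# Hypothetical tree: 4 leaves (sequence variants), 2 genera, 1 root.
# A[j, u] = 1 if leaf j is node u or descends from node u.
# Columns: leaf0, leaf1, leaf2, leaf3, genus_a, genus_b, root
A = np.array([
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1, 1],
], dtype=float)

# Node-level coefficients: only the two genera are active.
gamma = np.array([0, 0, 0, 0, 1.0, -1.0, 0])

# Leaf-level coefficients implied by the tree: leaves in a genus share a value.
beta = A @ gamma

x = np.array([0.1, 0.2, 0.3, 0.4])           # one composition (sums to 1)
leaf_form = np.log(x) @ beta                  # usual log-contrast on the leaves
node_form = (np.log(x) @ A) @ gamma           # same predictor on aggregated features
print(beta, leaf_form, node_form)
```

In this toy case the predictor reduces to the log-ratio of the product of the leaves in one genus to the product of the leaves in the other, so selecting a node is equivalent to working with a geometric-mean-type aggregate of the leaves beneath it.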
Affiliation(s)
- Jacob Bien
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA
- Léo Simpson
  - Technische Universität München, Munich, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Christian L Müller
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
  - Center for Computational Mathematics, Flatiron Institute, Simons Foundation, New York, NY, USA
|