1
Mishra AK, Mahmud I, Lorenzi PL, Jenq RR, Wargo JA, Ajami NJ, Peterson CB. TARO: tree-aggregated factor regression for microbiome data integration. Bioinformatics 2024; 40:btae321. [PMID: 38788190 PMCID: PMC11193058 DOI: 10.1093/bioinformatics/btae321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 03/16/2024] [Accepted: 05/15/2024] [Indexed: 05/26/2024] Open
Abstract
MOTIVATION Although the human microbiome plays a key role in health and disease, the biological mechanisms underlying the interaction between the microbiome and its host are incompletely understood. Integration with other molecular profiling data offers an opportunity to characterize the role of the microbiome and elucidate therapeutic targets. However, this remains challenging due to the high dimensionality, compositionality, and rare features found in microbiome profiling data. These challenges necessitate the use of methods that can achieve structured sparsity in learning cross-platform association patterns. RESULTS We propose Tree-Aggregated factor RegressiOn (TARO) for the integration of microbiome and metabolomic data. We leverage information on the taxonomic tree structure to flexibly aggregate rare features. We demonstrate through simulation studies that TARO accurately recovers a low-rank coefficient matrix and identifies relevant features. We applied TARO to microbiome and metabolomic profiles gathered from subjects being screened for colorectal cancer to understand how gut microorganisms shape intestinal metabolite abundances. AVAILABILITY AND IMPLEMENTATION The R package TARO implementing the proposed methods is available online at https://github.com/amishra-stats/taro-package.
Affiliation(s)
- Aditya K Mishra
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Platform for Innovative Microbiome and Translational Research (PRIME-TR), The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Iqbal Mahmud
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Philip L Lorenzi
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Robert R Jenq
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Platform for Innovative Microbiome and Translational Research (PRIME-TR), The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Jennifer A Wargo
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Platform for Innovative Microbiome and Translational Research (PRIME-TR), The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Department of Surgical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Nadim J Ajami
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Platform for Innovative Microbiome and Translational Research (PRIME-TR), The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Christine B Peterson
  - Platform for Innovative Microbiome and Translational Research (PRIME-TR), The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States

2
Abstract
The microbiome represents a hidden world of tiny organisms populating not only our surroundings but also our own bodies. By enabling comprehensive profiling of these invisible creatures, modern genomic sequencing tools have given us an unprecedented ability to characterize these populations and uncover their outsize impact on our environment and health. Statistical analysis of microbiome data is critical to infer patterns from the observed abundances. The application and development of analytical methods in this area require careful consideration of the unique aspects of microbiome profiles. We begin this review with a brief overview of microbiome data collection and processing and describe the resulting data structure. We then provide an overview of statistical methods for key tasks in microbiome data analysis, including data visualization, comparison of microbial abundance across groups, regression modeling, and network inference. We conclude with a discussion and highlight interesting future directions.
Affiliation(s)
- Christine B Peterson
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Satabdi Saha
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Kim-Anh Do
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA

3
Zhang Q, Coury R, Tang W. Prediction of conversion from mild cognitive impairment to Alzheimer's disease and simultaneous feature selection and grouping using Medicaid claim data. Alzheimers Res Ther 2024; 16:54. [PMID: 38461266 PMCID: PMC10924319 DOI: 10.1186/s13195-024-01421-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Accepted: 02/27/2024] [Indexed: 03/11/2024]
Abstract
BACKGROUND Due to the heterogeneity among patients with Mild Cognitive Impairment (MCI), it is critical to predict their risk of converting to Alzheimer's disease (AD) early using routinely collected real-world data such as electronic health record data or administrative claim data. METHODS The study used MarketScan Multi-State Medicaid data to construct a cohort of MCI patients. Logistic regression with tree-guided lasso regularization (TGL) was proposed to select important features and predict the risk of converting to AD. A subsampling-based technique was used to extract robust groups of predictive features. Predictive models including logistic regression, generalized random forest, and artificial neural network were trained using the extracted features. RESULTS The proposed TGL workflow selected feature groups that were robust, highly interpretable, and consistent with existing literature. The predictive models using TGL-selected features demonstrated higher prediction accuracy than the models using all features or features selected using other methods. CONCLUSIONS The identified feature groups provide insights into the progression from MCI to AD and can potentially improve risk prediction in clinical practice and trial recruitment.
Affiliation(s)
- Qi Zhang
  - Department of Mathematics and Statistics, University of New Hampshire, Durham, NH, 03824, USA
- Ron Coury
  - Department of Mathematics and Statistics, University of New Hampshire, Durham, NH, 03824, USA
- Wenlong Tang
  - Takeda Pharmaceuticals, Cambridge, MA, 02142, USA

4
Molstad AJ, Motwani K. Multiresolution categorical regression for interpretable cell-type annotation. Biometrics 2023; 79:3485-3496. [PMID: 37798600 DOI: 10.1111/biom.13926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2022] [Accepted: 08/07/2023] [Indexed: 10/07/2023]
Abstract
In many categorical response regression applications, the response categories admit a multiresolution structure. That is, subsets of the response categories may naturally be combined into coarser response categories. In such applications, practitioners are often interested in estimating the resolution at which a predictor affects the response category probabilities. In this paper, we propose a method for fitting the multinomial logistic regression model in high dimensions that addresses this problem in a unified and data-driven way. Our method allows practitioners to identify which predictors distinguish between coarse categories but not fine categories, which predictors distinguish between fine categories, and which predictors are irrelevant. For model fitting, we propose a scalable algorithm that can be applied when the coarse categories are defined by either overlapping or nonoverlapping sets of fine categories. Statistical properties of our method reveal that it can take advantage of this multiresolution structure in a way existing estimators cannot. We use our method to model cell-type probabilities as a function of a cell's gene expression profile (i.e., cell-type annotation). Our fitted model provides novel biological insights which may be useful for future automated and manual cell-type annotation methodology.
Affiliation(s)
- Aaron J Molstad
  - School of Statistics, University of Minnesota, Minneapolis, Minnesota, USA
- Keshav Motwani
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA

5
Mishra AK, Mahmud I, Lorenzi PL, Jenq RR, Wargo JA, Ajami NJ, Peterson CB. TARO: tree-aggregated factor regression for microbiome data integration. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.17.562792. [PMID: 37904958 PMCID: PMC10614880 DOI: 10.1101/2023.10.17.562792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
Motivation Although the human microbiome plays a key role in health and disease, the biological mechanisms underlying the interaction between the microbiome and its host are incompletely understood. Integration with other molecular profiling data offers an opportunity to characterize the role of the microbiome and elucidate therapeutic targets. However, this remains challenging due to the high dimensionality, compositionality, and rare features found in microbiome profiling data. These challenges necessitate the use of methods that can achieve structured sparsity in learning cross-platform association patterns. Results We propose Tree-Aggregated factor RegressiOn (TARO) for the integration of microbiome and metabolomic data. We leverage information on the phylogenetic tree structure to flexibly aggregate rare features. We demonstrate through simulation studies that TARO accurately recovers a low-rank coefficient matrix and identifies relevant features. We applied TARO to microbiome and metabolomic profiles gathered from subjects being screened for colorectal cancer to understand how gut microorganisms shape intestinal metabolite abundances. Availability and implementation The R package TARO implementing the proposed methods is available online at https://github.com/amishra-stats/taro-package.

6
Zhao Y, Wang B, Liu CF, Faria AV, Miller MI, Caffo BS, Luo X. Identifying brain hierarchical structures associated with Alzheimer's disease using a regularized regression method with tree predictors. Biometrics 2023; 79:2333-2345. [PMID: 36263865 PMCID: PMC10115907 DOI: 10.1111/biom.13775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2021] [Accepted: 10/03/2022] [Indexed: 11/30/2022]
Abstract
Brain segmentation at different levels is generally represented as hierarchical trees. Brain regional atrophy at specific levels was found to be marginally associated with Alzheimer's disease outcomes. In this study, we propose an ℓ1-type regularization for predictors that follow a hierarchical tree structure. Considering a tree as a directed acyclic graph, we interpret the model parameters from a path analysis perspective. Under this concept, the proposed penalty regulates the total effect of each predictor on the outcome. With regularity conditions, it is shown that under the proposed regularization, the estimator of the model coefficient is consistent in ℓ2-norm and the model selection is also consistent. When applied to a brain sMRI dataset acquired from the Alzheimer's Disease Neuroimaging Initiative (ADNI), the proposed approach identifies brain regions in which atrophy is associated with memory decline. With regularization on the total effects, the findings suggest that the impact of atrophy on memory deficits is localized to small brain regions, but at various levels of brain segmentation. Data used in preparation of this paper were obtained from the ADNI database.
Affiliation(s)
- Yi Zhao
  - Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Bingkai Wang
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- Chin-Fu Liu
  - Center for Imaging Science, Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA
- Andreia V. Faria
  - Department of Radiology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Michael I. Miller
  - Center for Imaging Science, Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA
- Brian S. Caffo
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- Xi Luo
  - Department of Biostatistics and Data Science, The University of Texas Health Science Center at Houston, Houston, Texas, USA

7
Li G, Li Y, Chen K. It's all relative: Regression analysis with compositional predictors. Biometrics 2023; 79:1318-1329. [PMID: 35616500 PMCID: PMC9767704 DOI: 10.1111/biom.13703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2021] [Accepted: 05/18/2022] [Indexed: 01/05/2023]
Abstract
Compositional data reside in a simplex and measure fractions or proportions of parts to a whole. Most existing regression methods for such data rely on log-ratio transformations that are inadequate or inappropriate in modeling high-dimensional data with excessive zeros and hierarchical structures. Moreover, such models usually lack a straightforward interpretation due to the interrelation between parts of a composition. We develop a novel relative-shift regression framework that directly uses proportions as predictors. The new framework provides a paradigm shift for regression analysis with compositional predictors and offers a superior interpretation of how shifting concentration between parts affects the response. New equi-sparsity and tree-guided regularization methods and an efficient smoothing proximal gradient algorithm are developed to facilitate feature aggregation and dimension reduction in regression. A unified finite-sample prediction error bound is derived for the proposed regularized estimators. We demonstrate the efficacy of the proposed methods in extensive simulation studies and a real gut microbiome study. Guided by the taxonomy of the microbiome data, the framework identifies important taxa at different taxonomic levels associated with the neurodevelopment of preterm infants.
Affiliation(s)
- Gen Li
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
- Yan Li
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
- Kun Chen
  - Department of Statistics, University of Connecticut, Connecticut, USA

8
Zhong W, Qian C, Liu W, Zhu L, Li R. Feature Screening for Interval-Valued Response with Application to Study Association between Posted Salary and Required Skills. J Am Stat Assoc 2023; 118:805-817. [PMID: 37448462 PMCID: PMC10338024 DOI: 10.1080/01621459.2022.2152342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2021] [Revised: 11/17/2022] [Accepted: 11/22/2022] [Indexed: 12/05/2022]
Abstract
It is important to quantify the differences in returns to skills using online job advertisement data, which have attracted great interest in both labor economics and statistics. In this paper, we study the relationship between the posted salary and the job requirements in online labor markets. There are two challenges to deal with. First, the posted salary is always presented in an interval-valued form, for example, 5k-10k yuan per month. Simply taking the mid-point or the lower bound as a substitute for the salary may result in biased estimators. Second, the number of potential skill words generated as predictors from the job advertisements by word segmentation is very large, and many of them may not contribute to the salary. To this end, we propose a new feature screening method, Absolute Distribution Difference Sure Independence Screening (ADD-SIS), to select important skill words for the interval-valued response. The marginal utility for feature screening is based on the difference of estimated distribution functions via nonparametric maximum likelihood estimation, which makes full use of the interval information. It is model-free and robust to outliers. Numerical simulations show that the new method using the interval information selects important predictors more efficiently than methods based only on single points of the intervals. In the real data application, we study the text data of job advertisements for data scientists and data analysts on a major Chinese online job posting website, and explore the important skill words for the salary. We find that skill words such as optimization, long short-term memory (LSTM), convolutional neural networks (CNN), and collaborative filtering are positively correlated with the salary, while words such as Excel, Office, and data collection may contribute negatively to the salary.

9
Liu M, Zhang Q, Ma S. A tree-based gene-environment interaction analysis with rare features. Stat Anal Data Min 2022; 15:648-674. [PMID: 38046814 PMCID: PMC10691867 DOI: 10.1002/sam.11578] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/14/2021] [Accepted: 02/14/2022] [Indexed: 01/20/2023]
Abstract
Gene-environment (G-E) interaction analysis plays a critical role in understanding and modeling complex diseases. Compared to main-effect-only analysis, it is more seriously challenged by higher dimensionality, weaker signals, and the unique "main effects, interactions" variable selection hierarchy. In joint G-E interaction analysis, under which a large number of G factors are analysed in a single model, effort tailored to rare features (e.g., SNPs with low minor allele frequencies) has been limited. Existing investigations on rare features have mostly focused on marginal analysis, where various data aggregation techniques have been developed, and hypothesis tests have been conducted to identify significant aggregated features. However, such techniques cannot be extended to joint G-E interaction analysis. In this study, building on a very recent tree-based data aggregation technique, which has been developed for main-effect-only analysis, we develop a new G-E interaction analysis approach tailored to rare features. The adopted data aggregation technique allows for more efficient information borrowing from neighboring rare features. Like some existing state-of-the-art approaches, the proposed approach adopts penalization for variable selection and regularized estimation while respecting the variable selection hierarchy. Simulations show that it identifies important interactions and main effects more accurately than several competing alternatives. In the analysis of the NFBC1966 study, the proposed approach leads to findings different from the alternatives and with satisfactory prediction and stability performance.
Affiliation(s)
- Mengque Liu
  - School of Journalism and New Media, Xi’an Jiaotong University, Shanxi Xi’an, China
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, Fujian Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA

10
Hecq A, Ternes M, Wilms I. Hierarchical Regularizers for Mixed-Frequency Vector Autoregressions. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2058003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Alain Hecq
  - Department of Quantitative Economics, Maastricht University
- Marie Ternes
  - Department of Quantitative Economics, Maastricht University
- Ines Wilms
  - Department of Quantitative Economics, Maastricht University

11
Wang B, Caffo BS, Luo X, Liu C, Faria AV, Miller MI, Zhao Y. Regularized regression on compositional trees with application to MRI analysis. J R Stat Soc Ser C Appl Stat 2022; 71:541-561. [DOI: 10.1111/rssc.12545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022]
Affiliation(s)
- Bingkai Wang
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- Brian S. Caffo
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- Xi Luo
  - Department of Biostatistics and Data Science, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Chin-Fu Liu
  - Center for Imaging Science, Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA
- Andreia V. Faria
  - Department of Radiology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Michael I. Miller
  - Center for Imaging Science, Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA
- Yi Zhao
  - Department of Biostatistics, Indiana University School of Medicine and for the Alzheimer's Disease Neuroimaging Initiative, Indianapolis, Indiana, USA

12
Ostner J, Carcy S, Müller CL. tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data. Front Genet 2021; 12:766405. [PMID: 34950190 PMCID: PMC8689185 DOI: 10.3389/fgene.2021.766405] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/29/2021] [Accepted: 11/01/2021] [Indexed: 12/11/2022] Open
Abstract
Accurate generative statistical modeling of count data is of critical relevance for the analysis of biological datasets from high-throughput sequencing technologies. Important instances include the modeling of microbiome compositions from amplicon sequencing surveys and the analysis of cell type compositions derived from single-cell RNA sequencing. Microbial and cell type abundance data share remarkably similar statistical features, including their inherent compositionality and a natural hierarchical ordering of the individual components from taxonomic or cell lineage tree information, respectively. To this end, we introduce a Bayesian model for tree-aggregated amplicon and single-cell compositional data analysis (tascCODA) that seamlessly integrates hierarchical information and experimental covariate data into the generative modeling of compositional count data. By combining latent parameters based on the tree structure with spike-and-slab Lasso penalization, tascCODA can determine covariate effects across different levels of the population hierarchy in a data-driven parsimonious way. In the context of differential abundance testing, we validate tascCODA's excellent performance on a comprehensive set of synthetic benchmark scenarios. Our analyses on human single-cell RNA-seq data from ulcerative colitis patients and amplicon data from patients with irritable bowel syndrome, respectively, identified aggregated cell type and taxon compositional changes that were more predictive and parsimonious than those proposed by other schemes. We posit that tascCODA constitutes a valuable addition to the growing statistical toolbox for generative modeling and analysis of compositional changes in microbial or cell population data.
Affiliation(s)
- Johannes Ostner
  - Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Salomé Carcy
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Department of Biology, École Normale Supérieure, PSL University, Paris, France
- Christian L. Müller
  - Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Center for Computational Mathematics, Flatiron Institute, New York, NY, United States

13
Bien J, Yan X, Simpson L, Müller CL. Tree-aggregated predictive modeling of microbiome data. Sci Rep 2021; 11:14505. [PMID: 34267244 PMCID: PMC8282688 DOI: 10.1038/s41598-021-93645-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 10/15/2020] [Accepted: 06/22/2021] [Indexed: 01/05/2023] Open
Abstract
Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
Affiliation(s)
- Jacob Bien
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA
- Léo Simpson
  - Technische Universität München, Munich, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Christian L Müller
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
  - Center for Computational Mathematics, Flatiron Institute, Simons Foundation, New York, NY, USA

14
Xu S, Dai B, Wang J. Sentiment analysis with covariate-assisted word embeddings. Electron J Stat 2021. [DOI: 10.1214/21-ejs1854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Shirong Xu
  - School of Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong
- Ben Dai
  - School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
- Junhui Wang
  - School of Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong