1
|
Wang C, Ahn J, Tarpey T, Yi SS, Hayes RB, Li H. A microbial causal mediation analytic tool for health disparity and applications in body mass index. MICROBIOME 2023; 11:164. [PMID: 37496080 PMCID: PMC10373330 DOI: 10.1186/s40168-023-01608-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Accepted: 06/22/2023] [Indexed: 07/28/2023]
Abstract
BACKGROUND: Emerging evidence suggests the potential mediating role of microbiome in health disparities. However, no analytic framework can be directly used to analyze microbiome as a mediator between health disparity and clinical outcome, due to the non-manipulable nature of the exposure and the unique structure of microbiome data, including high dimensionality, sparsity, and compositionality.
METHODS: Considering the modifiable and quantitative features of the microbiome, we propose a microbial causal mediation model framework, SparseMCMM_HD, to uncover the mediating role of microbiome in health disparities, by depicting a plausible path from a non-manipulable exposure (e.g., ethnicity or region) to the outcome through the microbiome. The proposed SparseMCMM_HD rigorously defines and quantifies the manipulable disparity measure that would be eliminated by equalizing microbiome profiles between comparison and reference groups and innovatively and successfully extends the existing microbial mediation methods, which are originally proposed under potential outcome or counterfactual outcome study design, to address health disparities.
RESULTS: Through three body mass index (BMI) studies selected from the curatedMetagenomicData 3.4.2 package and the American gut project: China vs. USA, China vs. UK, and Asian or Pacific Islander (API) vs. Caucasian, we exhibit the utility of the proposed SparseMCMM_HD framework for investigating the microbiome's contributions in health disparities. Specifically, BMI exhibits disparities and microbial community diversities are significantly distinctive between reference and comparison groups in all three applications. By employing SparseMCMM_HD, we illustrate that microbiome plays a crucial role in explaining the disparities in BMI between ethnicities or regions. 20.63%, 33.09%, and 25.71% of the overall disparity in BMI in China-USA, China-UK, and API-Caucasian comparisons, respectively, would be eliminated if the between-group microbiome profiles were equalized; and 15, 18, and 16 species are identified to play the mediating role respectively.
CONCLUSIONS: The proposed SparseMCMM_HD is an effective and validated tool to elucidate the mediating role of microbiome in health disparity. Three BMI applications shed light on the utility of microbiome in reducing BMI disparity by manipulating microbial profiles.
Affiliation(s)
- Chan Wang
  - Department of Population Health, Division of Biostatistics, New York University Grossman School of Medicine, New York, NY, 10016, USA
- Jiyoung Ahn
  - Department of Population Health, Division of Epidemiology, New York University Grossman School of Medicine, New York, NY, 10016, USA
- Thaddeus Tarpey
  - Department of Population Health, Division of Biostatistics, New York University Grossman School of Medicine, New York, NY, 10016, USA
- Stella S Yi
  - Department of Population Health Section for Health Equity, New York University Grossman School of Medicine, New York, 10016, USA
- Richard B Hayes
  - Department of Population Health, Division of Epidemiology, New York University Grossman School of Medicine, New York, NY, 10016, USA
- Huilin Li
  - Department of Population Health, Division of Biostatistics, New York University Grossman School of Medicine, New York, NY, 10016, USA
|
2
|
Zhao Y, Sun L. A stable and adaptive polygenic signal detection method based on repeated sample splitting. CAN J STAT 2023. [DOI: 10.1002/cjs.11768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
|
3
|
Hu X, Lei J. A Two-Sample Conditional Distribution Test Using Conformal Prediction and Weighted Rank Sum. J Am Stat Assoc 2023. [DOI: 10.1080/01621459.2023.2177165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2023]
Affiliation(s)
- Xiaoyu Hu
  - School of Mathematical Sciences, Center for Statistical Science, Peking University, China
- Jing Lei
  - Department of Statistics and Data Science, Carnegie Mellon University, USA
|
4
|
Choi W, Kim I. Averaging p-values under exchangeability. Stat Probab Lett 2022. [DOI: 10.1016/j.spl.2022.109748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
|
5
|
Mourtada J. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. Ann Stat 2022. [DOI: 10.1214/22-aos2181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
6
|
Mixed-effect models with trees. ADV DATA ANAL CLASSI 2022. [DOI: 10.1007/s11634-022-00509-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Abstract
Tree-based regression models are a class of statistical models for predicting continuous response variables when the shape of the regression function is unknown. They naturally take into account both non-linearities and interactions. However, they struggle with linear and quasi-linear effects and assume iid data. This article proposes two new algorithms for jointly estimating an interpretable predictive mixed-effect model with two components: a linear part, capturing the main effects, and a non-parametric component consisting of three trees for capturing non-linearities and interactions among individual-level predictors, among cluster-level predictors or cross-level. The first proposed algorithm focuses on prediction. The second one is an extension which implements a post-selection inference strategy to provide valid inference. The performance of the two algorithms is validated via Monte Carlo studies. An application on INVALSI data illustrates the potentiality of the proposed approach.
|
7
|
Conzuelo Rodriguez G, Bodnar LM, Brooks MM, Wahed A, Kennedy EH, Schisterman E, Naimi AI. Performance Evaluation of Parametric and Nonparametric Methods When Assessing Effect Measure Modification. Am J Epidemiol 2022; 191:198-207. [PMID: 34409985 DOI: 10.1093/aje/kwab220] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/08/2020] [Revised: 08/13/2021] [Accepted: 08/13/2021] [Indexed: 12/20/2022] Open
Abstract
Effect measure modification is often evaluated using parametric models. These models, although efficient when correctly specified, make strong parametric assumptions. While nonparametric models avoid important functional form assumptions, they often require larger samples to achieve a given accuracy. We conducted a simulation study to evaluate performance tradeoffs between correctly specified parametric and nonparametric models to detect effect modification of a binary exposure by both binary and continuous modifiers. We evaluated generalized linear models and doubly robust (DR) estimators, with and without sample splitting. Continuous modifiers were modeled with cubic splines, fractional polynomials, and nonparametric DR-learner. For binary modifiers, generalized linear models showed the greatest power to detect effect modification, ranging from 0.42 to 1.00 in the worst and best scenario, respectively. Augmented inverse probability weighting had the lowest power, with an increase of 23% when using sample splitting. For continuous modifiers, the DR-learner was comparable to flexible parametric models in capturing quadratic and nonlinear monotonic functions. However, for nonlinear, nonmonotonic functions, the DR-learner had lower integrated bias than splines and fractional polynomials, with values of 141.3, 251.7, and 209.0, respectively. Our findings suggest comparable performance between nonparametric and correctly specified parametric models in evaluating effect modification.
|
8
|
Zhang D, Khalili A, Asgharian M. Post-model-selection inference in linear regression models: An integrated review. STATISTICS SURVEYS 2022. [DOI: 10.1214/22-ss135] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
Affiliation(s)
- Dongliang Zhang
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Abbas Khalili
  - Department of Mathematics and Statistics, McGill University, Montréal, QC, Canada
- Masoud Asgharian
  - Department of Mathematics and Statistics, McGill University, Montréal, QC, Canada
|
9
|
Scharf F, Widmann A, Bonmassar C, Wetzel N. A tutorial on the use of temporal principal component analysis in developmental ERP research – opportunities and challenges. Dev Cogn Neurosci 2022; 54:101072. [PMID: 35123341 PMCID: PMC8819392 DOI: 10.1016/j.dcn.2022.101072] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/29/2021] [Revised: 12/02/2021] [Accepted: 01/15/2022] [Indexed: 11/06/2022] Open
Abstract
Developmental researchers are often interested in event-related potentials (ERPs). Data-analytic approaches based on the observed ERP suffer from major problems such as arbitrary definition of analysis time windows and regions of interest and the observed ERP being a mixture of latent underlying components. Temporal principal component analysis (PCA) can reduce these problems. However, its application in developmental research comes with the unique challenge that the component structure differs between age groups (so-called measurement non-invariance). Separate PCAs for the groups can cope with this challenge. We demonstrate how to make results from separate PCAs accessible for inferential statistics by re-scaling to original units. This tutorial enables readers with a focus on developmental research to conduct a PCA-based ERP analysis of amplitude differences. We explain the benefits of a PCA-based approach, introduce the PCA model and demonstrate its application to a developmental research question using real-data from a child and an adult group (code and data openly available). Finally, we discuss how to cope with typical challenges during the analysis and name potential limitations such as suboptimal decomposition results, data-driven analysis decisions and latency shifts.
|
10
|
Abstract
High-throughput technologies such as next-generation sequencing allow biologists to observe cell function with unprecedented resolution, but the resulting datasets are too large and complicated for humans to understand without the aid of advanced statistical methods. Machine learning (ML) algorithms, which are designed to automatically find patterns in data, are well suited to this task. Yet these models are often so complex as to be opaque, leaving researchers with few clues about underlying mechanisms. Interpretable machine learning (iML) is a burgeoning subdiscipline of computational statistics devoted to making the predictions of ML models more intelligible to end users. This article is a gentle and critical introduction to iML, with an emphasis on genomic applications. I define relevant concepts, motivate leading methodologies, and provide a simple typology of existing approaches. I survey recent examples of iML in genomics, demonstrating how such techniques are increasingly integrated into research workflows. I argue that iML solutions are required to realize the promise of precision medicine. However, several open challenges remain. I examine the limitations of current state-of-the-art tools and propose a number of directions for future research. While the horizon for iML in genomics is wide and bright, continued progress requires close collaboration across disciplines.
Affiliation(s)
- David S Watson
  - Department of Statistical Science, University College London, London, UK
|
11
|
Zhang Y, Bradic J. High-dimensional semi-supervised learning: in search of optimal inference of the mean. Biometrika 2021. [DOI: 10.1093/biomet/asab042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
Abstract
A fundamental challenge in semi-supervised learning lies in the observed data’s disproportional size when compared with the size of the data collected with missing outcomes. An implicit understanding is that the dataset with missing outcomes, being significantly larger, ought to improve estimation and inference. However, it is unclear to what extent this is correct. We illustrate one clear benefit: root-n inference of the outcome’s mean is possible while only requiring a consistent estimation of the outcome, possibly at a rate slower than root-n. This is achieved by a novel k-fold cross-fitted, double robust estimator. We discuss both linear and nonlinear outcomes. Such an estimator is particularly suited for models that naturally do not admit root-n consistency, such as high-dimensional, nonparametric, or semiparametric models. We apply our methods to the heterogeneous treatment effects.
Affiliation(s)
- Yuqian Zhang
  - Department of Mathematics, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093-0112, U.S.A
- Jelena Bradic
  - Department of Mathematics, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093-0112, U.S.A
|
12
|
Rathnayake RC, Olive DJ. Bootstrapping some GLM and survival regression variable selection estimators. COMMUN STAT-THEOR M 2021. [DOI: 10.1080/03610926.2021.1955389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Rasanji C. Rathnayake
  - School of Mathematical & Statistical Sciences, Southern Illinois University, Carbondale, Illinois, USA
- David J. Olive
  - School of Mathematical & Statistical Sciences, Southern Illinois University, Carbondale, Illinois, USA
|
13
|
Abstract
We propose the conditional predictive impact (CPI), a consistent and unbiased estimator of the association between one or several features and a given outcome, conditional on a reduced feature set. Building on the knockoff framework of Candès et al. (J R Stat Soc Ser B 80:551–577, 2018), we develop a novel testing procedure that works in conjunction with any valid knockoff sampler, supervised learning algorithm, and loss function. The CPI can be efficiently computed for high-dimensional data without any sparsity constraints. We demonstrate convergence criteria for the CPI and develop statistical inference procedures for evaluating its magnitude, significance, and precision. These tests aid in feature and model selection, extending traditional frequentist and Bayesian techniques to general supervised learning tasks. The CPI may also be applied in causal discovery to identify underlying multivariate graph structures. We test our method using various algorithms, including linear regression, neural networks, random forests, and support vector machines. Empirical results show that the CPI compares favorably to alternative variable importance measures and other nonparametric tests of conditional independence on a diverse array of real and synthetic datasets. Simulations confirm that our inference procedures successfully control Type I error with competitive power in a range of settings. Our method has been implemented in a package, which can be downloaded from https://github.com/dswatson/cpi.
|
14
|
Liu L, Mukherjee R, Robins JM. Rejoinder: On nearly assumption-free tests of nominal confidence interval coverage for causal parameters estimated by machine learning. Stat Sci 2020. [DOI: 10.1214/20-sts804] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
|
15
|
Rinaldo A, Wasserman L, G’Sell M. Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. Ann Stat 2019. [DOI: 10.1214/18-aos1784] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 11/19/2022]
|
16
|
Rinaldo A, Tibshirani RJ, Wasserman L. Comment: Statistical Inference from a Predictive Perspective. Stat Sci 2019. [DOI: 10.1214/19-sts748] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
|