1
Kook L, Lundborg AR. Algorithm-agnostic significance testing in supervised learning with multimodal data. Brief Bioinform 2024; 25:bbae475. [PMID: 39323092 PMCID: PMC11424510 DOI: 10.1093/bib/bbae475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2024] [Revised: 09/05/2024] [Accepted: 09/10/2024] [Indexed: 09/27/2024]
Abstract
MOTIVATION: Valid statistical inference is crucial for decision-making but difficult to obtain in supervised learning with multimodal data, e.g. combinations of clinical features, genomic data, and medical images. Multimodal data often warrants the use of black-box algorithms, for instance, random forests or neural networks, which impede the use of traditional variable significance tests. RESULTS: We address this problem by proposing the use of COvariance MEasure Tests (COMETs), which are calibrated and powerful tests that can be combined with any sufficiently predictive supervised learning algorithm. We apply COMETs to several high-dimensional, multimodal data sets to illustrate (i) variable significance testing for finding relevant mutations modulating drug-activity, (ii) modality selection for predicting survival in liver cancer patients with multiomics data, and (iii) modality selection with clinical features and medical imaging data. In all applications, COMETs yield results consistent with domain knowledge without requiring data-driven pre-processing, which may invalidate type I error control. These novel applications with high-dimensional multimodal data corroborate prior results on the power and robustness of COMETs for significance testing. AVAILABILITY AND IMPLEMENTATION: COMETs are implemented in the comets R package available on CRAN and the pycomets Python library available on GitHub. Source code for reproducing all results is available at https://github.com/LucasKook/comets. All data sets used in this work are openly available.
Affiliation(s)
- Lucas Kook
- Institute for Statistics and Mathematics, Vienna University of Economics and Business, Welthandelsplatz 1, AT-1020 Vienna, Austria
- Anton Rask Lundborg
- Department of Mathematical Sciences, University of Copenhagen, Universitetsparken 5, DK-2100 Copenhagen, Denmark
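
To make the idea of a covariance measure test in this entry concrete, here is a minimal sketch of a generalised-covariance-measure style test for the null hypothesis that Y is conditionally independent of X given Z. It is not the API of the comets or pycomets packages. Ordinary least squares stands in for the "sufficiently predictive supervised learning algorithm" as a simplifying assumption, and the function names `residuals` and `gcm_test` are hypothetical.

```python
import math
import numpy as np


def residuals(target, Z):
    """Residuals of a least-squares fit of target on Z (intercept included).

    Assumption: a linear model is a good enough regression here; in practice
    any predictive learner could be substituted.
    """
    design = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


def gcm_test(X, Y, Z):
    """Test H0: Y independent of X given Z, for univariate X and Y."""
    rx = residuals(X, Z)          # part of X not explained by Z
    ry = residuals(Y, Z)          # part of Y not explained by Z
    prod = rx * ry                # residual products
    n = len(prod)
    # Normalised mean of the residual products; approximately N(0, 1) under H0.
    stat = math.sqrt(n) * prod.mean() / prod.std()
    # Two-sided normal p-value: 2 * (1 - Phi(|stat|)) = erfc(|stat| / sqrt(2)).
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


# Simulated data (illustrative only).
rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 3))
X = Z[:, 0] + rng.normal(size=n)
Y_null = Z[:, 0] + rng.normal(size=n)      # depends on X only through Z
Y_alt = X + Z[:, 0] + rng.normal(size=n)   # depends on X directly

stat_null, p_null = gcm_test(X, Y_null, Z)
stat_alt, p_alt = gcm_test(X, Y_alt, Z)
print(f"null: stat={stat_null:.2f}, p={p_null:.3f}")
print(f"alt:  stat={stat_alt:.2f}, p={p_alt:.3g}")
```

The test only needs residuals from two regressions, which is why it can be paired with any regression method; its validity then rests on those regressions being accurate enough.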
2
Lin R, Naselaris T, Kay K, Wehbe L. Stacked regressions and structured variance partitioning for interpretable brain maps. Neuroimage 2024; 298:120772. [PMID: 39117095 DOI: 10.1016/j.neuroimage.2024.120772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2023] [Revised: 07/26/2024] [Accepted: 08/02/2024] [Indexed: 08/10/2024]
Abstract
Relating brain activity associated with a complex stimulus to different properties of that stimulus is a powerful approach for constructing functional brain maps. However, when stimuli are naturalistic, their properties are often correlated (e.g., visual and semantic features of natural images, or different layers of a convolutional neural network that are used as features of images). Correlated properties can act as confounders for each other and complicate the interpretability of brain maps, and can impact the robustness of statistical estimators. Here, we present an approach for brain mapping based on two proposed methods: stacking different encoding models and structured variance partitioning. Our stacking algorithm combines encoding models that each uses as input a feature space that describes a different stimulus attribute. The algorithm learns to predict the activity of a voxel as a linear combination of the outputs of different encoding models. We show that the resulting combined model can predict held-out brain activity better or at least as well as the individual encoding models. Further, the weights of the linear combination are readily interpretable; they show the importance of each feature space for predicting a voxel. We then build on our stacking models to introduce structured variance partitioning, a new type of variance partitioning that takes into account the known relationships between features. Our approach constrains the size of the hypothesis space and allows us to ask targeted questions about the similarity between feature spaces and brain regions even in the presence of correlations between the feature spaces. We validate our approach in simulation, showcase its brain mapping potential on fMRI data, and release a Python package. Our methods can be useful for researchers interested in aligning brain activity with different layers of a neural network, or with other types of correlated feature spaces.
Affiliation(s)
- Ruogu Lin
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States of America
- Thomas Naselaris
- Department of Neuroscience, University of Minnesota, Minneapolis, MN 55455, United States of America; Center for Magnetic Resonance Research (CMRR), Department of Radiology, University of Minnesota, Minneapolis, MN 55455, United States of America
- Kendrick Kay
- Center for Magnetic Resonance Research (CMRR), Department of Radiology, University of Minnesota, Minneapolis, MN 55455, United States of America
- Leila Wehbe
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, United States of America; Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States of America.
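
The stacking step described in this entry's abstract, predicting a voxel as a linear combination of the outputs of several encoding models, can be sketched as follows. This is an illustration under stated assumptions, not the authors' released package: ridge regression is used for each encoding model, the combination weights are fitted by unconstrained least squares, and the names `visual`, `semantic`, and `voxel` are hypothetical simulated data.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression, returning predictions for X_test."""
    d = X_train.shape[1]
    coef = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(d),
                           X_train.T @ y_train)
    return X_test @ coef


def out_of_fold_predictions(X, y, n_folds=5, alpha=1.0):
    """Predict every sample from a model trained without that sample's fold."""
    n = len(y)
    preds = np.empty(n)
    for test_idx in np.array_split(np.arange(n), n_folds):
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        preds[test_idx] = ridge_fit_predict(
            X[train_mask], y[train_mask], X[test_idx], alpha)
    return preds


def stack(feature_spaces, y):
    """One encoding model per feature space, then a linear combination."""
    P = np.column_stack([out_of_fold_predictions(X, y) for X in feature_spaces])
    # Assumption: plain least squares for the weights, with no constraints.
    weights, *_ = np.linalg.lstsq(P, y, rcond=None)
    return P, weights


# Simulated voxel driven by two feature spaces (illustrative only).
rng = np.random.default_rng(1)
n = 300
visual = rng.normal(size=(n, 5))
semantic = rng.normal(size=(n, 5))
voxel = (visual @ rng.normal(size=5)
         + 0.5 * (semantic @ rng.normal(size=5))
         + rng.normal(size=n))

P, weights = stack([visual, semantic], voxel)
stacked_mse = np.mean((voxel - P @ weights) ** 2)
single_mse = [np.mean((voxel - P[:, j]) ** 2) for j in range(P.shape[1])]
print("weights:", weights)
print("stacked MSE:", stacked_mse, "single-model MSE:", single_mse)
```

In this sketch the stacked error is measured on the same out-of-fold predictions used to fit the weights, so it cannot exceed the error of either single model; a fair comparison on held-out data would need a further split.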
3
Yang Y, Kuchibhotla AK, Tchetgen Tchetgen E. Doubly robust calibration of prediction sets under covariate shift. J R Stat Soc Series B Stat Methodol 2024; 86:943-965. [PMID: 39279914 PMCID: PMC11398884 DOI: 10.1093/jrsssb/qkae009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2022] [Revised: 01/20/2024] [Accepted: 01/28/2024] [Indexed: 09/18/2024]
Abstract
Conformal prediction has received tremendous attention in recent years and has offered new solutions to problems in missing data and causal inference; yet these advances have not leveraged modern semi-parametric efficiency theory for more efficient uncertainty quantification. We consider the problem of obtaining well-calibrated prediction regions that can data adaptively account for a shift in the distribution of covariates between training and test data. Under a covariate shift assumption analogous to the standard missing at random assumption, we propose a general framework based on efficient influence functions to construct well-calibrated prediction regions for the unobserved outcome in the test sample without compromising coverage.
Affiliation(s)
- Yachong Yang
- Department of Statistics & Data Science, University of Pennsylvania, Philadelphia, PA, USA
- Arun Kumar Kuchibhotla
- Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Eric Tchetgen Tchetgen
- Department of Statistics & Data Science, University of Pennsylvania, Philadelphia, PA, USA
4
Wang S, Yuan B, Tony Cai T, Li H. Phylogenetic association analysis with conditional rank correlation. Biometrika 2024; 111:881-902. [PMID: 39239268 PMCID: PMC11373757 DOI: 10.1093/biomet/asad075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2022] [Indexed: 09/07/2024]
Abstract
Phylogenetic association analysis plays a crucial role in investigating the correlation between microbial compositions and specific outcomes of interest in microbiome studies. However, existing methods for testing such associations have limitations related to the assumption of a linear association in high-dimensional settings and the handling of confounding effects. Hence, there is a need for methods capable of characterizing complex associations, including nonmonotonic relationships. This article introduces a novel phylogenetic association analysis framework and associated tests to address these challenges by employing conditional rank correlation as a measure of association. The proposed tests account for confounders in a fully nonparametric manner, ensuring robustness against outliers and the ability to detect diverse dependencies. The proposed framework aggregates conditional rank correlations for subtrees using weighted sum and maximum approaches to capture both dense and sparse signals. The significance level of the test statistics is determined by calibration through a nearest-neighbour bootstrapping method, which is straightforward to implement and can accommodate additional datasets when these are available. The practical advantages of the proposed framework are demonstrated through numerical experiments using both simulated and real microbiome datasets.
Affiliation(s)
- Shulei Wang
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright Street, Champaign, Illinois 61820, U.S.A
- Bo Yuan
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright Street, Champaign, Illinois 61820, U.S.A
- T Tony Cai
- Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
- Hongzhe Li
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
5
Liu L, Mukherjee R, Robins JM. Assumption-lean falsification tests of rate double-robustness of double-machine-learning estimators. Journal of Econometrics 2024; 240:105500. [PMID: 38680250 PMCID: PMC11052545 DOI: 10.1016/j.jeconom.2023.105500] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/01/2024]
Abstract
The class of doubly robust (DR) functionals studied by Rotnitzky et al. (2021) is of central importance in economics and biostatistics. It strictly includes both (i) the class of mean-square continuous functionals that can be written as an expectation of an affine functional of a conditional expectation studied by Chernozhukov et al. (2022b) and (ii) the class of functionals studied by Robins et al. (2008). The present state-of-the-art estimators for DR functionals $\psi$ are double-machine-learning (DML) estimators (Chernozhukov et al., 2018). A DML estimator $\hat{\psi}_1$ of $\psi$ depends on estimates $\hat{p}(x)$ and $\hat{b}(x)$ of a pair of nuisance functions $p(x)$ and $b(x)$, and is said to satisfy "rate double-robustness" if the Cauchy-Schwarz upper bound of its bias is $o(n^{-1/2})$. Were it achievable, our scientific goal would have been to construct valid, assumption-lean (i.e. no complexity-reducing assumptions on $b$ or $p$) tests of the validity of a nominal $(1-\alpha)$ Wald confidence interval (CI) centered at $\hat{\psi}_1$. But this would require a test of the bias to be $o(n^{-1/2})$, which can be shown not to exist. We therefore adopt the less ambitious goal of falsifying, when possible, an analyst's justification for her claim that the reported $(1-\alpha)$ Wald CI is valid. In many instances, an analyst justifies her claim by imposing complexity-reducing assumptions on $b$ and $p$ to ensure "rate double-robustness". Here we exhibit valid, assumption-lean tests of $H_0$: "rate double-robustness holds", with non-trivial power against certain alternatives. If $H_0$ is rejected, we will have falsified her justification. However, no assumption-lean test of $H_0$, including ours, can be a consistent test. Thus, the failure of our test to reject is not meaningful evidence in favor of $H_0$.
Affiliation(s)
- Lin Liu
- Institute of Natural Sciences, MOE-LSC, School of Mathematical Sciences, CMA-Shanghai, SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
- James M Robins
- Department of Epidemiology and Biostatistics, Harvard University
6
Ding M, Li R, Qin J, Ning J. A double-robust test for high-dimensional gene coexpression networks conditioning on clinical information. Biometrics 2023; 79:3227-3238. [PMID: 37312587 PMCID: PMC10838184 DOI: 10.1111/biom.13890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2022] [Accepted: 05/18/2023] [Indexed: 06/15/2023]
Abstract
It has been increasingly appealing to evaluate whether expression levels of two genes in a gene coexpression network are still dependent given samples' clinical information, in which the conditional independence test plays an essential role. For enhanced robustness regarding model assumptions, we propose a class of double-robust tests for evaluating the dependence of bivariate outcomes after controlling for known clinical information. Although the proposed test relies on the marginal density functions of bivariate outcomes given clinical information, the test remains valid as long as one of the density functions is correctly specified. Because of the closed-form variance formula, the proposed test procedure enjoys computational efficiency without requiring a resampling procedure or tuning parameters. We acknowledge the need to infer the conditional independence network with high-dimensional gene expressions, and further develop a procedure for multiple testing by controlling the false discovery rate. Numerical results show that our method accurately controls both the type-I error and false discovery rate, and it provides certain levels of robustness regarding model misspecification. We apply the method to a gastric cancer study with gene expression data to understand the associations between genes belonging to the transforming growth factor β signaling pathway given cancer-stage information.
Affiliation(s)
- Maomao Ding
- Meta Platforms, Inc., Menlo Park, California, USA
- Ruosha Li
- Department of Biostatistics and Data Science, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Jin Qin
- Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, Bethesda, Maryland, USA
- Jing Ning
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
7
Laumann F, von Kügelgen J, Park J, Schölkopf B, Barahona M. Kernel-Based Independence Tests for Causal Structure Learning on Functional Data. Entropy (Basel, Switzerland) 2023; 25:1597. [PMID: 38136477 PMCID: PMC10742995 DOI: 10.3390/e25121597] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 11/11/2023] [Accepted: 11/15/2023] [Indexed: 12/24/2023]
Abstract
Measurements of systems taken along a continuous functional dimension, such as time or space, are ubiquitous in many fields, from the physical and biological sciences to economics and engineering. Such measurements can be viewed as realisations of an underlying smooth process sampled over the continuum. However, traditional methods for independence testing and causal learning are not directly applicable to such data, as they do not take into account the dependence along the functional dimension. By using specifically designed kernels, we introduce statistical tests for bivariate, joint, and conditional independence for functional variables. Our method not only extends the applicability to functional data of the Hilbert-Schmidt independence criterion (HSIC) and its d-variate version (d-HSIC), but also allows us to introduce a test for conditional independence by defining a novel statistic for the conditional permutation test (CPT) based on the Hilbert-Schmidt conditional independence criterion (HSCIC), with optimised regularisation strength estimated through an evaluation rejection rate. Our empirical results of the size and power of these tests on synthetic functional data show good performance, and we then exemplify their application to several constraint- and regression-based causal structure learning problems, including both synthetic examples and real socioeconomic data.
Affiliation(s)
- Felix Laumann
- Department of Mathematics, Imperial College London, London SW7 2BX, UK
- Julius von Kügelgen
- Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany
- Department of Engineering, University of Cambridge, Cambridge CB2 0QQ, UK
- Junhyung Park
- Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany
- Mauricio Barahona
- Department of Mathematics, Imperial College London, London SW7 2BX, UK
8
Petersen AH, Ekstrøm CT, Spirtes P, Osler M. Constructing Causal Life-Course Models: Comparative Study of Data-Driven and Theory-Driven Approaches. Am J Epidemiol 2023; 192:1917-1927. [PMID: 37344193 PMCID: PMC11004942 DOI: 10.1093/aje/kwad144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Revised: 04/04/2023] [Accepted: 06/19/2023] [Indexed: 06/23/2023]
Abstract
Life-course epidemiology relies on specifying complex (causal) models that describe how variables interplay over time. Traditionally, such models have been constructed by perusing existing theory and previous studies. By comparing data-driven and theory-driven models, we investigated whether data-driven causal discovery algorithms can help in this process. We focused on a longitudinal data set on a cohort of Danish men (the Metropolit Study, 1953-2017). The theory-driven models were constructed by 2 subject-field experts. The data-driven models were constructed by use of the temporal Peter-Clark (TPC) algorithm. The TPC algorithm utilizes the temporal information embedded in life-course data. We found that the data-driven models recovered some, but not all, causal relationships included in the theory-driven expert models. The data-driven method was especially good at identifying direct causal relationships that the experts had high confidence in. Moreover, in a post hoc assessment, we found that most of the direct causal relationships proposed by the data-driven model but not included in the theory-driven model were plausible. Thus, the data-driven model may propose additional meaningful causal hypotheses that are new or have been overlooked by the experts. In conclusion, data-driven methods can aid causal model construction in life-course epidemiology, and combining both data-driven and theory-driven methods can lead to even stronger models.
Affiliation(s)
- Anne Helby Petersen
- Correspondence to Dr. Anne Helby Petersen, Section of Biostatistics, Department of Public Health, Faculty of Health and Medical Sciences, University of Copenhagen, Øster Farimagsgade 5, 1353 Copenhagen K, Denmark
9
Qiu H, Dobriban E, Tchetgen Tchetgen E. Prediction sets adaptive to unknown covariate shift. J R Stat Soc Series B Stat Methodol 2023; 85:1680-1705. [PMID: 38312527 PMCID: PMC10837005 DOI: 10.1093/jrsssb/qkad069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2022] [Revised: 06/17/2023] [Accepted: 06/20/2023] [Indexed: 02/06/2024]
Abstract
Predicting sets of outcomes, instead of unique outcomes, is a promising solution to uncertainty quantification in statistical learning. Despite a rich literature on constructing prediction sets with statistical guarantees, adapting to unknown covariate shift, a prevalent issue in practice, poses a serious unsolved challenge. In this article, we show that prediction sets with finite-sample coverage guarantee are uninformative and propose a novel flexible distribution-free method, PredSet-1Step, to efficiently construct prediction sets with an asymptotic coverage guarantee under unknown covariate shift. We formally show that our method is asymptotically probably approximately correct, having well-calibrated coverage error with high confidence for large samples. We illustrate that it achieves nominal coverage in a number of experiments and a data set concerning HIV risk prediction in a South African cohort study. Our theory hinges on a new bound for the convergence rate of the coverage of Wald confidence intervals based on general asymptotically linear estimators.
Affiliation(s)
- Hongxiang Qiu
- Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
- Edgar Dobriban
- Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
- Eric Tchetgen Tchetgen
- Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
10
Györfi L, Linder T, Walk H. Lossless Transformations and Excess Risk Bounds in Statistical Inference. Entropy (Basel, Switzerland) 2023; 25:1394. [PMID: 37895515 PMCID: PMC10606681 DOI: 10.3390/e25101394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2023] [Revised: 09/26/2023] [Accepted: 09/26/2023] [Indexed: 10/29/2023]
Abstract
We study the excess minimum risk in statistical inference, defined as the difference between the minimum expected loss when estimating a random variable from an observed feature vector and the minimum expected loss when estimating the same random variable from a transformation (statistic) of the feature vector. After characterizing lossless transformations, i.e., transformations for which the excess risk is zero for all loss functions, we construct a partitioning test statistic for the hypothesis that a given transformation is lossless, and we show that for i.i.d. data the test is strongly consistent. More generally, we develop information-theoretic upper bounds on the excess risk that uniformly hold over fairly general classes of loss functions. Based on these bounds, we introduce the notion of a δ-lossless transformation and give sufficient conditions for a given transformation to be universally δ-lossless. Applications to classification, nonparametric regression, portfolio strategies, information bottlenecks, and deep learning are also surveyed.
Affiliation(s)
- László Györfi
- Department of Computer Science and Information Theory, Budapest University of Technology and Economics, H-1111 Budapest, Hungary
- Tamás Linder
- Department of Mathematics and Statistics, Queen’s University, Kingston, ON K7L 3N6, Canada
- Harro Walk
- Fachbereich Mathematik, Universität Stuttgart, 70569 Stuttgart, Germany
11
Shi C, Zhou Y, Li L. Testing Directed Acyclic Graph via Structural, Supervised and Generative Adversarial Learning. J Am Stat Assoc 2023; 119:1833-1846. [PMID: 39416711 PMCID: PMC11474452 DOI: 10.1080/01621459.2023.2220169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2021] [Revised: 03/02/2023] [Accepted: 05/21/2023] [Indexed: 10/19/2024]
Abstract
In this article, we propose a new hypothesis testing method for directed acyclic graphs (DAGs). While there is a rich class of DAG estimation methods, there is a relative paucity of DAG inference solutions. Moreover, the existing methods often impose some specific model structures such as linear models or additive models, and assume independent data observations. Our proposed test instead allows the associations among the random variables to be nonlinear and the data to be time-dependent. We build the test based on some highly flexible neural network learners. We establish the asymptotic guarantees of the test, while allowing either the number of subjects or the number of time points for each subject to diverge to infinity. We demonstrate the efficacy of the test through simulations and a brain connectivity network analysis.
Affiliation(s)
- Lexin Li
- University of California at Berkeley
12
Wen Y, Huang J, Guo S, Elyahu Y, Monsonego A, Zhang H, Ding Y, Zhu H. Applying causal discovery to single-cell analyses using CausalCell. eLife 2023; 12:e81464. [PMID: 37129360 PMCID: PMC10229139 DOI: 10.7554/elife.81464] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2022] [Accepted: 05/01/2023] [Indexed: 05/03/2023]
Abstract
Correlation between objects is prone to occur coincidentally, and exploring correlation or association in most situations does not answer scientific questions rich in causality. Causal discovery (also called causal inference) infers causal interactions between objects from observational data. Reported causal discovery methods and single-cell datasets make applying causal discovery to single cells a promising direction. However, evaluating and choosing causal discovery methods and developing and performing a proper workflow remain challenges. We report the workflow and platform CausalCell (http://www.gaemons.net/causalcell/causalDiscovery/) for performing single-cell causal discovery. The workflow/platform is developed upon benchmarking four kinds of causal discovery methods and is examined by analyzing multiple single-cell RNA-sequencing (scRNA-seq) datasets. Our results suggest that different situations need different methods and that the constraint-based PC algorithm with kernel-based conditional independence tests works best in most situations. Related issues are discussed and tips for best practices are given. Inferred causal interactions in single cells provide valuable clues for investigating molecular interactions and gene regulations, identifying critical diagnostic and therapeutic targets, and designing experimental and clinical interventions.
Affiliation(s)
- Yujian Wen
- Bioinformatics Section, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Jielong Huang
- Bioinformatics Section, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Shuhui Guo
- Bioinformatics Section, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Yehezqel Elyahu
- The Shraga Segal Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Alon Monsonego
- The Shraga Segal Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Hai Zhang
- Network Center, Southern Medical University, Guangzhou, China
- Yanqing Ding
- Department of Pathology, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Hao Zhu
- Bioinformatics Section, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Guangdong-Hong Kong-Macao Greater Bay Area Center for Brain Science and Brain-Inspired Intelligence, Southern Medical University, Guangzhou, China
- Guangdong Provincial Key Lab of Single Cell Technology and Application, Southern Medical University, Guangzhou, China
13
Zhang B, Suzuki J. Extending Hilbert-Schmidt Independence Criterion for Testing Conditional Independence. Entropy (Basel, Switzerland) 2023; 25:425. [PMID: 36981314 PMCID: PMC10047653 DOI: 10.3390/e25030425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Revised: 02/22/2023] [Accepted: 02/25/2023] [Indexed: 06/18/2023]
Abstract
The Conditional Independence (CI) test is a fundamental problem in statistics. Many nonparametric CI tests have been developed, but a common challenge exists: the current methods perform poorly with a high-dimensional conditioning set. In this paper, we considered a nonparametric CI test using a kernel-based test statistic, which can be viewed as an extension of the Hilbert-Schmidt Independence Criterion (HSIC). We propose a local bootstrap method to generate samples from the null distribution H0:X⫫Y∣Z. The experimental results showed that our proposed method led to a significant performance improvement compared with previous methods. In particular, our method performed well against the growth of the dimension of the conditioning set. Meanwhile, our method can be computed efficiently against the growth of the sample size and the dimension of the conditioning set.
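
As background for the kernel statistic that this abstract extends, the following sketch computes the standard (unconditional) HSIC with Gaussian kernels and calibrates it with an ordinary permutation test. It deliberately does not implement the conditional extension or the local bootstrap proposed in the paper. The bandwidth of 1.0, the number of permutations, and the function names are illustrative assumptions.

```python
import numpy as np


def gaussian_gram(x, bandwidth=1.0):
    """Gaussian kernel Gram matrix for a 1-D sample."""
    sq_dists = (x[:, None] - x[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / n^2, with H the centring matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2


def hsic_permutation_test(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0: x and y are independent."""
    rng = np.random.default_rng(seed)
    K, L = gaussian_gram(x), gaussian_gram(y)
    observed = hsic(K, L)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(y))
        # Permuting y corresponds to permuting rows and columns of L.
        if hsic(K, L[np.ix_(perm, perm)]) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)


# Nonlinear dependence that a plain correlation test would miss.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = x ** 2 + 0.1 * rng.normal(size=100)

stat, p_value = hsic_permutation_test(x, y)
print(f"HSIC = {stat:.4f}, permutation p-value = {p_value:.4f}")
```

Permuting y breaks any dependence on x while keeping both marginals fixed, which is valid for the unconditional null only; under a conditional null the resampling must also preserve the relationship with Z, which is the role of the local bootstrap mentioned in the abstract.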
14
Learning to increase the power of conditional randomization tests. Mach Learn 2023. [DOI: 10.1007/s10994-023-06302-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
15
Kim I, Neykov M, Balakrishnan S, Wasserman L. Local permutation tests for conditional independence. Ann Stat 2022. [DOI: 10.1214/22-aos2233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
Affiliation(s)
- Ilmun Kim
- Department of Statistics and Data Science, Department of Applied Statistics, Yonsei University
- Matey Neykov
- Department of Statistics and Data Science, Carnegie Mellon University
- Larry Wasserman
- Department of Statistics and Data Science, Carnegie Mellon University
16
Lundborg AR, Shah RD, Peters J. Conditional independence testing in Hilbert spaces with applications to functional data analysis. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
17
Zan L, Meynaoui A, Assaad CK, Devijver E, Gaussier E. A Conditional Mutual Information Estimator for Mixed Data and an Associated Conditional Independence Test. Entropy (Basel, Switzerland) 2022; 24:1234. [PMID: 36141120 PMCID: PMC9498172 DOI: 10.3390/e24091234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 08/26/2022] [Accepted: 08/31/2022] [Indexed: 06/16/2023]
Abstract
In this study, we focus on mixed data which are either observations of univariate random variables which can be quantitative or qualitative, or observations of multivariate random variables such that each variable can include both quantitative and qualitative components. We first propose a novel method, called CMIh, to estimate conditional mutual information taking advantages of the previously proposed approaches for qualitative and quantitative data. We then introduce a new local permutation test, called LocAT for local adaptive test, which is well adapted to mixed data. Our experiments illustrate the good behaviour of CMIh and LocAT, and show their respective abilities to accurately estimate conditional mutual information and to detect conditional (in)dependence for mixed data.
Affiliation(s)
- Lei Zan
- Department of Mathematics, Information and Communication Sciences, Université Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
- R&D Department, EasyVista, 38000 Grenoble, France
- Anouar Meynaoui
- Department of Mathematics, Information and Communication Sciences, Université Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
- Emilie Devijver
- Department of Mathematics, Information and Communication Sciences, Université Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
- Eric Gaussier
- Department of Mathematics, Information and Communication Sciences, Université Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
|
18
|
Spisak T. Statistical quantification of confounding bias in machine learning models. Gigascience 2022; 11:giac082. [PMID: 36017878 PMCID: PMC9412867 DOI: 10.1093/gigascience/giac082] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 04/25/2022] [Revised: 07/07/2022] [Accepted: 07/28/2022] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND The lack of nonparametric statistical tests for confounding bias significantly hampers the development of robust, valid, and generalizable predictive models in many fields of research. Here I propose the partial confounder test, which, for a given confounder variable, probes the null hypothesis that the model is unconfounded. RESULTS The test provides strict control of type I errors and high statistical power, even for nonnormally and nonlinearly dependent predictions, often seen in machine learning. Applying the proposed test to models trained on large-scale functional brain connectivity data (N = 1,865) (i) reveals previously unreported confounders and (ii) shows that state-of-the-art confound mitigation approaches may fail to prevent confounder bias in several cases. CONCLUSIONS The proposed test (implemented in the package mlconfound; https://mlconfound.readthedocs.io) can aid the assessment and improvement of the generalizability and validity of predictive models and thereby foster the development of clinically useful machine learning biomarkers.
Affiliation(s)
- Tamas Spisak
- Center for Translational Neuro- and Behavioral Sciences, Institute for Diagnostic and Interventional Radiology and Neuroradiology, Center University Hospital Essen, Essen, D-45147, Germany
|
19
|
Mielniczuk J. Information Theoretic Methods for Variable Selection-A Review. Entropy (Basel, Switzerland) 2022; 24:1079. [PMID: 36010742 PMCID: PMC9407310 DOI: 10.3390/e24081079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Revised: 08/02/2022] [Accepted: 08/02/2022] [Indexed: 02/05/2023]
Abstract
We review the principal information theoretic tools and their use for feature selection, with the main emphasis on classification problems with discrete features. Since it is known that empirical versions of conditional mutual information perform poorly for high-dimensional problems, we focus on various ways of constructing its counterparts and the properties and limitations of such methods. We present a unified way of constructing such measures based on truncation, or truncation and weighing, for the Möbius expansion of conditional mutual information. We also discuss the main approaches to feature selection which apply the introduced measures of conditional dependence, together with the ways of assessing the quality of the obtained vector of predictors. This involves discussion of recent results on asymptotic distributions of empirical counterparts of criteria, as well as advances in resampling.
Affiliation(s)
- Jan Mielniczuk
- Institute of Computer Science, Polish Academy of Sciences, Jana Kazimierza 5, 01-248 Warsaw, Poland
- Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
|
20
|
Squires C, Uhler C. Causal Structure Learning: A Combinatorial Perspective. Foundations of Computational Mathematics (New York, N.Y.) 2022; 23:1-35. [PMID: 35935470 PMCID: PMC9342837 DOI: 10.1007/s10208-022-09581-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/28/2022] [Accepted: 06/08/2022] [Indexed: 05/29/2023]
Abstract
In this review, we discuss approaches for learning causal structure from data, also called causal discovery. In particular, we focus on approaches for learning directed acyclic graphs and various generalizations which allow for some variables to be unobserved in the available data. We devote special attention to two fundamental combinatorial aspects of causal structure learning. First, we discuss the structure of the search space over causal graphs. Second, we discuss the structure of equivalence classes over causal graphs, i.e., sets of graphs which represent what can be learned from observational data alone, and how these equivalence classes can be refined by adding interventional data.
Affiliation(s)
- Caroline Uhler
- Broad Institute and Massachusetts Institute of Technology, Cambridge, MA 02139 USA
|
21
|
Shi H, Hallin M, Drton M, Han F. On universally consistent and fully distribution-free rank tests of vector independence. Ann Stat 2022. [DOI: 10.1214/21-aos2151] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
Affiliation(s)
- Hongjian Shi
- Department of Mathematics, Technical University of Munich
- Marc Hallin
- ECARES and Department of Mathematics, Université Libre de Bruxelles
- Mathias Drton
- Department of Mathematics, Technical University of Munich
- Fang Han
- Department of Statistics, University of Washington
|
22
|
On Falsifiable Statistical Hypotheses. Philosophies 2022. [DOI: 10.3390/philosophies7020040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
Popper argued that a statistical falsification required a prior methodological decision to regard sufficiently improbable events as ruled out. That suggestion has generated a number of fruitful approaches, but also a number of apparent paradoxes and ultimately, no clear consensus. It is still commonly claimed that, since random samples are logically consistent with all the statistical hypotheses on the table, falsification simply does not apply in realistic statistical settings. We claim that the situation is considerably improved if we ask a conceptually prior question: when should a statistical hypothesis be regarded as falsifiable. To that end we propose several different notions of statistical falsifiability and prove that, whichever definition we prefer, the same hypotheses turn out to be falsifiable. That shows that statistical falsifiability enjoys a kind of conceptual robustness. These notions of statistical falsifiability are arrived at by proposing statistical analogues to intuitive properties enjoyed by exemplary falsifiable hypotheses familiar from classical philosophy of science. That demonstrates that, to a large extent, this philosophical tradition was on the right conceptual track. Finally, we demonstrate that, under weak assumptions, the statistically falsifiable hypotheses correspond precisely to the closed sets in a standard topology on probability measures. That means that standard techniques from statistics and measure theory can be used to determine exactly which hypotheses are statistically falsifiable. In other words: the proposed notion of statistical falsifiability both answers to our conceptual demands and submits to standard mathematical techniques.
|
23
|
Causal discovery with a mixture of DAGs. Mach Learn 2022. [DOI: 10.1007/s10994-022-06159-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
24
|
Sun J, Wu C, Peng W, Huang J, Han C, Zhu Y, Lyu Y. Mining human preference via self-correction causal structure learning. Sci Rep 2022; 12:5051. [PMID: 35322096 PMCID: PMC8942159 DOI: 10.1038/s41598-022-08879-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/09/2021] [Accepted: 03/15/2022] [Indexed: 01/11/2023] Open
Abstract
Spurred by the ability of causal structure learning (CSL) to reveal cause-effect connections, significant research efforts have been made to enhance the scalability of CSL algorithms in various artificial intelligence applications. However, less effort has been devoted to the stability and interpretability of CSL algorithms. Thus, this work proposes a self-correction mechanism that embeds domain knowledge for CSL, improving stability and accuracy even in low-dimensional but high-noise environments by guaranteeing a meaningful output. The suggested algorithm is compared against multiple classic and influential CSL algorithms on synthesized and field datasets. Our algorithm achieves superior accuracy on the synthesized dataset, while on the field dataset, our method interprets the learned causal structure as a human preference for investment, coinciding with domain expert analysis.
Affiliation(s)
- Jian Sun
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Chenye Wu
- School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Shenzhen, Guangdong, China
- Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, Guangdong, China
|
25
|
Hines O, Dukes O, Diaz-Ordaz K, Vansteelandt S. Demystifying statistical learning based on efficient influence functions. Am Stat 2022. [DOI: 10.1080/00031305.2021.2021984] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/09/2023]
Affiliation(s)
- Oliver Hines
- Department of Medical Statistics, London School of Hygiene and Tropical Medicine, UK
- Oliver Dukes
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
- Karla Diaz-Ordaz
- Department of Medical Statistics, London School of Hygiene and Tropical Medicine, UK
- Stijn Vansteelandt
- Department of Medical Statistics, London School of Hygiene and Tropical Medicine, UK
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
|
26
|
Katsevich E, Ramdas A. On the power of conditional independence testing under model-X. Electron J Stat 2022. [DOI: 10.1214/22-ejs2085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
Affiliation(s)
- Eugene Katsevich
- Department of Statistics and Data Science, University of Pennsylvania
- Aaditya Ramdas
- Department of Statistics and Data Science, Carnegie Mellon University, Machine Learning Department, Carnegie Mellon University
|
27
|
Shah RD, Bühlmann P. Double-Estimation-Friendly Inference for High-Dimensional Misspecified Models. Stat Sci 2022. [DOI: 10.1214/22-sts850] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Affiliation(s)
- Rajen D. Shah
- Rajen D. Shah is Professor of Statistics, Statistical Laboratory, University of Cambridge, Cambridge, United Kingdom
- Peter Bühlmann
- Peter Bühlmann is Professor of Statistics, Seminar for Statistics, ETH Zurich, Zurich, Switzerland
|
28
|
|
29
|
Berrett TB, Kontoyiannis I, Samworth RJ. Optimal rates for independence testing via U-statistic permutation tests. Ann Stat 2021. [DOI: 10.1214/20-aos2041] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/19/2022]
|
30
|
Lecca P. Machine Learning for Causal Inference in Biological Networks: Perspectives of This Challenge. Frontiers in Bioinformatics 2021; 1:746712. [PMID: 36303798 PMCID: PMC9581010 DOI: 10.3389/fbinf.2021.746712] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/24/2021] [Accepted: 09/08/2021] [Indexed: 11/13/2022] Open
Abstract
Most machine learning-based methods predict outcomes rather than understanding causality. Machine learning methods have proved efficient at finding correlations in data, but poor at determining causation. This issue severely limits the applicability of machine learning methods to inferring the causal relationships between the entities of a biological network and, more generally, of any dynamical system that is representable as a network, such as a system of medical intervention strategies and clinical outcomes. For those who want to use the results of network inference not only to understand the mechanisms underlying the dynamics, but also to understand how the network reacts to external stimuli (e.g. environmental factors, therapeutic treatments), tools that can capture the causal relationships in data are in high demand. Given the increasing popularity of machine learning techniques in computational biology and the recent literature proposing the use of machine learning techniques for the inference of biological networks, we present the challenges that mathematics and computer science research faces in generalising machine learning to an approach capable of understanding causal relationships, and the prospects that achieving this will open up for the medical application domains of systems biology, the main paradigm of which is precisely network biology at any physical scale.
Affiliation(s)
- Paola Lecca
- Faculty of Computer Science, Free University of Bozen-Bolzano, Piazza Domenicani, Bolzano, Italy
|
31
|
Petersen AH, Osler M, Ekstrøm CT. Data-Driven Model Building for Life-Course Epidemiology. Am J Epidemiol 2021; 190:1898-1907. [PMID: 33778840 DOI: 10.1093/aje/kwab087] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/02/2020] [Revised: 03/23/2021] [Accepted: 03/23/2021] [Indexed: 01/15/2023] Open
Abstract
Life-course epidemiology is useful for describing and analyzing complex etiological mechanisms for disease development, but existing statistical methods are essentially confirmatory, because they rely on a priori model specification. This limits the scope of causal inquiries that can be made, because these methods are suited mostly to examine well-known hypotheses that do not question our established view of health, which could lead to confirmation bias. We propose an exploratory alternative. Instead of specifying a life-course model prior to data analysis, our method infers the life-course model directly from the data. Our proposed method extends the well-known Peter-Clark (PC) algorithm (named after its authors) for causal discovery, and it facilitates including temporal information for inferring a model from observational data. The extended algorithm is called temporal PC. The obtained life-course model can afterward be perused for interesting causal hypotheses. Our method complements classical confirmatory methods and guides researchers in expanding their models in new directions. We showcase the method using a data set encompassing almost 3,000 Danish men followed from birth until age 65 years. Using this data set, we inferred life-course models for the role of socioeconomic and health-related factors on development of depression.
|
32
|
Abstract
We propose the conditional predictive impact (CPI), a consistent and unbiased estimator of the association between one or several features and a given outcome, conditional on a reduced feature set. Building on the knockoff framework of Candès et al. (J R Stat Soc Ser B 80:551–577, 2018), we develop a novel testing procedure that works in conjunction with any valid knockoff sampler, supervised learning algorithm, and loss function. The CPI can be efficiently computed for high-dimensional data without any sparsity constraints. We demonstrate convergence criteria for the CPI and develop statistical inference procedures for evaluating its magnitude, significance, and precision. These tests aid in feature and model selection, extending traditional frequentist and Bayesian techniques to general supervised learning tasks. The CPI may also be applied in causal discovery to identify underlying multivariate graph structures. We test our method using various algorithms, including linear regression, neural networks, random forests, and support vector machines. Empirical results show that the CPI compares favorably to alternative variable importance measures and other nonparametric tests of conditional independence on a diverse array of real and synthetic datasets. Simulations confirm that our inference procedures successfully control Type I error with competitive power in a range of settings. Our method has been implemented in a package, which can be downloaded from https://github.com/dswatson/cpi.
|
33
|
Neykov M, Balakrishnan S, Wasserman L. Minimax optimal conditional independence testing. Ann Stat 2021. [DOI: 10.1214/20-aos2030] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Affiliation(s)
- Matey Neykov
- Department of Statistics & Data Science, Carnegie Mellon University
- Larry Wasserman
- Department of Statistics & Data Science, Carnegie Mellon University
|
34
|
Gao L, Fan Y, Lv J, Shao QM. Asymptotic distributions of high-dimensional distance correlation inference. Ann Stat 2021; 49:1999-2020. [PMID: 34621096 PMCID: PMC8491772 DOI: 10.1214/20-aos2024] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]
Abstract
Distance correlation has become an increasingly popular tool for detecting the nonlinear dependence between a pair of potentially high-dimensional random vectors. Most existing works have explored its asymptotic distributions under the null hypothesis of independence between the two random vectors when only the sample size or the dimensionality diverges. Yet its asymptotic null distribution for the more realistic setting when both sample size and dimensionality diverge in the full range remains largely underdeveloped. In this paper, we fill such a gap and develop central limit theorems and associated rates of convergence for a rescaled test statistic based on the bias-corrected distance correlation in high dimensions under some mild regularity conditions and the null hypothesis. Our new theoretical results reveal an interesting phenomenon of blessing of dimensionality for high-dimensional distance correlation inference in the sense that the accuracy of normal approximation can increase with dimensionality. Moreover, we provide a general theory on the power analysis under the alternative hypothesis of dependence, and further justify the capability of the rescaled distance correlation in capturing the pure nonlinear dependency under moderately high dimensionality for a certain type of alternative hypothesis. The theoretical results and finite-sample performance of the rescaled statistic are illustrated with several simulation examples and a blockchain application.
Affiliation(s)
- Lan Gao
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California
- Yingying Fan
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California
- Jinchi Lv
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California
- Qi-Man Shao
- Department of Statistics and Data Science, Southern University of Science and Technology
- Department of Statistics, The Chinese University of Hong Kong
|
35
|
Sazal M, Stebliankin V, Mathee K, Yoo C, Narasimhan G. Causal effects in microbiomes using interventional calculus. Sci Rep 2021; 11:5724. [PMID: 33707536 PMCID: PMC7970971 DOI: 10.1038/s41598-021-84905-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/09/2020] [Accepted: 02/23/2021] [Indexed: 01/31/2023] Open
Abstract
Causal inference in biomedical research allows us to shift the paradigm from investigating associational relationships to causal ones. Inferring causal relationships can help in understanding the inner workings of biological processes. Association patterns can be coincidental and may lead to wrong conclusions about causality in complex systems. Microbiomes are highly complex, diverse, and dynamic environments. Microbes are key players in human health and disease. Hence knowledge of critical causal relationships among the entities in a microbiome, and of the impact of internal and external factors on microbial abundance and their interactions, is essential for understanding disease mechanisms and making appropriate treatment recommendations. In this paper, we employ causal inference techniques to understand causal relationships between various entities in a microbiome, and to use the resulting causal network to make useful computations. We introduce a novel pipeline for microbiome analysis, which includes adding an outcome or "disease" variable, and then computing the causal network, referred to as a "disease network", with the goal of identifying disease-relevant causal factors from the microbiome. Interventional techniques are then applied to the resulting network, allowing us to compute a measure called the causal effect of one or more microbial taxa on the outcome variable or the condition of interest. Finally, we propose a measure called causal influence that quantifies the total influence exerted by a microbial taxon on the rest of the microbiome. Our pipeline is robust, sensitive, different from traditional approaches, and able to predict interventional effects without any controlled experiments. The pipeline can be used to identify potential eubiotic and dysbiotic microbial taxa in a microbiome. We validate our results using synthetic data sets and using results on real data sets that were previously published.
Affiliation(s)
- Musfiqur Sazal
- Bioinformatics Research Group (BioRG), Florida International University, Miami, 33199 USA
- Vitalii Stebliankin
- Bioinformatics Research Group (BioRG), Florida International University, Miami, 33199 USA
- Kalai Mathee
- Herbert Wertheim College of Medicine, Florida International University, Miami, 33199 USA; Biomolecular Sciences Institute, Florida International University, Miami, 33199 USA
- Changwon Yoo
- Department of Biostatistics, Florida International University, Miami, 33199 USA
- Giri Narasimhan
- Bioinformatics Research Group (BioRG), Florida International University, Miami, 33199 USA; Biomolecular Sciences Institute, Florida International University, Miami, 33199 USA
|
36
|
|
37
|
Raket LL, Kühnel L, Schmidt E, Blennow K, Zetterberg H, Mattsson-Carlgren N. Utility of plasma neurofilament light and total tau for clinical trials in Alzheimer's disease. Alzheimer's & Dementia (Amsterdam, Netherlands) 2020; 12:e12099. [PMID: 32995466 PMCID: PMC7507310 DOI: 10.1002/dad2.12099] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 06/08/2020] [Revised: 07/30/2020] [Accepted: 07/30/2020] [Indexed: 11/07/2022]
Abstract
INTRODUCTION Several blood-based biomarkers are associated with neuronal injury, but their utility in interventional clinical trials is unclear. This study retrospectively evaluated the utility of plasma neurofilament light (NfL) and total tau (t-tau) in an 18-month trial in mild Alzheimer's disease (AD). METHODS Correlation and conditional independence analyses and Gaussian graphical models were used to investigate cross-sectional and longitudinal relations between NfL, t-tau, and clinical scales. RESULTS NfL had a stronger association than t-tau with clinical scales; t-tau did not hold additional information to that given by NfL (P > 0.05 at all time points). NfL held independent information about shorter-term (3- to 6-month) progression beyond patient age and clinical scores. However, no meaningful gain in power was found when adjusting a longitudinal analysis of cognitive scores for baseline NfL. DISCUSSION Plasma NfL is superior to t-tau in mild AD. The ability of NfL to detect changes before clinical manifestations makes it a promising biomarker of drug response in trials of disease-modifying drugs.
Affiliation(s)
- Lars Lau Raket
- H. Lundbeck A/S, Valby, Denmark
- Clinical Memory Research Unit, Department of Clinical Sciences, Lund University, Malmö, Sweden
- Line Kühnel
- H. Lundbeck A/S, Valby, Denmark
- Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
- Kaj Blennow
- Clinical Neurochemistry Laboratory, Sahlgrenska University Hospital, Mölndal, Sweden
- Department of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, the Sahlgrenska Academy at the University of Gothenburg, Mölndal, Sweden
- Henrik Zetterberg
- Clinical Neurochemistry Laboratory, Sahlgrenska University Hospital, Mölndal, Sweden
- Department of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, the Sahlgrenska Academy at the University of Gothenburg, Mölndal, Sweden
- Department of Neurodegenerative Disease, UCL Institute of Neurology, London, UK
- UK Dementia Research Institute at UCL, London, UK
- Niklas Mattsson-Carlgren
- Clinical Memory Research Unit, Department of Clinical Sciences, Lund University, Malmö, Sweden
- Department of Neurology, Skåne University Hospital, Lund, Sweden
- Wallenberg Centre for Molecular Medicine, Lund University, Lund, Sweden
|
38
|
Weichwald S, Peters J. Causality in Cognitive Neuroscience: Concepts, Challenges, and Distributional Robustness. J Cogn Neurosci 2020; 33:226-247. [PMID: 32812827 DOI: 10.1162/jocn_a_01623] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 11/04/2022]
Abstract
Whereas probabilistic models describe the dependence structure between observed variables, causal models go one step further: They predict, for example, how cognitive functions are affected by external interventions that perturb neuronal activity. In this review and perspective article, we introduce the concept of causality in the context of cognitive neuroscience and review existing methods for inferring causal relationships from data. Causal inference is an ambitious task that is particularly challenging in cognitive neuroscience. We discuss two difficulties in more detail: the scarcity of interventional data and the challenge of finding the right variables. We argue for distributional robustness as a guiding principle to tackle these problems. Robustness (or invariance) is a fundamental principle underlying causal methodology. A (correctly specified) causal model of a target variable generalizes across environments or subjects as long as these environments leave the causal mechanisms of the target intact. Consequently, if a candidate model does not generalize, then either it does not consist of the target variable's causes or the underlying variables do not represent the correct granularity of the problem. In this sense, assessing generalizability may be useful when defining relevant variables and can be used to partially compensate for the lack of interventional data.
|
39
|
Guo FR, Richardson TS. On testing marginal versus conditional independence. Biometrika 2020. [DOI: 10.1093/biomet/asaa040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
Summary
We consider testing marginal independence versus conditional independence in a trivariate Gaussian setting. The two models are nonnested, and their intersection is a union of two marginal independences. We consider two sequences of such models, one from each type of independence, that are closest to each other in the Kullback–Leibler sense as they approach the intersection. They become indistinguishable if the signal strength, as measured by the product of two correlation parameters, decreases faster than the standard parametric rate. Under local alternatives at such a rate, we show that the asymptotic distribution of the likelihood ratio depends on where and how the local alternatives approach the intersection. To deal with this nonuniformity, we study a class of envelope distributions by taking pointwise suprema over asymptotic cumulative distribution functions. We show that these envelope distributions are well behaved and lead to model selection procedures with rate-free uniform error guarantees and near-optimal power. To control the error even when the two models are indistinguishable, rather than insist on a dichotomous choice, the proposed procedure will choose either or both models.
Affiliation(s)
- F Richard Guo
- Department of Statistics, University of Washington, Box 354322, Seattle, Washington 98195, U.S.A
- Thomas S Richardson
- Department of Statistics, University of Washington, Box 354322, Seattle, Washington 98195, U.S.A
|
40
|
Berrett TB, Wang Y, Barber RF, Samworth RJ. The conditional permutation test for independence while controlling for confounders. J R Stat Soc Series B Stat Methodol 2019. [DOI: 10.1111/rssb.12340] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 11/27/2022]
|