1
Chaibub Neto E. Causality-Aware Predictions in Static Anticausal Machine Learning Tasks. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:5039-5053. [PMID: 36103435 DOI: 10.1109/tnnls.2022.3202151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
We propose a counterfactual approach to train "causality-aware" predictive models that are able to leverage causal information in static anticausal machine learning tasks (i.e., prediction tasks where the outcome influences the inputs). In applications plagued by confounding, the approach can be used to generate predictions that are free from the influence of observed confounders. In applications involving observed mediators, the approach can be used to generate predictions that only capture the direct or the indirect causal influences. Mechanistically, we train supervised learners on (counterfactually) simulated inputs that retain only the associations generated by the causal relations of interest. We focus on linear models, where analytical results connecting covariances, causal effects, and prediction mean square errors are readily available. Quite importantly, we show that our approach does not require knowledge of the full causal graph. It suffices to know which variables represent potential confounders and/or mediators. We investigate the stability of the method with respect to dataset shifts generated by selection biases and also relax the linearity assumption by extending the approach to additive models better able to account for nonlinearities in the data. We validate our approach in a series of synthetic data experiments and illustrate its application to a real dataset.
2
Zhou F, He K, Ni Y. Individualized causal discovery with latent trajectory embedded Bayesian networks. Biometrics 2023; 79:3191-3202. [PMID: 36807295 DOI: 10.1111/biom.13843] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2021] [Revised: 12/08/2022] [Accepted: 02/06/2023] [Indexed: 02/20/2023]
Abstract
Bayesian networks have been widely used to generate causal hypotheses from multivariate data. Despite their popularity, the vast majority of existing causal discovery approaches make the strong assumption of a (partially) homogeneous sampling scheme. However, such an assumption can be seriously violated, causing significant biases when the underlying population is inherently heterogeneous. To this end, we propose a novel causal Bayesian network model, termed BN-LTE, that embeds heterogeneous samples onto a low-dimensional manifold and builds Bayesian networks conditional on the embedding. This new framework allows for more precise network inference by improving the estimation resolution from the population level to the observation level. Moreover, while causal Bayesian networks are in general not identifiable with purely observational, cross-sectional data due to Markov equivalence, with the blessing of causal effect heterogeneity, we prove that the proposed BN-LTE is uniquely identifiable under relatively mild assumptions. Through extensive experiments, we demonstrate the superior performance of BN-LTE in causal structure learning as well as in inferring observation-specific gene regulatory networks from observational data.
Affiliation(s)
- Fangting Zhou
  - Institute of Statistics and Big Data, Renmin University of China, Beijing, China
  - Department of Statistics, Texas A&M University, College Station, Texas, USA
- Kejun He
  - Institute of Statistics and Big Data, Renmin University of China, Beijing, China
- Yang Ni
  - Department of Statistics, Texas A&M University, College Station, Texas, USA
3
Long JP, Zhu H, Do KA, Ha MJ. Estimating causal effects with hidden confounding using instrumental variables and environments. Electron J Stat 2023; 17:2849-2879. [PMID: 38957485 PMCID: PMC11219021 DOI: 10.1214/23-ejs2160] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2024]
Abstract
Recent works have proposed regression models which are invariant across data collection environments [24, 20, 11, 16, 8]. These estimators often have a causal interpretation under conditions on the environments and type of invariance imposed. One recent example, the Causal Dantzig (CD), is consistent under hidden confounding and represents an alternative to classical instrumental variable estimators such as Two Stage Least Squares (TSLS). In this work we derive the CD as a generalized method of moments (GMM) estimator. The GMM representation leads to several practical results, including 1) creation of the Generalized Causal Dantzig (GCD) estimator, which can be applied to problems with continuous environments where the CD cannot be fit; 2) a Hybrid (GCD-TSLS combination) estimator, which has properties superior to GCD or TSLS alone; and 3) straightforward asymptotic results for all methods using GMM theory. We compare the CD, GCD, TSLS, and Hybrid estimators in simulations and an application to a Flow Cytometry data set. The newly proposed GCD and Hybrid estimators have superior performance to existing methods in many settings.
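As a point of reference for the classical baseline named in this abstract, the following is a minimal sketch of Two Stage Least Squares on simulated data with a hidden confounder. It is not the Causal Dantzig, GCD, or Hybrid estimator from the paper; the helper name `two_stage_least_squares`, the simulated data-generating process, and all numeric values are assumptions made for illustration.

```python
import numpy as np

def two_stage_least_squares(X, y, Z):
    """Classical TSLS: project X onto the instruments Z, then regress y on the projection."""
    # Stage 1: fitted values of the regressors given the instruments.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Stage 2: ordinary least squares of y on the fitted values.
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Simulated data (every number below is an illustrative assumption).
rng = np.random.default_rng(0)
n = 20_000
Z = rng.normal(size=(n, 1))                        # instrument
H = rng.normal(size=n)                             # hidden confounder
X = (Z[:, 0] + H + rng.normal(size=n))[:, None]    # regressor driven by Z and H
y = 2.0 * X[:, 0] + 3.0 * H + rng.normal(size=n)   # true causal effect is 2.0

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]    # biased by the hidden confounder
beta_tsls = two_stage_least_squares(X, y, Z)       # uses only variation induced by Z
print("OLS :", beta_ols[0])
print("TSLS:", beta_tsls[0])
```

In this simulated setting the OLS coefficient absorbs the confounder's contribution, while the TSLS coefficient should land close to the true effect because the instrument is independent of the hidden variable.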
Affiliation(s)
- James P Long
  - Department of Biostatistics, University of Texas MD Anderson Cancer Center
- Hongxu Zhu
  - Department of Biostatistics, University of Texas, School of Public Health
- Kim-Anh Do
  - Department of Biostatistics, University of Texas MD Anderson Cancer Center
- Min Jin Ha
  - Department of Biostatistics, Graduate School of Public Health, Yonsei University
4
Xiong R, Koenecke A, Powell M, Shen Z, Vogelstein JT, Athey S. Federated causal inference in heterogeneous observational data. Stat Med 2023; 42:4418-4439. [PMID: 37553084 DOI: 10.1002/sim.9868] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2022] [Revised: 04/02/2023] [Accepted: 07/14/2023] [Indexed: 08/10/2023]
Abstract
We are interested in estimating the effect of a treatment applied to individuals at multiple sites, where data is stored locally for each site. Due to privacy constraints, individual-level data cannot be shared across sites; the sites may also have heterogeneous populations and treatment assignment mechanisms. Motivated by these considerations, we develop federated methods to draw inferences on the average treatment effects of combined data across sites. Our methods first compute summary statistics locally using propensity scores and then aggregate these statistics across sites to obtain point and variance estimators of average treatment effects. We show that these estimators are consistent and asymptotically normal. To achieve these asymptotic properties, we find that the aggregation schemes need to account for the heterogeneity in treatment assignments and in outcomes across sites. We demonstrate the validity of our federated methods through a comparative study of two large medical claims databases.
Affiliation(s)
- Ruoxuan Xiong
  - Department of Quantitative Theory and Methods, Emory University, Atlanta, Georgia, USA
- Allison Koenecke
  - Department of Information Science, Cornell University, Ithaca, New York, USA
- Michael Powell
  - Department of Mathematical Sciences, United States Military Academy, West Point, New York, USA
- Zhu Shen
  - Department of Biostatistics, Harvard University, Cambridge, Massachusetts, USA
- Joshua T Vogelstein
  - Department of Biomedical Engineering, Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Susan Athey
  - Graduate School of Business, Stanford University, Stanford, California, USA
5
Marmolejo‐Ramos F, Tejo M, Brabec M, Kuzilek J, Joksimovic S, Kovanovic V, González J, Kneib T, Bühlmann P, Kook L, Briseño‐Sánchez G, Ospina R. Distributional regression modeling via generalized additive models for location, scale, and shape: An overview through a data set from learning analytics. Wiley Interdisciplinary Reviews. Data Mining and Knowledge Discovery 2023; 13:e1479. [PMID: 37502671 PMCID: PMC10369920 DOI: 10.1002/widm.1479] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/15/2021] [Revised: 06/11/2022] [Accepted: 10/05/2022] [Indexed: 07/29/2023]
Abstract
Technological developments are making it possible to gather large amounts of data in several research fields. Learning analytics (LA)/educational data mining has access to big observational unstructured data captured from educational settings and relies mostly on unsupervised machine learning (ML) algorithms to make sense of this type of data. Generalized additive models for location, scale, and shape (GAMLSS) are a supervised statistical learning framework that allows modeling all the parameters of the distribution of the response variable with respect to the explanatory variables. This article overviews the power and flexibility of GAMLSS in relation to some ML techniques. GAMLSS' capability to be tailored toward causality via causal regularization is also briefly discussed. This overview is illustrated via a data set from the field of LA. This article is categorized under: Application Areas > Education and Learning; Algorithmic Development > Statistics; Technologies > Machine Learning.
Affiliation(s)
- Mauricio Tejo
  - Instituto de Estadística, Universidad de Valparaíso, Valparaíso, Chile
- Marek Brabec
  - Department of Statistical Modelling, Institute of Computer Science of the Czech Academy of Sciences, Prague, Czech Republic
- Jakub Kuzilek
  - Czech Institute of Informatics, Robotics and Cybernetics, CTU, Prague, Czech Republic
  - Computer Science Education/Computer Science and Society Research Group, Humboldt University of Berlin, Berlin, Germany
- Srecko Joksimovic
  - Centre for Change and Complexity in Learning, University of South Australia, Adelaide, Australia
- Vitomir Kovanovic
  - Centre for Change and Complexity in Learning, University of South Australia, Adelaide, Australia
- Jorge González
  - Departamento de Estadística, Pontificia Universidad Católica de Chile, Santiago de Chile, Chile
- Thomas Kneib
  - Campus Institute Data Science (CIDAS) and Chair of Statistics, Georg‐August‐Universität Göttingen, Göttingen, Germany
- Lucas Kook
  - Epidemiology, Biostatistics, and Prevention Institute, University of Zurich, Zurich, Switzerland
  - Institute of Data Analysis and Process Design, Zurich University of Applied Sciences, Winterthur, Switzerland
- Raydonal Ospina
  - Department of Statistics, CASTLab, Federal University of Pernambuco, Recife, Brazil
6
Christiansen R, Pfister N, Jakobsen ME, Gnecco N, Peters J. A Causal Framework for Distribution Generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:6614-6630. [PMID: 34232865 DOI: 10.1109/tpami.2021.3094760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
We consider the problem of predicting a response Y from a set of covariates X when test- and training distributions differ. Since such differences may have causal explanations, we consider test distributions that emerge from interventions in a structural causal model, and focus on minimizing the worst-case risk. Causal regression models, which regress the response on its direct causes, remain unchanged under arbitrary interventions on the covariates, but they are not always optimal in the above sense. For example, for linear models and bounded interventions, alternative solutions have been shown to be minimax prediction optimal. We introduce the formal framework of distribution generalization that allows us to analyze the above problem in partially observed nonlinear models for both direct interventions on X and interventions that occur indirectly via exogenous variables A. It takes into account that, in practice, minimax solutions need to be identified from data. Our framework allows us to characterize under which class of interventions the causal function is minimax optimal. We prove sufficient conditions for distribution generalization and present corresponding impossibility results. We propose a practical method, NILE, that achieves distribution generalization in a nonlinear IV setting with linear extrapolation. We prove consistency and present empirical results.
7
Li S, Sesia M, Romano Y, Candès E, Sabatti C. Searching for robust associations with a multi-environment knockoff filter. Biometrika 2022; 109:611-629. [PMID: 38633763 PMCID: PMC11022501 DOI: 10.1093/biomet/asab055] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 04/19/2024]
Abstract
This paper develops a method based on model-X knockoffs to find conditional associations that are consistent across environments, controlling the false discovery rate. The motivation for this problem is that large data sets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associations replicated under different conditions may be more interesting. In fact, consistency sometimes provably leads to valid causal inferences even if conditional associations do not. While the proposed method is widely applicable, this paper highlights its relevance to genome-wide association studies, in which robustness across populations with diverse ancestries mitigates confounding due to unmeasured variants. The effectiveness of this approach is demonstrated by simulations and applications to the UK Biobank data.
Affiliation(s)
- S Li
  - Department of Statistics, Stanford University, Stanford, California 94305, USA
- M Sesia
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, California 90089, USA
- Y Romano
  - Departments of Electrical Engineering and of Computer Science, Technion, Haifa, Israel
- E Candès
  - Department of Statistics, Stanford University, Stanford, California 94305, USA
- C Sabatti
  - Department of Statistics, Stanford University, Stanford, California 94305, USA
8
Kook L, Sick B, Bühlmann P. Distributional anchor regression. Statistics and Computing 2022; 32:39. [PMID: 35582000 PMCID: PMC9106647 DOI: 10.1007/s11222-022-10097-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2021] [Accepted: 04/12/2022] [Indexed: 06/15/2023]
Abstract
Prediction models often fail if train and test data do not stem from the same distribution. Out-of-distribution (OOD) generalization to unseen, perturbed test data is a desirable but difficult-to-achieve property for prediction models and in general requires strong assumptions on the data generating process (DGP). In a causally inspired perspective on OOD generalization, the test data arise from a specific class of interventions on exogenous random variables of the DGP, called anchors. Anchor regression models, introduced by Rothenhäusler et al. (J R Stat Soc Ser B 83(2):215-246, 2021. 10.1111/rssb.12398), protect against distributional shifts in the test data by employing causal regularization. However, so far anchor regression has only been used with a squared-error loss which is inapplicable to common responses such as censored continuous or ordinal data. Here, we propose a distributional version of anchor regression which generalizes the method to potentially censored responses with at least an ordered sample space. To this end, we combine a flexible class of parametric transformation models for distributional regression with an appropriate causal regularizer under a more general notion of residuals. In an exemplary application and several simulation scenarios we demonstrate the extent to which OOD generalization is possible.
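For orientation, the squared-error anchor regression that this abstract generalizes can be sketched as ordinary least squares on transformed data. The block below is a rough illustration of the linear, squared-error case only, not the distributional method proposed in the paper; the helper names `project` and `anchor_regression`, the simulated data, and the assumption of centered variables without an intercept are all choices made for this sketch.

```python
import numpy as np

def project(A, M):
    """Orthogonal projection of M onto the column space of the anchors A."""
    return A @ np.linalg.lstsq(A, M, rcond=None)[0]

def anchor_regression(X, y, A, gamma):
    """Minimize ||(I - P_A)(y - Xb)||^2 + gamma * ||P_A (y - Xb)||^2 over b.

    Applying W = I - (1 - sqrt(gamma)) * P_A to X and y turns this objective
    into a plain least-squares problem on the transformed data.
    """
    shrink = 1.0 - np.sqrt(gamma)
    X_t = X - shrink * project(A, X)
    y_t = y - shrink * project(A, y)
    return np.linalg.lstsq(X_t, y_t, rcond=None)[0]

# Simulated, mean-zero data (illustrative assumptions only).
rng = np.random.default_rng(0)
n = 500
A = rng.normal(size=(n, 1))                       # anchor (exogenous variable)
H = rng.normal(size=n)                            # hidden confounder
X = (A[:, 0] + H + rng.normal(size=n))[:, None]
y = 1.0 * X[:, 0] + H + rng.normal(size=n)

for gamma in (0.0, 1.0, 10.0):
    print(gamma, anchor_regression(X, y, A, gamma))
```

With `gamma = 1` the transformation is the identity and the estimate coincides with ordinary least squares; with `gamma = 0` the anchors are partialled out of both `X` and `y`; larger values of `gamma` penalize residual variation explained by the anchors more heavily.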
Affiliation(s)
- Lucas Kook
  - Epidemiology, Biostatistics and Prevention Institute, University of Zurich, 8001 Zurich, Switzerland
  - Institute of Data Analysis and Process Design, Zurich University of Applied Sciences, 8400 Winterthur, Switzerland
- Beate Sick
  - Epidemiology, Biostatistics and Prevention Institute, University of Zurich, 8001 Zurich, Switzerland
  - Institute of Data Analysis and Process Design, Zurich University of Applied Sciences, 8400 Winterthur, Switzerland
- Peter Bühlmann
  - Seminar for Statistics ETH Zurich, 8092 Zurich, Switzerland
9
Lund A, Wengel Mogensen S, Hansen NR. Soft Maximin Estimation for Heterogeneous Data. Scand Stat Theory Appl 2022. [DOI: 10.1111/sjos.12580] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/26/2022]
10
Zhao Y, Yu Y, Wang H, Li Y, Deng Y, Jiang G, Luo Y. Machine Learning in Causal Inference: Application in Pharmacovigilance. Drug Saf 2022; 45:459-476. [PMID: 35579811 PMCID: PMC9114053 DOI: 10.1007/s40264-022-01155-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Accepted: 02/09/2022] [Indexed: 01/28/2023]
Abstract
Monitoring adverse drug events or pharmacovigilance has been promoted by the World Health Organization to assure the safety of medicines through a timely and reliable information exchange regarding drug safety issues. We aim to discuss the application of machine learning methods as well as causal inference paradigms in pharmacovigilance. We first reviewed data sources for pharmacovigilance. Then, we examined traditional causal inference paradigms, their applications in pharmacovigilance, and how machine learning methods and causal inference paradigms were integrated to enhance the performance of traditional causal inference paradigms. Finally, we summarized issues with currently mainstream correlation-based machine learning models and how the machine learning community has tried to address these issues by incorporating causal inference paradigms. Our literature search revealed that most existing data sources and tasks for pharmacovigilance were not designed for causal inference. Additionally, pharmacovigilance was lagging in adopting machine learning-causal inference integrated models. We highlight several currently trending directions or gaps to integrate causal inference with machine learning in pharmacovigilance research. Finally, our literature search revealed that the adoption of causal paradigms can mitigate known issues with machine learning models. We foresee that the pharmacovigilance domain can benefit from the progress in the machine learning field.
Affiliation(s)
- Yiqing Zhao
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive, Room 11-189, Chicago, IL, 60611, USA
- Yue Yu
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, 55902, USA
- Hanyin Wang
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive, Room 11-189, Chicago, IL, 60611, USA
- Yikuan Li
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive, Room 11-189, Chicago, IL, 60611, USA
- Yu Deng
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive, Room 11-189, Chicago, IL, 60611, USA
- Guoqian Jiang
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, 55902, USA
- Yuan Luo
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive, Room 11-189, Chicago, IL, 60611, USA
11
Sippel S, Meinshausen N, Székely E, Fischer E, Pendergrass AG, Lehner F, Knutti R. Robust detection of forced warming in the presence of potentially large climate variability. Science Advances 2021; 7:eabh4429. [PMID: 34678070 PMCID: PMC8535853 DOI: 10.1126/sciadv.abh4429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2021] [Accepted: 09/02/2021] [Indexed: 06/13/2023]
Abstract
Climate warming is unequivocal and exceeds internal climate variability. However, estimates of the magnitude of decadal-scale variability from models and observations are uncertain, limiting determination of the fraction of warming attributable to external forcing. Here, we use statistical learning to extract a fingerprint of climate change that is robust to different model representations and magnitudes of internal variability. We find a best estimate forced warming trend of 0.8°C over the past 40 years, slightly larger than observed. It is extremely likely that at least 85% is attributable to external forcing based on the median variability across climate models. Detection remains robust even when evaluated against models with high variability and if decadal-scale variability were doubled. This work addresses a long-standing limitation in attributing warming to external forcing and opens up opportunities even in the case of large model differences in decadal-scale variability, model structural uncertainty, and limited observational records.
Affiliation(s)
- Sebastian Sippel
  - Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland
  - Seminar for Statistics, ETH Zurich, Zurich, Switzerland
- Enikő Székely
  - Swiss Data Science Center, ETH Zurich and EPFL, Lausanne, Switzerland
- Erich Fischer
  - Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland
- Angeline G. Pendergrass
  - Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland
  - Department of Earth and Atmospheric Sciences, Cornell University, Ithaca, NY 14850, USA
  - Climate and Global Dynamics Laboratory, National Center for Atmospheric Research, Boulder, CO 80305, USA
- Flavio Lehner
  - Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland
  - Department of Earth and Atmospheric Sciences, Cornell University, Ithaca, NY 14850, USA
  - Climate and Global Dynamics Laboratory, National Center for Atmospheric Research, Boulder, CO 80305, USA
- Reto Knutti
  - Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland
12
Erratum: Anchor regression: Heterogeneous data meet causality. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
13
Pfister N, Williams EG, Peters J, Aebersold R, Bühlmann P. Stabilizing variable selection and regression. Ann Appl Stat 2021. [DOI: 10.1214/21-aoas1487] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Affiliation(s)
- Niklas Pfister
  - Department of Mathematical Sciences, University of Copenhagen
- Evan G. Williams
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg
- Jonas Peters
  - Department of Mathematical Sciences, University of Copenhagen
14
Emmenegger C, Bühlmann P. Regularizing double machine learning in partially linear endogenous models. Electron J Stat 2021. [DOI: 10.1214/21-ejs1931] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]