1
Doutreligne M, Struja T, Abecassis J, Morgand C, Celi LA, Varoquaux G. Step-by-step causal analysis of EHRs to ground decision-making. PLOS DIGITAL HEALTH 2025; 4:e0000721. [PMID: 39899627 PMCID: PMC11790099 DOI: 10.1371/journal.pdig.0000721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Accepted: 12/10/2024] [Indexed: 02/05/2025]
Abstract
Causal inference enables machine learning methods to estimate treatment effects of medical interventions from electronic health records (EHRs). The prevalence of such observational data and the difficulty of covering all population/treatment relationships with randomized controlled trials (RCTs) make these methods increasingly attractive for studying causal effects. However, researchers should be wary of many pitfalls. We propose and illustrate a framework for causal inference by estimating the effect of albumin on mortality in sepsis using an intensive care database (MIMIC-IV) and comparing various sensitivity analyses with results from RCTs as the gold standard. The first step is study design, using the target trial concept and the PICOT framework: Population (patients with sepsis), Intervention (combination of crystalloids and albumin for fluid resuscitation), Control (crystalloids only), Outcome (28-day mortality), Time (intervention start within 24h of admission). We show that treatment-initiation times that are too large induce immortal time bias. The second step is selection of the confounding variables based on expert knowledge. Progressively adding confounders makes it possible to recover the RCT results from observational data. As the third step, we assess the influence of multiple models with varying assumptions, showing that a doubly robust estimator (AIPW) with random forests was the most reliable estimator. Results show that these steps are all important for valid causal estimates. A valid causal model can then be used to individualize decision-making: subgroup analyses showed that treatment efficacy of albumin was better for patients >60 years old, males, and patients with septic shock. Without causal thinking, machine learning is not enough for optimal clinical decision-making at the individual patient level. Our step-by-step analytic framework helps avoid many pitfalls of applying machine learning to EHR data, building models that avoid shortcuts and extract the best decision-making evidence.
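The doubly robust (AIPW) estimator named in this abstract can be illustrated with a minimal sketch. The abstract reports random forests as the nuisance models; as a deliberate simplification, the sketch below swaps in per-stratum means over a single discrete confounder. The function names (`aipw_ate`, `naive_difference`) and the toy data are hypothetical and are not taken from the paper.

```python
def naive_difference(rows):
    """Unadjusted difference in mean outcome between treated and control."""
    treated = [y for _, a, y in rows if a == 1]
    control = [y for _, a, y in rows if a == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def aipw_ate(rows):
    """AIPW estimate of the average treatment effect.

    rows: list of (x, a, y) with x a discrete confounder, a in {0, 1},
    y a numeric outcome. Assumption: nuisance models are stratum means
    (not random forests), and every stratum contains both arms.
    """
    # Fit the nuisance models: counts and outcome sums per stratum of x.
    strata = {}
    for x, a, y in rows:
        s = strata.setdefault(x, {"n": 0, "n1": 0, "sum1": 0.0, "sum0": 0.0})
        s["n"] += 1
        s["n1"] += a
        if a == 1:
            s["sum1"] += y
        else:
            s["sum0"] += y

    total = 0.0
    for x, a, y in rows:
        s = strata[x]
        e = s["n1"] / s["n"]                   # propensity score P(A=1 | x)
        mu1 = s["sum1"] / s["n1"]              # outcome model E[Y | A=1, x]
        mu0 = s["sum0"] / (s["n"] - s["n1"])   # outcome model E[Y | A=0, x]
        # Outcome-model contrast plus inverse-probability-weighted residuals.
        total += (mu1 - mu0) + a * (y - mu1) / e - (1 - a) * (y - mu0) / (1 - e)
    return total / len(rows)


# Hypothetical toy data: stratum x=1 is treated more often and has higher
# outcomes, so x confounds the naive comparison. True effect is 2 per stratum.
rows = [
    (0, 1, 3.0), (0, 0, 0.0), (0, 0, 1.0), (0, 0, 2.0),
    (1, 1, 5.0), (1, 1, 6.0), (1, 1, 7.0), (1, 0, 4.0),
]
naive = naive_difference(rows)
adjusted = aipw_ate(rows)
```

On this toy data the naive difference is 3.5, while the AIPW estimate is 2.0 (up to floating-point rounding), matching the within-stratum effect. With flexible learners such as random forests, the nuisance models would usually be fit with cross-fitting, which this sketch omits.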
Affiliation(s)
- Matthieu Doutreligne
- Soda Team, Inria Saclay, Palaiseau, France
- Mission Data, Haute Autorité de Santé, Saint-Denis, France
- Tristan Struja
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Medical University Clinic, Division of Endocrinology, Diabetes & Metabolism, Kantonsspital Aarau, Aarau, Switzerland
- Claire Morgand
- Agence Régionale de Santé Ile-de-France, Saint-Denis, France
- Leo Anthony Celi
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
2
Burstyn I. The mockery that confounds better treatment of confounding in epidemiology: The change in estimate fallacy. GLOBAL EPIDEMIOLOGY 2024; 8:100166. [PMID: 39410942 PMCID: PMC11474205 DOI: 10.1016/j.gloepi.2024.100166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2024] [Revised: 08/15/2024] [Accepted: 09/25/2024] [Indexed: 10/19/2024]
Abstract
Confounding is one of the most infamous bugbears of epidemiology, used by some to dismiss the field's utility outright. The subject has received considerable attention from epidemiologists and the field boasts a remarkable arsenal for addressing the issue. However, it appears that there are still misconceptions about how to identify variables that cause confounding (a lack of exchangeability) in epidemiologic practice. In this commentary, I examine whether analysis of the properties of the change-in-estimate method for identifying confounding, exemplified by two highly cited papers, has been appropriately cited in published reports and whether it has been used to improve epidemiologic practice. I conclude that the myth that a change-in-estimate criterion of 10% is legitimate for identifying confounding persists in epidemiological practice, despite having been discredited by several independent research groups decades ago. Speculations on possible solutions to this problem are offered, but my work's main contribution is identification of a problem of how methodological advances in epidemiology may be misapplied. No universal criteria for identifying confounding currently exist! "Citation without representation" or biased presentation of conclusions of methodological research may be pervasive.
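One well-documented reason a fixed change-in-estimate threshold can mislead is noncollapsibility of the odds ratio; this is offered as an illustration and is not necessarily the argument the commentary itself develops. In the sketch below, a covariate `z` is independent of exposure by construction, so it cannot confound, yet adjusting for it shifts the odds ratio by more than 10%. All risks and names are hypothetical.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def odds_ratio(p_exposed, p_unexposed):
    """Odds ratio comparing exposed with unexposed outcome risk."""
    return odds(p_exposed) / odds(p_unexposed)


# Hypothetical outcome risks by stratum of z: z -> (exposed, unexposed).
risk = {0: (0.5, 0.2), 1: (0.8, 0.5)}

# z has the same distribution in both exposure groups (as under
# randomization), so it is not a confounder of the exposure effect.
p_z1 = 0.5

# Stratum-specific (adjusted) odds ratios: identical in both strata.
conditional_or = {z: odds_ratio(pe, pu) for z, (pe, pu) in risk.items()}

# Crude odds ratio from risks averaged over the distribution of z.
crude_exposed = (1 - p_z1) * risk[0][0] + p_z1 * risk[1][0]
crude_unexposed = (1 - p_z1) * risk[0][1] + p_z1 * risk[1][1]
crude_or = odds_ratio(crude_exposed, crude_unexposed)

# Relative change-in-estimate between the crude and adjusted odds ratios.
relative_change = abs(conditional_or[0] - crude_or) / crude_or
flagged_by_10_percent_rule = relative_change > 0.10
```

Here both stratum-specific odds ratios equal 4, the crude odds ratio is about 3.45, and the relative change is about 16%, so a 10% rule would label `z` a confounder even though it is unrelated to exposure.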
Affiliation(s)
- Igor Burstyn
- Department of Environmental and Occupational Health, Dornsife School of Public Health, Drexel University, Philadelphia, PA, United States of America
3
Song J, Yang X, Wu J, Wu Z, Zhuo L, Hong J, Su L, Lyu W, Ye J, Fang Y, Zhan Z, Zhang H, Li X. Could nutrition status predict fatigue one week before in patients with nasopharynx cancer undergoing radiotherapy? Cancer Med 2024; 13:e7191. [PMID: 38659395 PMCID: PMC11043677 DOI: 10.1002/cam4.7191] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 03/15/2024] [Accepted: 04/03/2024] [Indexed: 04/26/2024]
Abstract
BACKGROUND: Patients with nasopharyngeal carcinoma (NPC) undergoing radiotherapy experience significant fatigue, which is frequently underestimated due to the lack of objective indicators for its evaluation. This study aimed to explore the longitudinal association between fatigue and nutrition status 1 week in advance. METHODS: From January 2021 to June 2022, a total of 105 NPC patients who received intensity-modulated radiation therapy were enrolled in the observational longitudinal study. The outcomes, including the Piper Fatigue Scale-12 (PFS-12), the Scored Patient-Generated Subjective Global Assessment (PG-SGA), four body composition indices, and the Hospital Anxiety and Depression Scale (HADS), were assessed weekly from pre-treatment until the completion of radiotherapy (T0-T7) to explore their relationship. RESULTS: The trajectories of PFS-12 and all its dimensions for the 105 participants peaked during the fifth week. Sensory fatigue consistently received the highest scores (T0 = 1.60 ± 2.20, T5 = 6.15 ± 1.57), whereas behavior fatigue exhibited the fastest increase over time (T0 = 1.11 ± 1.86, T5 = 5.47 ± 1.70). Higher PG-SGA scores were weakly associated with worsening fatigue (β = 0.02 to 0.04). Compared with generalized additive mixed models, marginal structural models (MSMs) produced larger effect estimates (β = 0.12 to 0.21). Additionally, body composition indices showed weakly negative relationships with fatigue one week later in MSMs. CONCLUSIONS: The PG-SGA may be a more accurate predictor of future-week fatigue than individual body composition indicators, particularly when HADS is controlled for as a time-dependent confounder.
Affiliation(s)
- Jihong Song
- School of Nursing, Health Science Center, Xi'an Jiaotong University, Xi'an, China
- School of Nursing, Fujian Medical University, Fuzhou, China
- Xinru Yang
- School of Nursing, Fujian Medical University, Fuzhou, China
- Jieling Wu
- School of Nursing, Fujian Medical University, Fuzhou, China
- Zilan Wu
- School of Nursing, Fujian Medical University, Fuzhou, China
- Litao Zhuo
- School of Nursing, Fujian Medical University, Fuzhou, China
- Jinsheng Hong
- Department of Radiotherapy, Cancer Center, the First Affiliated Hospital of Fujian Medical University, Fuzhou, China
- Department of Radiotherapy, National Regional Medical Center, Binhai Campus of the First Affiliated Hospital, Fujian Medical University, Fuzhou, China
- Key Laboratory of Radiation Biology of Fujian Higher Education Institutions, The First Affiliated Hospital, Fujian Medical University, Fuzhou, China
- Li Su
- Department of Radiotherapy, Cancer Center, the First Affiliated Hospital of Fujian Medical University, Fuzhou, China
- Department of Radiotherapy, National Regional Medical Center, Binhai Campus of the First Affiliated Hospital, Fujian Medical University, Fuzhou, China
- Key Laboratory of Radiation Biology of Fujian Higher Education Institutions, The First Affiliated Hospital, Fujian Medical University, Fuzhou, China
- Wenlong Lyu
- Department of Radiotherapy, Cancer Center, the First Affiliated Hospital of Fujian Medical University, Fuzhou, China
- Department of Radiotherapy, National Regional Medical Center, Binhai Campus of the First Affiliated Hospital, Fujian Medical University, Fuzhou, China
- Key Laboratory of Radiation Biology of Fujian Higher Education Institutions, The First Affiliated Hospital, Fujian Medical University, Fuzhou, China
- Jinru Ye
- Department of Radiotherapy, Cancer Center, the First Affiliated Hospital of Fujian Medical University, Fuzhou, China
- Department of Radiotherapy, National Regional Medical Center, Binhai Campus of the First Affiliated Hospital, Fujian Medical University, Fuzhou, China
- Key Laboratory of Radiation Biology of Fujian Higher Education Institutions, The First Affiliated Hospital, Fujian Medical University, Fuzhou, China
- Yan Fang
- Nursing Department, The First Affiliated Hospital of Fujian Medical University, Fuzhou, China
- Zhiying Zhan
- Department of Epidemiology and Health Statistics, Fujian Provincial Key Laboratory of Environment Factors and Cancer, School of Public Health, Fujian Medical University, Fuzhou, China
- Hairong Zhang
- Fujian Center for Disease Control and Prevention, Fuzhou, China
- Xiaomei Li
- School of Nursing, Health Science Center, Xi'an Jiaotong University, Xi'an, China
4
Cheng D, Li J, Liu L, Yu K, Duy Le T, Liu J. Toward Unique and Unbiased Causal Effect Estimation From Data With Hidden Variables. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:6108-6120. [PMID: 34995195 DOI: 10.1109/tnnls.2021.3133337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Causal effect estimation from observational data is a crucial but challenging task. Currently, only a limited number of data-driven causal effect estimation methods are available. These methods either provide only bounds on the causal effect of the treatment on the outcome, or generate a unique estimate of the causal effect but make strong assumptions about the data and have low efficiency. In this article, we identify a problem setting with the Cause Or Spouse of the treatment Only (COSO) variable assumption and propose an approach to achieving a unique and unbiased estimation of causal effects from data with hidden variables. For the approach, we develop theorems to support the discovery of the proper covariate sets for confounding adjustment (adjustment sets). Based on the theorems, two algorithms are proposed for finding the proper adjustment sets from data with hidden variables to obtain unbiased and unique causal effect estimation. Experiments with synthetic datasets generated using five benchmark Bayesian networks and four real-world datasets have demonstrated the efficiency and effectiveness of the proposed algorithms, indicating the practicability of the identified problem setting and the potential of the proposed approach in real-world applications.
5
Dhillon SK, Ganggayah MD, Sinnadurai S, Lio P, Taib NA. Theory and Practice of Integrating Machine Learning and Conventional Statistics in Medical Data Analysis. Diagnostics (Basel) 2022; 12:2526. [PMID: 36292218 PMCID: PMC9601117 DOI: 10.3390/diagnostics12102526] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/19/2022] [Revised: 09/26/2022] [Accepted: 10/04/2022] [Indexed: 11/16/2022]
Abstract
The practice of medical decision making is changing rapidly with the development of innovative computing technologies. The growing interest in data analysis, together with improvements in big data processing methods, raises the question of whether machine learning can be integrated with conventional statistics in health research. To help address this knowledge gap, this paper presents a review of the conceptual integration between conventional statistics and machine learning, focusing on health research. The similarities and differences between the two are compared using mathematical concepts and algorithms. The comparison indicates that conventional statistics is the fundamental basis of machine learning: the black box algorithms are derived from basic mathematics, but are more advanced in terms of automated analysis, handling big data, and providing interactive visualizations. While the two methods differ in nature, they are conceptually similar. Based on our review, we conclude that conventional statistics and machine learning are best integrated to develop automated data analysis tools. We also strongly believe that health researchers could explore machine learning to enhance conventional statistics in decision making by adding reliable validation measures.
Affiliation(s)
- Sarinder Kaur Dhillon
- Data Science & Bioinformatics Laboratory, Institute of Biological Sciences, Faculty of Science, Universiti Malaya, Kuala Lumpur 50603, Malaysia
- Mogana Darshini Ganggayah
- Department of Econometrics and Business Statistics, School of Business, Monash University Malaysia, Kuala Lumpur 47500, Malaysia
- Siamala Sinnadurai
- Department of Population Medicine and Lifestyle Disease Prevention, Medical University of Bialystok, 15-269 Bialystok, Poland
- Pietro Lio
- Department of Computer Science and Technology, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK
- Nur Aishah Taib
- Department of Surgery, Faculty of Medicine, Universiti Malaya, Kuala Lumpur 50603, Malaysia
7
Cheng D, Li J, Liu L, Le TD, Liu J, Yu K. Sufficient dimension reduction for average causal effect estimation. Data Min Knowl Discov 2022. [DOI: 10.1007/s10618-022-00832-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
A large number of covariates can have a negative impact on the quality of causal effect estimation since confounding adjustment becomes unreliable when the number of covariates is large relative to the number of samples. Propensity score is a common way to deal with a large covariate set, but the accuracy of propensity score estimation (normally done by logistic regression) is also challenged by the large number of covariates. In this paper, we prove that a large covariate set can be reduced to a lower dimensional representation which captures the complete information for adjustment in causal effect estimation. The theoretical result enables effective data-driven algorithms for causal effect estimation. Supported by the result, we develop an algorithm that employs a supervised kernel dimension reduction method to learn a lower dimensional representation from the original covariate space, and then utilises nearest neighbour matching in the reduced covariate space to impute the counterfactual outcomes to avoid the large sized covariate set problem. The proposed algorithm is evaluated on two semisynthetic and three real-world datasets and the results show the effectiveness of the proposed algorithm.
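The matching step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the supervised kernel dimension reduction is omitted entirely, and each unit is assumed to already carry a one-dimensional reduced covariate `z`. The function name `matching_ate` and the data are hypothetical, not the authors' algorithm.

```python
def matching_ate(units):
    """Average causal effect via 1-nearest-neighbour matching.

    units: list of (z, treated, y), where z is an already-reduced
    one-dimensional covariate (assumed given), treated is 0 or 1,
    and y is the observed outcome. Both arms must be non-empty.
    """
    treated_pool = [(z, y) for z, t, y in units if t == 1]
    control_pool = [(z, y) for z, t, y in units if t == 0]

    def nearest_outcome(z, pool):
        # Outcome of the unit in the opposite arm closest to z.
        return min(pool, key=lambda item: abs(item[0] - z))[1]

    effects = []
    for z, t, y in units:
        if t == 1:
            # Impute the missing control outcome from the nearest control.
            effects.append(y - nearest_outcome(z, control_pool))
        else:
            # Impute the missing treated outcome from the nearest treated unit.
            effects.append(nearest_outcome(z, treated_pool) - y)
    return sum(effects) / len(effects)


# Hypothetical units: (reduced covariate, treatment indicator, outcome).
units = [
    (0.1, 1, 5.0),
    (0.2, 0, 3.0),
    (0.9, 1, 9.0),
    (1.0, 0, 7.0),
]
ate = matching_ate(units)
```

On this toy data every unit is matched to a neighbour whose outcome differs by 2, so the estimate is 2.0. In practice the quality of the match depends on how well the reduced representation preserves the information needed for adjustment, which is the point of the paper's theoretical result.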
8
Sun X, Wang L, Li H, Jin C, Yu Y, Hou L, Liu X, Yu Y, Yan R, Xue F. Identification of microenvironment related potential biomarkers of biochemical recurrence at 3 years after prostatectomy in prostate adenocarcinoma. Aging (Albany NY) 2021; 13:16024-16042. [PMID: 34133324 PMCID: PMC8266350 DOI: 10.18632/aging.203121] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 11/30/2020] [Accepted: 05/11/2021] [Indexed: 12/15/2022]
Abstract
Prostate adenocarcinoma is one of the leading adult malignancies. Identification of multiple causative biomarkers is necessary and helpful for determining the occurrence and prognosis of prostate adenocarcinoma. We aimed to identify the potential prognostic genes in the prostate adenocarcinoma microenvironment and to estimate the causal effects simultaneously. We obtained the gene expression data of prostate adenocarcinoma from the TCGA project and identified the differentially expressed genes based on immune-stromal components. Among these genes, 68 were associated with biochemical recurrence at 3 years after prostatectomy in prostate adenocarcinoma. After adjusting for the minimal sets of confounding covariates, 14 genes (TNFRSF4, ZAP70, ERMN, CXCL5, SPINK6, SLC6A18, CHRM2, TG, CLLU1OS, POSTN, CTSG, NETO1, CEACAM7, and IGLV3-22) related to the microenvironment were identified as prognostic biomarkers using targeted maximum likelihood estimation. Both the average and individual causal effects were obtained to measure the magnitude of the effect. CIBERSORT and gene set enrichment analyses showed that these prognostic genes were mainly associated with immune responses. POSTN and NETO1 were correlated with androgen receptor expression, a main driver of prostate adenocarcinoma progression. Finally, five genes were validated in another prostate adenocarcinoma cohort (GEO: GSE70770). These findings might lead to improved prognosis of prostate adenocarcinoma.
Affiliation(s)
- Xiaoru Sun
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China
- Lu Wang
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China
- Hongkai Li
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China
- Chuandi Jin
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China
- Yuanyuan Yu
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China
- Lei Hou
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China
- Xinhui Liu
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China
- Yifan Yu
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China
- Ran Yan
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China
- Fuzhong Xue
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China; Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China