1
Inoue K, Adomi M, Efthimiou O, Komura T, Omae K, Onishi A, Tsutsumi Y, Fujii T, Kondo N, Furukawa TA. Machine learning approaches to evaluate heterogeneous treatment effects in randomized controlled trials: a scoping review. J Clin Epidemiol 2024; 176:111538. [PMID: 39305940 DOI: 10.1016/j.jclinepi.2024.111538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 09/06/2024] [Accepted: 09/16/2024] [Indexed: 10/20/2024]
Abstract
BACKGROUND AND OBJECTIVES Estimating heterogeneous treatment effects (HTEs) in randomized controlled trials (RCTs) has received substantial attention recently. This has led to the development of several statistical and machine learning (ML) algorithms to assess HTEs through identifying individualized treatment effects. However, a comprehensive review of these algorithms is lacking. We thus aimed to catalog and outline currently available statistical and ML methods for identifying HTEs via effect modeling using clinical RCT data and summarize how they have been applied in practice. STUDY DESIGN AND SETTING We performed a scoping review using prespecified search terms in MEDLINE and Embase, aiming to identify studies that assessed HTEs using advanced statistical and ML methods in RCT data published from 2010 to 2022. RESULTS Among a total of 32 studies identified in the review, 17 studies applied existing algorithms to RCT data, and 15 extended existing algorithms or proposed new algorithms. Applied algorithms included penalized regression, causal forest, Bayesian causal forest, and other metalearner frameworks. Of these methods, causal forest was the most frequently used (7 studies) followed by Bayesian causal forest (4 studies). Most applications were in cardiology (6 studies), followed by psychiatry (4 studies). We provide example R codes in simulated data to illustrate how to implement these algorithms. CONCLUSION This review identified and outlined various algorithms currently used to identify HTEs and individualized treatment effects in RCT data. Given the increasing availability of new algorithms, analysts should carefully select them after examining model performance and considering how the models will be used in practice.
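The abstract above lists metalearner frameworks among the applied algorithms. As a rough illustration of the idea (not the review's own R code), the following minimal sketch implements a T-learner: one outcome model is fitted per trial arm, and the difference in predictions is taken as the individualized treatment effect. The simulated data are noise-free, and the helper names `fit_line` and `t_learner` are hypothetical, chosen only for this example.

```python
# Illustrative T-learner on simulated, noise-free trial data.
# All function and variable names here are hypothetical.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def t_learner(x, treated, y):
    """Fit one outcome model per arm; return a function estimating the effect at x_new."""
    a1, b1 = fit_line([xi for xi, t in zip(x, treated) if t == 1],
                      [yi for yi, t in zip(y, treated) if t == 1])
    a0, b0 = fit_line([xi for xi, t in zip(x, treated) if t == 0],
                      [yi for yi, t in zip(y, treated) if t == 0])
    return lambda x_new: (a1 + b1 * x_new) - (a0 + b0 * x_new)


# Simulated trial: alternating assignment; the true effect 0.5 + x grows with the covariate.
x = [i / 10 for i in range(40)]
treated = [i % 2 for i in range(40)]
y = [1.0 + 2.0 * xi + ti * (0.5 + xi) for xi, ti in zip(x, treated)]

cate = t_learner(x, treated, y)
print(round(cate(0.0), 6), round(cate(2.0), 6))
```

In practice the per-arm linear fits would be replaced by flexible learners (for example, forests or penalized regression), and real data would include noise, so the recovered effects would only approximate the truth.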
Affiliation(s)
- Kosuke Inoue
  - Department of Social Epidemiology, Graduate School of Medicine, Kyoto University, Kyoto, Japan; Hakubi Center, Kyoto University, Kyoto, Japan
- Motohiko Adomi
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Orestis Efthimiou
  - Institute of Primary Health Care (BIHAM), University of Bern, Bern, Switzerland; Institute of Social and Preventive Medicine (ISPM), University of Bern, Bern, Switzerland
- Toshiaki Komura
  - Department of Epidemiology, School of Public Health, Boston University, Boston, MA, USA
- Kenji Omae
  - Department of Innovative Research and Education for Clinicians and Trainees, Fukushima Medical University Hospital, Fukushima, Japan; Center for Innovative Research for Communities and Clinical Excellence, Fukushima Medical University, Fukushima, Japan
- Akira Onishi
  - Department of Advanced Medicine for Rheumatic Diseases, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Yusuke Tsutsumi
  - Human Health Sciences, Kyoto University Graduate School of Medicine, Kyoto, Japan; Department of Emergency Medicine, National Hospital Organization Mito Medical Center, Ibaraki, Japan
- Tomoko Fujii
  - Intensive Care Unit, Jikei University Hospital, Tokyo, Japan; Departments of Health Promotion and Human Behavior and of Clinical Epidemiology, Kyoto University Graduate School of Medicine/School of Public Health, Kyoto, Japan
- Naoki Kondo
  - Department of Social Epidemiology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Toshi A Furukawa
  - Departments of Health Promotion and Human Behavior and of Clinical Epidemiology, Kyoto University Graduate School of Medicine/School of Public Health, Kyoto, Japan
2
Han S, Goh J, Meng F, Leow MKS, Rubin DB. Contrast-specific propensity scores for causal inference with multiple interventions. Stat Methods Med Res 2024; 33:825-837. [PMID: 38499338 DOI: 10.1177/09622802241236952] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2024]
Abstract
Existing methods that use propensity scores for heterogeneous treatment effect estimation on non-experimental data do not readily extend to the case of more than two treatment options. In this work, we develop a new propensity score-based method for heterogeneous treatment effect estimation when there are three or more treatment options, and prove that it generates unbiased estimates. We demonstrate our method on a real patient registry of patients in Singapore with diabetic dyslipidemia. On this dataset, our method generates heterogeneous treatment recommendations for patients among three options: Statins, fibrates, and non-pharmacological treatment to control patients' lipid ratios (total cholesterol divided by high-density lipoprotein level). In our numerical study, our proposed method generated more stable estimates compared to a benchmark method based on a multi-dimensional propensity score.
Affiliation(s)
- Shasha Han
  - School of Population Medicine and Public Health, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Joel Goh
  - NUS Business School, National University of Singapore, Singapore
  - Global Asia Institute, National University of Singapore, Singapore
  - Institute of Operations Research and Analytics, National University of Singapore, Singapore
- Fanwen Meng
  - Department of Health Services & Outcomes Research, National Healthcare Group, Singapore
- Melvin Khee-Shing Leow
  - Cardiovascular & Metabolic Disorders Programme, Duke-NUS Medical School, Singapore
  - Department of Endocrinology, Tan Tock Seng Hospital, Singapore
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- Donald B Rubin
  - Department of Statistics, Harvard University, Cambridge, MA, USA
  - Department of Statistical Science, Fox Business School, Temple University, Philadelphia, PA, USA
  - Yau Mathematical Center, Tsinghua University, Beijing, China
3
Hu L. A new method for clustered survival data: Estimation of treatment effect heterogeneity and variable selection. Biom J 2024; 66:e2200178. [PMID: 38072661 PMCID: PMC10953775 DOI: 10.1002/bimj.202200178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2022] [Revised: 07/31/2023] [Accepted: 08/11/2023] [Indexed: 01/30/2024]
Abstract
We recently developed a new method, the random-intercept accelerated failure time model with Bayesian additive regression trees (riAFT-BART), to draw causal inferences about the population treatment effect on patient survival from clustered and censored survival data while accounting for the multilevel data structure. The practical utility of this method goes beyond the estimation of the population average treatment effect. In this work, we exposit how riAFT-BART can be used to solve two important statistical questions with clustered survival data: estimating the treatment effect heterogeneity and variable selection. Leveraging likelihood-based machine learning, we describe a way in which we can draw posterior samples of the individual survival treatment effect from riAFT-BART model runs, and use the drawn posterior samples to perform an exploratory treatment effect heterogeneity analysis to identify subpopulations who may experience treatment effects that differ from the population average effect. There is sparse literature on methods for variable selection among clustered and censored survival data, particularly ones using flexible modeling techniques. We propose a permutation-based approach using the predictor's variable inclusion proportion supplied by the riAFT-BART model for variable selection. To address the missing data issue frequently encountered in health databases, we propose a strategy to combine bootstrap imputation and riAFT-BART for variable selection among incomplete clustered survival data. We conduct an expansive simulation study to examine the practical operating characteristics of our proposed methods, and provide empirical evidence that our proposed methods perform better than several existing methods across a wide range of data scenarios. Finally, we demonstrate the methods via a case study of predictors for in-hospital mortality among severe COVID-19 patients and estimating the heterogeneous treatment effects of three COVID-specific medications. The methods developed in this work are readily available in the R package riAFTBART.
Affiliation(s)
- Liangyuan Hu
  - Department of Biostatistics and Epidemiology, Rutgers University, Piscataway, New Jersey 08854
4
Hu L, Li L. Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series. Int J Environ Res Public Health 2022; 19:16080. [PMID: 36498153 PMCID: PMC9736500 DOI: 10.3390/ijerph192316080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2022] [Revised: 11/22/2022] [Accepted: 11/24/2022] [Indexed: 06/17/2023]
Abstract
Tree-based machine learning methods have gained traction in the statistical and data science fields. They have been shown to provide better solutions to various research questions than traditional analysis approaches. To encourage the uptake of tree-based methods in health research, we review the methodological fundamentals of three key tree-based machine learning methods: random forests, extreme gradient boosting and Bayesian additive regression trees. We further conduct a series of case studies to illustrate how these methods can be properly used to solve important health research problems in four domains: variable selection, estimation of causal effects, propensity score weighting and missing data. We exposit that the central idea of using ensemble tree methods for these research questions is accurate prediction via flexible modeling. We applied ensemble trees methods to select important predictors for the presence of postoperative respiratory complication among early stage lung cancer patients with resectable tumors. We then demonstrated how to use these methods to estimate the causal effects of popular surgical approaches on postoperative respiratory complications among lung cancer patients. Using the same data, we further implemented the methods to accurately estimate the inverse probability weights for a propensity score analysis of the comparative effectiveness of the surgical approaches. Finally, we demonstrated how random forests can be used to impute missing data using the Study of Women's Health Across the Nation data set. To conclude, the tree-based methods are a flexible tool and should be properly used for health investigations.
Affiliation(s)
- Liangyuan Hu
  - Department of Biostatistics and Epidemiology, Rutgers University, Piscataway, NJ 08854, USA
- Lihua Li
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
5
Hu L, Ji J, Ennis RD, Hogan JW. A flexible approach for causal inference with multiple treatments and clustered survival outcomes. Stat Med 2022; 41:4982-4999. [PMID: 35948011 PMCID: PMC9588538 DOI: 10.1002/sim.9548] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/16/2022] [Revised: 07/20/2022] [Accepted: 07/22/2022] [Indexed: 01/07/2023]
Abstract
When drawing causal inferences about the effects of multiple treatments on clustered survival outcomes using observational data, we need to address implications of the multilevel data structure, multiple treatments, censoring, and unmeasured confounding for causal analyses. Few off-the-shelf causal inference tools are available to simultaneously tackle these issues. We develop a flexible random-intercept accelerated failure time model, in which we use Bayesian additive regression trees to capture arbitrarily complex relationships between censored survival times and pre-treatment covariates and use the random intercepts to capture cluster-specific main effects. We develop an efficient Markov chain Monte Carlo algorithm to draw posterior inferences about the population survival effects of multiple treatments and examine the variability in cluster-level effects. We further propose an interpretable sensitivity analysis approach to evaluate the sensitivity of drawn causal inferences about treatment effect to the potential magnitude of departure from the causal assumption of no unmeasured confounding. Expansive simulations empirically validate and demonstrate good practical operating characteristics of our proposed methods. Applying the proposed methods to a dataset on older high-risk localized prostate cancer patients drawn from the National Cancer Database, we evaluate the comparative effects of three treatment approaches on patient survival, and assess the ramifications of potential unmeasured confounding. The methods developed in this work are readily available in the R package riAFTBART.
Affiliation(s)
- Liangyuan Hu
  - Department of Biostatistics and Epidemiology, Rutgers University, Piscataway, New Jersey, USA
- Jiayi Ji
  - Department of Biostatistics and Epidemiology, Rutgers University, Piscataway, New Jersey, USA
- Ronald D. Ennis
  - Department of Radiation Oncology, Cancer Institute of New Jersey of Rutgers University, New Brunswick, New Jersey, USA
- Joseph W. Hogan
  - Department of Biostatistics, Brown University, Providence, Rhode Island, USA
6
Hu L, Ji J. CIMTx: An R Package for Causal Inference with Multiple Treatments using Observational Data. The R Journal 2022; 14:213-230. [PMID: 39310290 PMCID: PMC11415261 DOI: 10.32614/rj-2022-058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
Abstract
CIMTx provides efficient and unified functions to implement modern methods for causal inferences with multiple treatments using observational data with a focus on binary outcomes. The methods include regression adjustment, inverse probability of treatment weighting, Bayesian additive regression trees, regression adjustment with multivariate spline of the generalized propensity score, vector matching and targeted maximum likelihood estimation. In addition, CIMTx illustrates ways in which users can simulate data adhering to the complex data structures in the multiple treatment setting. Furthermore, the CIMTx package offers a unique set of features to address the key causal assumptions: positivity and ignorability. For the positivity assumption, CIMTx demonstrates techniques to identify the common support region for retaining inferential units using inverse probability of treatment weighting, Bayesian additive regression trees and vector matching. To handle the ignorability assumption, CIMTx provides a flexible Monte Carlo sensitivity analysis approach to evaluate how causal conclusions would be altered in response to different magnitude of departure from ignorable treatment assignment.
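One of the methods named above, inverse probability of treatment weighting, can be sketched in a few lines for the multiple-treatment, binary-outcome setting. The example below is only an illustration in Python and does not use or mirror the CIMTx API (CIMTx is an R package). It assumes the generalized propensity scores are already known, and the function name `iptw_means` and the toy data are hypothetical.

```python
# Illustrative normalized (Hajek-type) IPTW with three treatments and a binary outcome.
# Assumption: gps[i][a] is the known probability that unit i receives treatment a.

def iptw_means(treatment, outcome, gps):
    """Return the weighted mean outcome for each treatment level."""
    weighted_sum, weight_total = {}, {}
    for a, y, p in zip(treatment, outcome, gps):
        w = 1.0 / p[a]  # inverse probability of the treatment actually received
        weighted_sum[a] = weighted_sum.get(a, 0.0) + w * y
        weight_total[a] = weight_total.get(a, 0.0) + w
    return {a: weighted_sum[a] / weight_total[a] for a in weighted_sum}


# Toy data: six units, treatments coded 0, 1, 2.
treatment = [0, 0, 1, 1, 2, 2]
outcome = [1, 0, 1, 1, 0, 1]
gps = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.50, 0.25],
    [0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50],
    [0.25, 0.25, 0.50],
]

means = iptw_means(treatment, outcome, gps)
risk_difference_1_vs_0 = means[1] - means[0]
print(means, risk_difference_1_vs_0)
```

In a real analysis the propensity scores would be estimated (for example, with multinomial regression or a tree ensemble), and units outside the common support region would be handled before weighting.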
Affiliation(s)
- Liangyuan Hu
  - Rutgers University School of Public Health, Department of Biostatistics and Epidemiology, 683 Hoes Lane West, Piscataway, NJ 08854, United States of America
- Jiayi Ji
  - Rutgers University School of Public Health, Department of Biostatistics and Epidemiology, 683 Hoes Lane West, Piscataway, NJ 08854, United States of America
7
Xu J, Guo Y, Wang F, Xu H, Lucero R, Bian J, Prosperi M. Protocol for the development of a reporting guideline for causal and counterfactual prediction models in biomedicine. BMJ Open 2022; 12:e059715. [PMID: 35725267 PMCID: PMC9214357 DOI: 10.1136/bmjopen-2021-059715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
INTRODUCTION While there are guidelines for reporting on observational studies (eg, Strengthening the Reporting of Observational Studies in Epidemiology, Reporting of Studies Conducted Using Observational Routinely Collected Health Data Statement), estimation of causal effects from both observational data and randomised experiments (eg, A Guideline for Reporting Mediation Analyses of Randomised Trials and Observational Studies, Consolidated Standards of Reporting Trials, PATH) and on prediction modelling (eg, Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis), none is purposely made for deriving and validating models from observational data to predict counterfactuals for individuals on one or more possible interventions, on the basis of given (or inferred) causal structures. This paper describes methods and processes that will be used to develop a Reporting Guideline for Causal and Counterfactual Prediction Models (PRECOG). METHODS AND ANALYSIS PRECOG will be developed following published guidance from the Enhancing the Quality and Transparency of Health Research (EQUATOR) network and will comprise five stages. Stage 1 will be meetings of a working group every other week with rotating external advisors (active until stage 5). Stage 2 will comprise a systematic review of literature on counterfactual prediction modelling for biomedical sciences (registered in Prospective Register of Systematic Reviews). In stage 3, a computer-based, real-time Delphi survey will be performed to consolidate the PRECOG checklist, involving experts in causal inference, epidemiology, statistics, machine learning, informatics and protocols/standards. Stage 4 will involve the write-up of the PRECOG guideline based on the results from the prior stages. Stage 5 will seek the peer-reviewed publication of the guideline, the scoping/systematic review and dissemination. 
ETHICS AND DISSEMINATION The study will follow the principles of the Declaration of Helsinki. The study has been registered in EQUATOR and approved by the University of Florida's Institutional Review Board (#202200495). Informed consent will be obtained from the working groups and the Delphi survey participants. The dissemination of PRECOG and its products will be done through journal publications, conferences, websites and social media.
Affiliation(s)
- Jie Xu
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
- Yi Guo
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medical College, Cornell University, New York City, New York, USA
- Hua Xu
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Robert Lucero
  - School of Nursing, University of California - Los Angeles, Los Angeles, California, USA
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
- Mattia Prosperi
  - Department of Epidemiology, University of Florida, Gainesville, Florida, USA
8
Hu L, Zou J, Gu C, Ji J, Lopez M, Kale M. A flexible sensitivity analysis approach for unmeasured confounding with multiple treatments and a binary outcome with application to SEER-Medicare lung cancer data. Ann Appl Stat 2022; 16:1014-1037. [PMID: 36644682 PMCID: PMC9835106 DOI: 10.1214/21-aoas1530] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 02/01/2023]
Abstract
In the absence of a randomized experiment, a key assumption for drawing causal inference about treatment effects is the ignorable treatment assignment. Violations of the ignorability assumption may lead to biased treatment effect estimates. Sensitivity analysis helps gauge how causal conclusions will be altered in response to the potential magnitude of departure from the ignorability assumption. However, sensitivity analysis approaches for unmeasured confounding in the context of multiple treatments and binary outcomes are scarce. We propose a flexible Monte Carlo sensitivity analysis approach for causal inference in such settings. We first derive the general form of the bias introduced by unmeasured confounding, with emphasis on theoretical properties uniquely relevant to multiple treatments. We then propose methods to encode the impact of unmeasured confounding on potential outcomes and adjust the estimates of causal effects in which the presumed unmeasured confounding is removed. Our proposed methods embed nested multiple imputation within the Bayesian framework, which allow for seamless integration of the uncertainty about the values of the sensitivity parameters and the sampling variability, as well as use of the Bayesian Additive Regression Trees for modeling flexibility. Expansive simulations validate our methods and gain insight into sensitivity analysis with multiple treatments. We use the SEER-Medicare data to demonstrate sensitivity analysis using three treatments for early stage non-small cell lung cancer. The methods developed in this work are readily available in the R package SAMTx.
Affiliation(s)
- Liangyuan Hu
  - Department of Biostatistics and Epidemiology, Rutgers University
- Jungang Zou
  - Department of Biostatistics, Columbia University
- Jiayi Ji
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai
- Minal Kale
  - Department of Medicine, Icahn School of Medicine at Mount Sinai
9
Lin JYJ, Hu L, Huang C, Jiayi J, Lawrence S, Govindarajulu U. A flexible approach for variable selection in large-scale healthcare database studies with missing covariate and outcome data. BMC Med Res Methodol 2022; 22:132. [PMID: 35508974 PMCID: PMC9066834 DOI: 10.1186/s12874-022-01608-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/04/2021] [Accepted: 04/19/2022] [Indexed: 12/17/2022]
Abstract
BACKGROUND Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can provide good performances achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach however is computationally expensive, especially on large-scale datasets. METHODS We propose an inference-based method, called RR-BART, which leverages the likelihood-based Bayesian machine learning technique, Bayesian additive regression trees, and uses Rubin's rule to combine the estimates and variances of the variable importance measures on multiply imputed datasets for variable selection in the presence of MAR data. We conduct a representative simulation study to investigate the practical operating characteristics of RR-BART, and compare it with the bootstrap imputation based methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome among middle-aged women using data from the Study of Women's Health Across the Nation (SWAN). RESULTS The simulation study suggests that even in complex conditions of nonlinearity and nonadditivity with a large percentage of missingness, RR-BART can reasonably recover both prediction and variable selection performances, achievable on the fully observed data. RR-BART provides the best performance that the bootstrap imputation based methods can achieve with the optimal selection threshold value. In addition, RR-BART demonstrates a substantially stronger ability of detecting discrete predictors. Furthermore, RR-BART offers substantial computational savings. When implemented on the SWAN data, RR-BART adds to the literature by selecting a set of predictors that had been less commonly identified as risk factors but had substantial biological justifications. 
CONCLUSION The proposed variable selection method for MAR data, RR-BART, offers both computational efficiency and good operating characteristics and is utilitarian in large-scale healthcare database studies.
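The abstract above relies on Rubin's rule to combine estimates and variances across multiply imputed datasets. A minimal sketch of that pooling step is given below. It is not the RR-BART implementation: the function name `rubin_pool` and the numbers are hypothetical, and each estimate is assumed to be a variable importance measure (with its variance) obtained from one imputed dataset.

```python
# Illustrative pooling with Rubin's rules across m imputed datasets.
# Hypothetical inputs: one estimate and one variance per imputed dataset.

def rubin_pool(estimates, variances):
    """Return (pooled estimate, total variance) under Rubin's rules."""
    m = len(estimates)
    pooled = sum(estimates) / m                      # average of the m estimates
    within = sum(variances) / m                      # average within-imputation variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between           # total variance
    return pooled, total


estimates = [0.10, 0.12, 0.08, 0.11, 0.09]
variances = [0.0004, 0.0004, 0.0004, 0.0004, 0.0004]

pooled, total = rubin_pool(estimates, variances)
print(pooled, total)
```

A variable could then be selected when an interval built from the pooled estimate and total variance excludes zero; the exact selection rule used by RR-BART is not stated in the abstract, so that step is left out here.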
Affiliation(s)
- Jung-Yi Joyce Lin
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Ave, New York, 10029, USA
- Liangyuan Hu
  - Department of Biostatistics and Epidemiology, Rutgers University, 683 Hoes Lane West, Piscataway, 08854, USA
- Chuyue Huang
  - Primary Research Solution LLC., 115 W 18th St, New York, 10011, USA
- Ji Jiayi
  - Department of Biostatistics and Epidemiology, Rutgers University, 683 Hoes Lane West, Piscataway, 08854, USA
- Steven Lawrence
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Ave, New York, 10029, USA
- Usha Govindarajulu
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Ave, New York, 10029, USA
10
Xu R, Chen G, Connor M, Murphy J. Novel Use of Patient-Specific Covariates From Oncology Studies in the Era of Biomedical Data Science: A Review of Latest Methodologies. J Clin Oncol 2022; 40:3546-3553. [PMID: 35258995 DOI: 10.1200/jco.21.01957] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
In this article, we review different applications of how to incorporate individual patient variables into clinical research within oncology. These methodologies range from the more traditional use of baseline covariates from randomized clinical trials, as well as observational studies, to using covariates to generalize the results of randomized clinical trials to other populations. Individual patient variables also allow for the consideration of heterogeneity in treatment effects and individualized treatment rules. We primarily consider two treatment groups and mostly focus on time-to-event outcomes where such methodologies have been well established and widely applied. We also discuss more conceptually newer statistical research that has not been widely applied in clinical oncology, but is likely to make an impact in future oncology research. With the increasing amount of biomedical data available for analysis, it is inevitable that more methods are developed to make best use of information, to advance oncology research.
Affiliation(s)
- Ronghui Xu
  - University of California, San Diego, San Diego, CA
- James Murphy
  - University of California, San Diego, San Diego, CA
11
Hu L, Joyce Lin JY, Ji J. Variable selection with missing data in both covariates and outcomes: Imputation and machine learning. Stat Methods Med Res 2021; 30:2651-2671. [PMID: 34696650 PMCID: PMC11181487 DOI: 10.1177/09622802211046385] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/15/2022]
Abstract
Variable selection in the presence of both missing covariates and outcomes is an important statistical research topic. Parametric regression models are susceptible to misspecification, and as a result are sub-optimal for variable selection. Flexible machine learning methods mitigate the reliance on the parametric assumptions, but do not provide a variable importance measure as naturally defined as the covariate effect native to parametric models. We investigate a general variable selection approach when both the covariates and outcomes can be missing at random and have general missing data patterns. This approach exploits the flexibility of machine learning models and bootstrap imputation, which is amenable to nonparametric methods in which the covariate effects are not directly available. We conduct expansive simulations investigating the practical operating characteristics of the proposed variable selection approach, when combined with four tree-based machine learning methods (extreme gradient boosting, random forests, Bayesian additive regression trees, and conditional random forests) and two commonly used parametric methods (lasso and backward stepwise selection). Numeric results suggest that extreme gradient boosting and Bayesian additive regression trees have the overall best variable selection performance with respect to the F1 score and Type I error, while the lasso and backward stepwise selection have subpar performance across various settings. There is no significant difference in the variable selection performance due to imputation methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome with data from the Study of Women's Health Across the Nation.
Affiliation(s)
- Liangyuan Hu
  - Department of Biostatistics and Epidemiology, Rutgers University School of Public Health, USA
- Jung-Yi Joyce Lin
  - Department of Population Health Science & Policy, Icahn School of Medicine at Mount Sinai, USA
- Jiayi Ji
  - Department of Population Health Science & Policy, Icahn School of Medicine at Mount Sinai, USA