1
Berkowitz M, Altman RM, Loughin TM. Random forests for survival data: which methods work best and under what conditions? Int J Biostat 2024; 0:ijb-2023-0056. [PMID: 38656274 DOI: 10.1515/ijb-2023-0056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Accepted: 02/26/2024] [Indexed: 04/26/2024]
Abstract
Few systematic comparisons of methods for constructing survival trees and forests exist in the literature. Importantly, when the goal is to predict a survival time or estimate a survival function, the optimal choice of method is unclear. We use an extensive simulation study to systematically investigate various factors that influence survival forest performance: forest construction method, censoring, sample size, distribution of the response, structure of the linear predictor, and presence of correlated or noisy covariates. In particular, we study 11 methods that have recently been proposed in the literature and identify 6 top performers. We find that all the factors that we investigate have a significant impact on the methods' relative accuracy of point predictions of survival times and survival function estimates. We use our results to make recommendations for which methods to use in a given context and offer explanations for the observed differences in relative performance.
Affiliation(s)
- Matthew Berkowitz
  - Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada
- Thomas M Loughin
  - Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada
2
Alakus C, Larocque D, Labbe A. Covariance regression with random forests. BMC Bioinformatics 2023; 24:258. [PMID: 37330468 DOI: 10.1186/s12859-023-05377-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Accepted: 06/02/2023] [Indexed: 06/19/2023] Open
Abstract
Capturing the conditional covariances or correlations among the elements of a multivariate response vector based on covariates is important to various fields including neuroscience, epidemiology and biomedicine. We propose a new method called Covariance Regression with Random Forests (CovRegRF) to estimate the covariance matrix of a multivariate response given a set of covariates, using a random forest framework. Random forest trees are built with a splitting rule specially designed to maximize the difference between the sample covariance matrix estimates of the child nodes. We also propose a significance test for the partial effect of a subset of covariates. We evaluate the performance of the proposed method and significance test through a simulation study which shows that the proposed method provides accurate covariance matrix estimates and that the Type-1 error is well controlled. An application of the proposed method to thyroid disease data is also presented. CovRegRF is implemented in a freely available R package on CRAN.
Affiliation(s)
- Cansu Alakus
  - Department of Decision Sciences, HEC Montréal, Montréal, Canada
- Denis Larocque
  - Department of Decision Sciences, HEC Montréal, Montréal, Canada
- Aurélie Labbe
  - Department of Decision Sciences, HEC Montréal, Montréal, Canada
3
Deep survival forests for extremely high censored data. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03846-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
4
Jia B, Zeng D, Liao JJZ, Liu GF, Tan X, Diao G, Ibrahim JG. Mixture survival trees for cancer risk classification. LIFETIME DATA ANALYSIS 2022; 28:356-379. [PMID: 35486260 PMCID: PMC10402927 DOI: 10.1007/s10985-022-09552-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/22/2021] [Accepted: 03/04/2022] [Indexed: 06/14/2023]
Abstract
In oncology studies, it is important to understand and characterize disease heterogeneity among patients so that patients can be classified into different risk groups and one can identify high-risk patients at the right time. This information can then be used to identify a more homogeneous patient population for developing precision medicine. In this paper, we propose a mixture survival tree approach for direct risk classification. We assume that the patients can be classified into a pre-specified number of risk groups, where each group has a distinct survival profile. Our proposed tree-based methods are devised to estimate latent group membership using an EM algorithm. The observed data log-likelihood function is used as the splitting criterion in recursive partitioning. The finite sample performance is evaluated by extensive simulation studies and the proposed method is illustrated by a case study in breast cancer.
Affiliation(s)
- Beilin Jia
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Donglin Zeng
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Guanghan F Liu
  - Biostatistics and Research Decision Sciences, Merck & Co., Inc, North Wales, PA, USA
- Xianming Tan
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Guoqing Diao
  - Department of Biostatistics and Bioinformatics, The George Washington University, Washington, DC, USA
- Joseph G Ibrahim
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
5
Alakuş C, Larocque D, Jacquemont S, Barlaam F, Martin CO, Agbogba K, Lippé S, Labbe A. Conditional canonical correlation estimation based on covariates with random forests. Bioinformatics 2021; 37:2714-2721. [PMID: 33693547 DOI: 10.1093/bioinformatics/btab158] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/20/2020] [Revised: 02/03/2021] [Accepted: 03/03/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Investigating the relationships between two sets of variables helps to understand their interactions and can be done with canonical correlation analysis (CCA). However, the correlation between the two sets can sometimes depend on a third set of covariates, often subject-related ones such as age, gender, or other clinical measures. In this case, applying CCA to the whole population is not optimal and methods to estimate conditional CCA, given the covariates, can be useful. RESULTS: We propose a new method called Random Forest with Canonical Correlation Analysis (RFCCA) to estimate the conditional canonical correlations between two sets of variables given subject-related covariates. The individual trees in the forest are built with a splitting rule specifically designed to partition the data to maximize the canonical correlation heterogeneity between child nodes. We also propose a significance test to detect the global effect of the covariates on the relationship between two sets of variables. The performance of the proposed method and the global significance test is evaluated through simulation studies that show it provides accurate canonical correlation estimations and well-controlled Type-1 error. We also show an application of the proposed method with EEG data. AVAILABILITY: RFCCA is implemented in a freely available R package on CRAN (https://CRAN.R-project.org/package=RFCCA). SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online.
Affiliation(s)
- Cansu Alakuş
  - Department of Decision Sciences, HEC Montréal, Montréal, QC H3T 2A7, Canada
- Denis Larocque
  - Department of Decision Sciences, HEC Montréal, Montréal, QC H3T 2A7, Canada
- Sébastien Jacquemont
  - Department of Pediatrics, Université de Montréal, Montréal, QC H3T 1C5, Canada; CHU Sainte-Justine Research Center, Université de Montréal, Montréal, QC H3T 1C5, Canada
- Fanny Barlaam
  - CHU Sainte-Justine Research Center, Université de Montréal, Montréal, QC H3T 1C5, Canada
- Charles-Olivier Martin
  - CHU Sainte-Justine Research Center, Université de Montréal, Montréal, QC H3T 1C5, Canada
- Kristian Agbogba
  - CHU Sainte-Justine Research Center, Université de Montréal, Montréal, QC H3T 1C5, Canada
- Sarah Lippé
  - Department of Psychology, Université de Montréal, Montréal, QC H3T 1J4, Canada; CHU Sainte-Justine Research Center, Université de Montréal, Montréal, QC H3T 1C5, Canada
- Aurélie Labbe
  - Department of Decision Sciences, HEC Montréal, Montréal, QC H3T 2A7, Canada
6
Tabib S, Larocque D. Non-parametric individual treatment effect estimation for survival data with random forests. Bioinformatics 2020; 36:629-636. [PMID: 31373350 DOI: 10.1093/bioinformatics/btz602] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/14/2019] [Revised: 07/06/2019] [Accepted: 07/30/2019] [Indexed: 01/19/2023] Open
Abstract
MOTIVATION: Personalized medicine often relies on accurate estimation of a treatment effect for specific subjects. This estimation can be based on the subject's baseline covariates but additional complications arise for a time-to-event response subject to censoring. In this paper, the treatment effect is measured as the difference between the mean survival time of a treated subject and the mean survival time of a control subject. We propose a new random forest method for estimating the individual treatment effect with survival data. The random forest is formed by individual trees built with a splitting rule specifically designed to partition the data according to the individual treatment effect. For a new subject, the forest provides a set of similar subjects from the training dataset that can be used to compute an estimation of the individual treatment effect with any adequate method. RESULTS: The merits of the proposed method are investigated with a simulation study where it is compared to numerous competitors, including recent state-of-the-art methods. The results indicate that the proposed method has a very good and stable performance in estimating the individual treatment effects. Two application examples, with colon cancer data and breast cancer data, show that the proposed method can detect a treatment effect in a sub-population even when the overall effect is small or nonexistent. AVAILABILITY AND IMPLEMENTATION: The authors are working on an R package implementing the proposed method and it will be available soon. In the meantime, the code can be obtained from the first author at sami.tabib@hec.ca. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
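As a rough illustration of the treatment-effect definition in this abstract (a difference in mean survival time between a treated and a control subject), the sketch below takes a set of "similar subjects", estimates a restricted mean survival time from a Kaplan-Meier curve in each arm, and returns the difference. This is not the authors' code. The function names `km_restricted_mean` and `ite_estimate`, the truncation horizon `tau`, and the choice of a Kaplan-Meier restricted mean as the "adequate method" are all assumptions made for illustration.

```python
def km_restricted_mean(times, events, tau):
    """Area under the Kaplan-Meier curve on [0, tau] (hypothetical helper).

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if right-censored
    tau    : truncation horizon (an assumption; the mean is restricted to [0, tau])
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, area, prev_t = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        d = 0  # events at time t
        m = 0  # subjects leaving the risk set at time t (events + censored)
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            m += 1
            i += 1
        area += surv * (t - prev_t)       # the curve is flat between observed times
        if d > 0:
            surv *= 1.0 - d / n_at_risk   # Kaplan-Meier step at an event time
        n_at_risk -= m
        prev_t = t
    # If the last observation is censored, the curve is carried forward flat to tau.
    area += surv * (tau - prev_t)
    return area


def ite_estimate(neighbours, tau):
    """Difference in restricted mean survival, treated minus control.

    neighbours : list of (time, event, treated) tuples for the similar subjects
    """
    treated = [(t, e) for t, e, a in neighbours if a == 1]
    control = [(t, e) for t, e, a in neighbours if a == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control neighbour")
    rm_treated = km_restricted_mean([t for t, _ in treated], [e for _, e in treated], tau)
    rm_control = km_restricted_mean([t for t, _ in control], [e for _, e in control], tau)
    return rm_treated - rm_control
```

With uncensored data and `tau` beyond the last event, the restricted mean reduces to the ordinary sample mean, which makes the sketch easy to sanity-check by hand.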
Affiliation(s)
- Sami Tabib
  - Department of Decision Sciences, HEC Montréal, Montréal, QC H3T 2A7, Canada
- Denis Larocque
  - Department of Decision Sciences, HEC Montréal, Montréal, QC H3T 2A7, Canada
7
Korepanova N, Seibold H, Steffen V, Hothorn T. Survival forests under test: Impact of the proportional hazards assumption on prognostic and predictive forests for amyotrophic lateral sclerosis survival. Stat Methods Med Res 2020; 29:1403-1419. [PMID: 31304888 DOI: 10.1177/0962280219862586] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/13/2022]
Abstract
We investigate the effect of the proportional hazards assumption on prognostic and predictive models of the survival time of patients suffering from amyotrophic lateral sclerosis. We theoretically compare the underlying model formulations of several variants of survival forests and implementations thereof, including random forests for survival, conditional inference forests, Ranger, and survival forests with L1 splitting, with two novel variants, namely distributional and transformation survival forests. Theoretical considerations explain the low power of log-rank-based splitting in detecting patterns in non-proportional hazards situations in survival trees and corresponding forests. This limitation can potentially be overcome by the alternative split procedures suggested herein. We empirically investigated this effect using simulation experiments and a re-analysis of the Pooled Resource Open-Access ALS Clinical Trials database of amyotrophic lateral sclerosis survival, giving special emphasis to both prognostic and predictive models.
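For readers unfamiliar with log-rank-based splitting, the following sketch computes a standardized two-sample log-rank statistic for one candidate split (group 1 versus group 0). It is a textbook formulation, not code from any of the packages named in the abstract, and the function name `logrank_z` is hypothetical. The statistic sums signed observed-minus-expected event counts over time, so contributions of opposite sign at early and late times can offset each other, which is one way to see why its power may drop when hazards are not proportional.

```python
import math


def logrank_z(times, events, groups):
    """Standardized two-sample log-rank statistic (group 1 vs group 0).

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if right-censored
    groups : 1 or 0, e.g. the two child nodes of a candidate split
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    o_minus_e, var = 0.0, 0.0
    for tj in event_times:
        # Risk set sizes just before tj
        n = sum(1 for t in times if t >= tj)
        n1 = sum(1 for t, g in zip(times, groups) if t >= tj and g == 1)
        # Events at tj, overall and in group 1
        d = sum(1 for t, e in zip(times, events) if t == tj and e == 1)
        d1 = sum(1 for t, e, g in zip(times, events, groups)
                 if t == tj and e == 1 and g == 1)
        # Observed minus expected events in group 1
        o_minus_e += d1 - d * n1 / n
        # Hypergeometric variance term (undefined when only one subject is at risk)
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e / math.sqrt(var) if var > 0 else 0.0
```

A tree would typically evaluate the absolute value (or square) of this statistic over candidate splits and keep the largest; that selection loop is omitted here.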
Affiliation(s)
- Natalia Korepanova
  - International Laboratory for Intelligent Systems and Structural Analysis, Faculty of Computer Science, National Research University Higher School of Economics, Russia
- Heidi Seibold
  - Institut für Statistik, Ludwig-Maximilians-Universität München, Germany
- Torsten Hothorn
  - Institut für Epidemiologie, Biostatistik und Prävention, Universität Zürich, Switzerland
8
Schmid M, Welchowski T, Wright MN, Berger M. Discrete-time survival forests with Hellinger distance decision trees. Data Min Knowl Discov 2020. [DOI: 10.1007/s10618-020-00682-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/15/2022]
Abstract
Random survival forests (RSF) are a powerful nonparametric method for building prediction models with a time-to-event outcome. RSF do not rely on the proportional hazards assumption and can be readily applied to both low- and higher-dimensional data. A remaining limitation of RSF, however, arises from the fact that the method is almost entirely focussed on continuously measured event times. This issue may become problematic in studies where time is measured on a discrete scale $t = 1, 2, \ldots$, referring to time intervals $[0,a_1), [a_1,a_2), \ldots$. In this situation, the application of methods designed for continuous time-to-event data may lead to biased estimators and inaccurate predictions if discreteness is ignored. To address this issue, we develop a RSF algorithm that is specifically designed for the analysis of (possibly right-censored) discrete event times. The algorithm is based on an ensemble of discrete-time survival trees that operate on transformed versions of the original time-to-event data using tree methods for binary classification. As the outcome variable in these trees is typically highly imbalanced, our algorithm implements a node splitting strategy based on Hellinger’s distance, which is a skew-insensitive alternative to classical split criteria such as the Gini impurity. The new algorithm thus provides flexible nonparametric predictions of individual-specific discrete hazard and survival functions. Our numerical results suggest that node splitting by Hellinger’s distance improves predictive performance when compared to the Gini impurity. Furthermore, discrete-time RSF improve prediction accuracy when compared to RSF approaches treating discrete event times as continuous in situations where the number of time intervals is small.
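The two ingredients mentioned in this abstract can be sketched as follows. The first function shows one common way to turn discrete event times into binary classification rows (a "person-period" expansion); the second computes a Hellinger distance for a binary split in the form usually given for Hellinger distance decision trees. Both are assumptions about what the abstract describes, not the authors' implementation, and the names `person_period` and `hellinger_split` are hypothetical.

```python
import math


def person_period(times, events):
    """Expand discrete (time, event) pairs into binary rows (subject, interval, y).

    A subject observed up to interval t contributes rows for s = 1..t, with
    y = 1 only in the last interval and only if the event was observed there.
    """
    rows = []
    for i, (t, d) in enumerate(zip(times, events)):
        for s in range(1, t + 1):
            y = 1 if (s == t and d == 1) else 0
            rows.append((i, s, y))
    return rows


def hellinger_split(y, go_left):
    """Hellinger distance between the class-conditional left/right distributions.

    y       : binary labels in the parent node
    go_left : booleans, True if the row is sent to the left child
    Class proportions are taken within each class, so the value does not
    depend on how rare the positive class is.
    """
    pos = sum(y)
    neg = len(y) - pos
    if pos == 0 or neg == 0:
        return 0.0  # a pure node cannot be split informatively
    left_pos = sum(1 for yi, g in zip(y, go_left) if g and yi == 1)
    left_neg = sum(1 for yi, g in zip(y, go_left) if g and yi == 0)
    right_pos = pos - left_pos
    right_neg = neg - left_neg
    return math.sqrt(
        (math.sqrt(left_pos / pos) - math.sqrt(left_neg / neg)) ** 2
        + (math.sqrt(right_pos / pos) - math.sqrt(right_neg / neg)) ** 2
    )
```

Under this formulation the distance ranges from 0 (the split sends the same fraction of each class left) to the square root of 2 (the split separates the classes perfectly).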
9
Sun Y, Chiou SH, Wang MC. ROC-guided survival trees and ensembles. Biometrics 2020; 76:1177-1189. [PMID: 31880315 DOI: 10.1111/biom.13213] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/08/2018] [Revised: 10/01/2019] [Accepted: 12/17/2019] [Indexed: 01/08/2023]
Abstract
Tree-based methods are popular nonparametric tools for studying time-to-event outcomes. In this article, we introduce a novel framework for survival trees and ensembles, where the trees partition the dynamic survivor population and can handle time-dependent covariates. Using the idea of randomized tests, we develop generalized time-dependent receiver operating characteristic (ROC) curves for evaluating the performance of survival trees. The tree-building algorithm is guided by decision-theoretic criteria based on ROC, specifically targeting prediction accuracy. To address the instability issue of a single tree, we propose a novel ensemble procedure based on averaging martingale estimating equations, which is different from existing methods that average the predicted survival or cumulative hazard functions from individual trees. Extensive simulation studies are conducted to examine the performance of the proposed methods. We apply the methods to a study on AIDS for illustration.
Affiliation(s)
- Yifei Sun
  - Department of Biostatistics, Columbia Mailman School of Public Health, New York, New York
- Sy Han Chiou
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas
- Mei-Cheng Wang
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland
10
Wongvibulsin S, Wu KC, Zeger SL. Clinical risk prediction with random forests for survival, longitudinal, and multivariate (RF-SLAM) data analysis. BMC Med Res Methodol 2019; 20:1. [PMID: 31888507 PMCID: PMC6937754 DOI: 10.1186/s12874-019-0863-0] [Citation(s) in RCA: 62] [Impact Index Per Article: 12.4] [Received: 05/08/2019] [Accepted: 11/08/2019] [Indexed: 12/23/2022] Open
Abstract
Background Clinical research and medical practice can be advanced through the prediction of an individual’s health state, trajectory, and responses to treatments. However, the majority of current clinical risk prediction models are based on regression approaches or machine learning algorithms that are static, rather than dynamic. To benefit from the increasing emergence of large, heterogeneous data sets, such as electronic health records (EHRs), novel tools to support improved clinical decision making through methods for individual-level risk prediction that can handle multiple variables, their interactions, and time-varying values are necessary. Methods We introduce a novel dynamic approach to clinical risk prediction for survival, longitudinal, and multivariate (SLAM) outcomes, called random forest for SLAM data analysis (RF-SLAM). RF-SLAM is a continuous-time, random forest method for survival analysis that combines the strengths of existing statistical and machine learning methods to produce individualized Bayes estimates of piecewise-constant hazard rates. We also present a method-agnostic approach for time-varying evaluation of model performance. Results We derive and illustrate the method by predicting sudden cardiac arrest (SCA) in the Left Ventricular Structural (LV) Predictors of Sudden Cardiac Death (SCD) Registry. We demonstrate superior performance relative to standard random forest methods for survival data. We illustrate the importance of the number of preceding heart failure hospitalizations as a time-dependent predictor in SCA risk assessment. Conclusions RF-SLAM is a novel statistical and machine learning method that improves risk prediction by incorporating time-varying information and accommodating a large number of predictors, their interactions, and missing values. 
RF-SLAM is designed to easily extend to simultaneous predictions of multiple, possibly competing, events and/or repeated measurements of discrete or continuous variables over time. Trial registration: LV Structural Predictors of SCD Registry (clinicaltrials.gov, NCT01076660), retrospectively registered 25 February 2010.
Affiliation(s)
- Shannon Wongvibulsin
  - Department of Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, USA
- Katherine C Wu
  - Department of Medicine, Division of Cardiology, Johns Hopkins School of Medicine, Baltimore, USA
- Scott L Zeger
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
11
Abstract
The classical and most commonly used approach to building prediction intervals is the parametric approach. However, its main drawback is that its validity and performance highly depend on the assumed functional link between the covariates and the response. This research investigates new methods that improve the performance of prediction intervals with random forests. Two aspects are explored: the method used to build the forest and the method used to build the prediction interval. Four methods to build the forest are investigated, three from the classification and regression tree (CART) paradigm and the transformation forest method. For CART forests, in addition to the default least-squares splitting rule, two alternative splitting criteria are investigated. We also present and evaluate the performance of five flexible methods for constructing prediction intervals. This yields 20 distinct method variations. To reliably attain the desired confidence level, we include a calibration procedure performed on the out-of-bag information provided by the forest. The 20 method variations are thoroughly investigated, and compared to five alternative methods through simulation studies and in real data settings. The results show that the proposed methods are very competitive. They outperform commonly used methods both in simulation settings and with real data.
12
Nasejje JB, Mwambi H. Application of random survival forests in understanding the determinants of under-five child mortality in Uganda in the presence of covariates that satisfy the proportional and non-proportional hazards assumption. BMC Res Notes 2017; 10:459. [PMID: 28882171 PMCID: PMC5590231 DOI: 10.1186/s13104-017-2775-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 05/01/2016] [Accepted: 08/31/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Uganda, just like any other Sub-Saharan African country, has a high under-five child mortality rate. To inform policy on intervention strategies, sound statistical methods are required to critically identify factors strongly associated with under-five child mortality rates. The Cox proportional hazards model has been a common choice in analysing data to understand factors strongly associated with high child mortality rates, taking age as the time-to-event variable. However, due to its restrictive proportional hazards (PH) assumption, some covariates of interest which do not satisfy the assumption are often excluded from the analysis to avoid mis-specifying the model. Otherwise, using covariates that clearly violate the assumption would mean invalid results. METHODS: Survival trees and random survival forests are becoming increasingly popular in analysing survival data, particularly in the case of large survey data, and could be attractive alternatives to models with the restrictive PH assumption. In this article, we adopt random survival forests, which have never been used in understanding factors affecting under-five child mortality rates in Uganda using Demographic and Health Survey data. Thus the first part of the analysis is based on the use of the classical Cox PH model and the second part of the analysis is based on the use of random survival forests in the presence of covariates that do not necessarily satisfy the PH assumption. RESULTS: Random survival forests and the Cox proportional hazards model agree that the sex of the household head, sex of the child, and number of births in the past year are strongly associated with under-five child mortality in Uganda, given that all three covariates satisfy the PH assumption. Random survival forests further demonstrated that covariates that were originally excluded from the earlier analysis due to violation of the PH assumption were important in explaining under-five child mortality rates.
These covariates include the number of children under the age of five in a household, number of births in the past 5 years, wealth index, total number of children ever born and the child's birth order. The results further indicated that the predictive performance for random survival forests built using covariates including those that violate the PH assumption was higher than that for random survival forests built using only covariates that satisfy the PH assumption. CONCLUSIONS Random survival forests are appealing methods in analysing public health data to understand factors strongly associated with under-five child mortality rates especially in the presence of covariates that violate the proportional hazards assumption.
Affiliation(s)
- Justine B Nasejje
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, 250 King Edward Avenue, Scottsville, Pietermaritzburg, 3201, South Africa
- Henry Mwambi
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Scottsville, Pietermaritzburg, 3209, South Africa
13
Nasejje JB, Mwambi H, Dheda K, Lesosky M. A comparison of the conditional inference survival forest model to random survival forests based on a simulation study as well as on two applications with time-to-event data. BMC Med Res Methodol 2017; 17:115. [PMID: 28754093 PMCID: PMC5534080 DOI: 10.1186/s12874-017-0383-8] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 02/17/2017] [Accepted: 06/30/2017] [Indexed: 11/13/2022] Open
Abstract
Background: Random survival forest (RSF) models have been identified as alternative methods to the Cox proportional hazards model in analysing time-to-event data. These methods, however, have been criticised for the bias that results from favouring covariates with many split-points and hence conditional inference forests for time-to-event data have been suggested. Conditional inference forests (CIF) are known to correct the bias in RSF models by separating the procedure for selecting the best covariate to split on from that of the best split point search for the selected covariate. Methods: In this study, we compare the random survival forest model to the conditional inference forest (CIF) model using twenty-two simulated time-to-event datasets. We also analysed two real time-to-event datasets. The first dataset is based on the survival of children under five years of age in Uganda and it consists of categorical covariates with most of them having more than two levels (many split-points). The second dataset is based on the survival of patients with extremely drug resistant tuberculosis (XDR TB) which consists of mainly categorical covariates with two levels (few split-points). Results: The study findings indicate that the conditional inference forest model is superior to random survival forest models in analysing time-to-event data that consists of covariates with many split-points based on the values of the bootstrap cross-validated estimates for integrated Brier scores. However, conditional inference forests perform comparably to random survival forest models in analysing time-to-event data consisting of covariates with fewer split-points. Conclusion: Although survival forests are promising methods in analysing time-to-event data, it is important to identify the best forest model for analysis based on the nature of covariates of the dataset in question.
Affiliation(s)
- Justine B Nasejje
  - School of Statistics, Mathematics and Computer Science, University of Kwazulu-Natal, Pietermaritzburg, South Africa
- Henry Mwambi
  - School of Statistics, Mathematics and Computer Science, University of Kwazulu-Natal, Pietermaritzburg, South Africa
- Keertan Dheda
  - Division of Pulmonology and UCT Lung Institute, Department of Medicine, University of Cape Town, Cape Town, South Africa
- Maia Lesosky
  - Division of Epidemiology and Biostatistics, School of Public Health and Family Medicine, University of Cape Town, Cape Town, South Africa
14
Wang H, Li G. A Selective Review on Random Survival Forests for High Dimensional Data. QUANTITATIVE BIO-SCIENCE 2017; 36:85-96. [PMID: 30740388 PMCID: PMC6364686 DOI: 10.22283/qbs.2017.36.2.85] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Indexed: 11/24/2022]
Abstract
Over the past decades, there has been considerable interest in applying statistical machine learning methods in survival analysis. Ensemble based approaches, especially random survival forests, have been developed in a variety of contexts due to their high precision and non-parametric nature. This article aims to provide a timely review on recent developments and applications of random survival forests for time-to-event data with high dimensional covariates. This selective review begins with an introduction to the random survival forest framework, followed by a survey of recent developments on splitting criteria, variable selection, and other advanced topics of random survival forests for time-to-event data in high dimensional settings. We also discuss potential directions for future research.
Affiliation(s)
- Hong Wang
  - School of Mathematics and Statistics, Central South University, Hunan 410083, China
- Gang Li
  - Department of Biostatistics and Biomathematics, School of Public Health, University of California at Los Angeles, CA 90095, USA