1
Li S, Ni Z, Zhao Y, Hu W, Long Z, Ma H, Zhou G, Luo Y, Geng C. Susceptibility Analysis of Geohazards in the Longmen Mountain Region after the Wenchuan Earthquake. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2022; 19:ijerph19063229. [PMID: 35328915 PMCID: PMC8953272 DOI: 10.3390/ijerph19063229] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/24/2021] [Revised: 03/02/2022] [Accepted: 03/04/2022] [Indexed: 12/10/2022]
Abstract
Multitemporal geohazard susceptibility analysis can not only provide reliable results but can also help identify differences in the mechanisms of different factors under different temporal and spatial settings, so that geohazards can be prevented and controlled more precisely. Here, we studied the 12 counties (cities) that were severely affected by the Wenchuan earthquake of 12 May 2008. Our study was divided into four time periods: 2008, 2009–2012, 2013, and 2014–2017. Common geohazards in the study area, such as landslides, collapses and debris flows, were taken into account. We constructed a geohazard susceptibility index evaluation system that included topography, geology, land cover, meteorology, hydrology, and human activities. Then we used a random forest model to study the changes in geohazard susceptibility during the Wenchuan earthquake and the following ten years, together with their driving mechanisms. We had four main findings. (1) The susceptibility of geohazards from 2008 to 2017 gradually increased and their spatial distribution was significantly correlated with the main faults and rivers. (2) The Yingxiu-Beichuan Fault, the western section of the Jiangyou-Dujiangyan Fault, and the Minjiang and Fujiang rivers were highly susceptible to geohazards, and changes in geohazard susceptibility mainly occurred along the Pingwu-Qingchuan Fault, the eastern section of the Jiangyou-Dujiangyan Fault, and the riparian areas of the Mianyuan River, Zagunao River, Tongkou River, Baicao River, and other secondary rivers. (3) The relative contribution of topographic factors to geohazards in the four different periods was stable, that of geological factors slowly decreased, and that of meteorological and hydrological factors increased. In addition, the impact of land cover in 2008 was more significant than during other periods, and the impact of human activities had an upward trend from 2008 to 2017.
(4) Elevation and slope had significant topographical effects, coupled with the geological environmental effects of engineering rock groups and faults, and river-derived effects, which resulted in a spatial aggregation of geohazard susceptibility. We attributed the dynamic changes in the areas that were highly susceptible to geohazards around the faults and rivers to the changes in the intensity of earthquakes and precipitation in different periods.
Affiliation(s)
- Shuai Li
  - College of Tourism and Urban-Rural Planning, Chengdu University of Technology, Chengdu 610059, China
- Zhongyun Ni
  - College of Earth Sciences, Chengdu University of Technology, Chengdu 610059, China
  - School of Geography, Archaeology & Irish Studies, National University of Ireland, H91 CF50 Galway, Ireland
  - Corresponding author
- Yinbing Zhao
  - College of Tourism and Urban-Rural Planning, Chengdu University of Technology, Chengdu 610059, China
  - School of Geography, Archaeology & Irish Studies, National University of Ireland, H91 CF50 Galway, Ireland
  - Human Geography Research Center of Qinghai-Tibet Plateau and Its Eastern Margin, Chengdu University of Technology, Chengdu 610059, China
- Wei Hu
  - College of Tourism and Urban-Rural Planning, Chengdu University of Technology, Chengdu 610059, China
- Zhenrui Long
  - Sichuan Research Institute of Ecological Restoration of Land Space and Geohazard Prevention and Control, Sichuan Provincial Department of Natural Resources, Chengdu 610063, China
- Haiyu Ma
  - College of Information, Shanghai Ocean University, Shanghai 201306, China
- Guoli Zhou
  - College of Tourism and Urban-Rural Planning, Chengdu University of Technology, Chengdu 610059, China
- Yuhao Luo
  - College of Tourism and Urban-Rural Planning, Chengdu University of Technology, Chengdu 610059, China
- Chuntao Geng
  - College of Tourism and Urban-Rural Planning, Chengdu University of Technology, Chengdu 610059, China
2
Zhang L, Kim I. Finite mixtures of semiparametric Bayesian survival kernel machine regressions: Application to breast cancer gene pathway subgroup analysis. J R Stat Soc Ser C Appl Stat 2020. [DOI: 10.1111/rssc.12457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Affiliation(s)
- Lin Zhang
  - Department of Statistics, Virginia Tech, Blacksburg, VA, USA
- Inyoung Kim
  - Department of Statistics, Virginia Tech, Blacksburg, VA, USA
3
Yan KK, Wang X, Lam WWT, Vardhanabhuti V, Lee AWM, Pang HH. Radiomics analysis using stability selection supervised component analysis for right-censored survival data. Comput Biol Med 2020; 124:103959. [PMID: 32905923 PMCID: PMC7501167 DOI: 10.1016/j.compbiomed.2020.103959] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/19/2020] [Revised: 08/02/2020] [Accepted: 08/03/2020] [Indexed: 02/03/2023]
Abstract
Radiomics is a newly emerging field that involves the extraction of large numbers of quantitative features from biomedical images by using data-characterization algorithms. Distinctive imaging features identified from biomedical images can be used for prognosis and therapeutic response prediction, and they can provide a noninvasive approach for personalized therapy. So far, many published radiomics studies use existing out-of-the-box algorithms, which are not specific to radiomics data, to identify prognostic markers from biomedical images. To better utilize biomedical images, we propose a novel machine learning approach, stability selection supervised principal component analysis (SSSuperPCA), that identifies stable features from radiomics big data coupled with dimension reduction for right-censored survival outcomes. The proposed approach allows us to identify a set of stable features that are highly associated with the survival outcomes in a simple yet meaningful manner, while controlling the per-family error rate. We evaluate the performance of SSSuperPCA using simulations and real data sets for non-small cell lung cancer and head and neck cancer, and compare it with other machine learning algorithms. The results demonstrate that our method has a competitive edge over other existing methods in identifying the prognostic markers from biomedical imaging data for the prediction of right-censored survival outcomes.
Affiliation(s)
- Kang K Yan
  - School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Xiaofei Wang
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
- Wendy W T Lam
  - School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Jockey Club Institute of Cancer Care, Li Ka Shing Faculty of Medicine, Hong Kong SAR, China
- Varut Vardhanabhuti
  - Department of Diagnostic Radiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Anne W M Lee
  - Department of Clinical Oncology, The University of Hong Kong-Shenzhen Hospital and The University of Hong Kong, Hong Kong SAR, China
- Herbert H Pang
  - School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
4
Wang Y, Sun D, Wen H, Zhang H, Zhang F. Comparison of Random Forest Model and Frequency Ratio Model for Landslide Susceptibility Mapping (LSM) in Yunyang County (Chongqing, China). INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2020; 17:ijerph17124206. [PMID: 32545618 PMCID: PMC7345078 DOI: 10.3390/ijerph17124206] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 05/12/2020] [Revised: 06/09/2020] [Accepted: 06/10/2020] [Indexed: 12/05/2022]
Abstract
To compare the random forest (RF) model and the frequency ratio (FR) model for landslide susceptibility mapping (LSM), this research selected Yunyang County as the study area for its frequent natural disasters, especially landslides. A landslide inventory was built from historical records, satellite images, and extensive field surveys. Subsequently, a geospatial database was established based on 987 historical landslides in the study area. Then, all the landslides were randomly divided into two datasets: 70% of them were used as the training dataset and 30% as the test dataset. Furthermore, under five primary conditioning factors (i.e., topography factors, geological factors, environmental factors, human engineering activities, and triggering factors), 22 secondary conditioning factors were selected to form an evaluation factor library for analyzing the landslide susceptibility. On this basis, the RF model training and the FR model mathematical analysis were performed, and the established models were used for the landslide susceptibility simulation in the entire area of Yunyang County. Next, based on the analysis results, the susceptibility maps were divided into five classes: very low, low, medium, high, and very high. In addition, the importance of conditioning factors was ranked and the influence of landslides was explored by using the RF model. The area under the curve (AUC) value of the receiver operating characteristic (ROC) curve, precision, accuracy, and recall ratio were used to analyze the predictive ability of the above two LSM models. The results indicated a difference in the performances between the two models. The RF model (AUC = 0.988) performed better than the FR model (AUC = 0.716). Moreover, compared with the FR model, the RF model showed a higher coincidence degree between the areas in the high and the very low susceptibility classes, on the one hand, and the geographical spatial distribution of historical landslides, on the other hand.
Therefore, it was concluded that the RF model was more suitable for landslide susceptibility evaluation in Yunyang County, because of its significant model performance, reliability, and stability. The outcome also provided a theoretical basis for application of machine learning techniques (e.g., RF) in landslide prevention, mitigation, and urban planning, so as to deliver an adequate response to the increasing demand for effective and low-cost tools in landslide susceptibility assessments.
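The frequency ratio side of the comparison above can be illustrated with a minimal sketch. It assumes the commonly used FR definition (a class's share of landslide cells divided by its share of all grid cells); the function name, the slope classes, and the counts are hypothetical and are not taken from the study.

```python
from typing import Dict, Tuple


def frequency_ratio(class_counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Return the frequency ratio for each class of one conditioning factor.

    class_counts maps a class label to (landslide_cells, total_cells).
    FR = (landslide share of the class) / (area share of the class).
    """
    total_landslides = sum(ls for ls, _ in class_counts.values())
    total_cells = sum(cells for _, cells in class_counts.values())
    if total_landslides == 0 or total_cells == 0:
        raise ValueError("need at least one landslide cell and one grid cell")
    ratios = {}
    for label, (ls, cells) in class_counts.items():
        landslide_share = ls / total_landslides
        area_share = cells / total_cells
        # A class with no area cannot host landslides; report 0.0 for it.
        ratios[label] = landslide_share / area_share if area_share > 0 else 0.0
    return ratios


# Hypothetical slope classes: (landslide cells, all cells in the class).
toy_slope = {"0-10": (10, 500), "10-25": (60, 300), "25+": (30, 200)}
fr = frequency_ratio(toy_slope)
```

Under this definition, a ratio above 1 marks a class in which landslides are over-represented relative to its area, and a ratio below 1 marks an under-represented class.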
Affiliation(s)
- Yue Wang
  - The Key Laboratory of GIS Application Research, Chongqing Normal University, Chongqing 401331, China
- Deliang Sun
  - The Key Laboratory of GIS Application Research, Chongqing Normal University, Chongqing 401331, China
  - Correspondence: Tel. +86-158-2356-5622 (D.S.); +86-132-5132-1327 (H.W.)
- Haijia Wen
  - Key Laboratory of New Technology for Construction of Cities in Mountain Area, Ministry of Education, Chongqing 400045, China
  - National Joint Engineering Research Center of Geohazards Prevention in the Reservoir Areas, Chongqing 400044, China
  - School of Civil Engineering, Chongqing University, Chongqing 400045, China
  - Correspondence: Tel. +86-158-2356-5622 (D.S.); +86-132-5132-1327 (H.W.)
- Hong Zhang
  - The Key Laboratory of GIS Application Research, Chongqing Normal University, Chongqing 401331, China
- Fengtai Zhang
  - School of Management, Chongqing University of Technology, Chongqing 400054, China
5
Predictive Features of Thymic Carcinoma and High-Risk Thymomas Using Random Forest Analysis. J Comput Assist Tomogr 2020; 44:857-864. [PMID: 31996651 DOI: 10.1097/rct.0000000000000953] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/26/2022]
Abstract
PURPOSE To determine the predictive features of thymic carcinomas and high-risk thymomas using a random forest algorithm. METHODS A total of 137 patients with pathologically confirmed high-risk thymomas and thymic carcinomas were enrolled in this study. Three clinical features and 20 computed tomography features were reviewed. The association between computed tomography features and pathological patterns was analyzed by univariate analysis and random forest. The predictive efficiency of the random forest algorithm was evaluated by receiver operating characteristic curve analysis. RESULTS There were 92 thymic carcinomas and 45 high-risk thymomas in this study. In univariate analysis, patient age, presence of myasthenia gravis, lesion shape, enhancement pattern, presence of necrosis or cystic change, mediastinal invasion, vessel invasion, lymphadenopathy, pericardial effusion, and distant organ metastasis were found to be statistically different between high-risk thymomas and thymic carcinomas (all P < 0.01). Random forest suggested that tumor shape, lymphadenopathy, and the presence of pericardial effusion were the key features in tumor differentiation. The predictive accuracy for the test data and whole data was 94.73% and 96.35%, respectively. Further receiver operating characteristic curve analysis showed the area under the curve was 0.957 (95% confidence interval, 0.929-0.986). CONCLUSIONS The random forest model in the present study has high efficiency in predictive diagnosis of thymic carcinomas and high-risk thymomas. Tumor shape, lymphadenopathy, and pericardial effusion are the key features for tumor differentiation. Thymic tumors with irregular shape, the presence of lymphadenopathy, and pericardial effusion are highly indicative of thymic carcinomas.
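As a brief illustration of the area under the receiver operating characteristic curve reported above, the sketch below computes AUC from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function name, labels, and scores are hypothetical and unrelated to the study's data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for six cases (1 = positive class).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(toy_labels, toy_scores)
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a small illustration; a sort-based rank computation would be the usual choice for larger data.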
6
Hosni M, Abnane I, Idri A, Carrillo de Gea JM, Fernández Alemán JL. Reviewing ensemble classification methods in breast cancer. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 177:89-112. [PMID: 31319964 DOI: 10.1016/j.cmpb.2019.05.019] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 02/07/2019] [Revised: 05/16/2019] [Accepted: 05/18/2019] [Indexed: 05/09/2023]
Abstract
CONTEXT Ensemble methods consist of combining more than one single technique to solve the same task. This approach was designed to overcome the weaknesses of single techniques and consolidate their strengths. Ensemble methods are now widely used to carry out prediction tasks (e.g. classification and regression) in several fields, including that of bioinformatics. Researchers have particularly begun to employ ensemble techniques to improve research into breast cancer, as this is the most frequent type of cancer and accounts for most of the deaths among women. OBJECTIVE AND METHOD The goal of this study is to analyse the state of the art in ensemble classification methods when applied to breast cancer as regards 9 aspects: publication venues, medical tasks tackled, empirical and research types adopted, types of ensembles proposed, single techniques used to construct the ensembles, validation framework adopted to evaluate the proposed ensembles, tools used to build the ensembles, and optimization methods used for the single techniques. This paper was undertaken as a systematic mapping study. RESULTS A total of 193 papers that were published from the year 2000 onwards, were selected from four online databases: IEEE Xplore, ACM digital library, Scopus and PubMed. This study found that of the six medical tasks that exist, the diagnosis medical task was that most frequently researched, and that the experiment-based empirical type and evaluation-based research type were the most dominant approaches adopted in the selected studies. The homogeneous type was that most widely used to perform the classification task. With regard to single techniques, this mapping study found that decision trees, support vector machines and artificial neural networks were those most frequently adopted to build ensemble classifiers. 
In the case of the evaluation framework, the Wisconsin Breast Cancer dataset was the most frequently used by researchers to perform their experiments, while the most noticeable validation method was k-fold cross-validation. Several tools are available to perform experiments related to ensemble classification methods, such as Weka and R Software. Few researchers took into account the optimisation of the single technique of which their proposed ensemble was composed, while the grid search method was that most frequently adopted to tune the parameter settings of a single classifier. CONCLUSION This paper reports an in-depth study of the application of ensemble methods as regards breast cancer. Our results show that there are several gaps and issues and we, therefore, provide researchers in the field of breast cancer research with recommendations. Moreover, after analysing the papers found in this systematic mapping study, we discovered that the majority report positive results concerning the accuracy of ensemble classifiers when compared to the single classifiers. In order to aggregate the evidence reported in literature, it will, therefore, be necessary to perform a systematic literature review and meta-analysis in which an in-depth analysis could be conducted so as to confirm the superiority of ensemble classifiers over the classical techniques.
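The two practices the review singles out, k-fold cross-validation and grid search, can be combined as in the following sketch. It is an assumption-laden toy: the helper names, the contiguous unshuffled folds, and the threshold "classifier" are hypothetical and do not come from any of the reviewed studies.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def grid_search(
    score_fn: Callable[[Dict[str, float], List[int], List[int]], float],
    grid: Dict[str, Sequence[float]],
    n_samples: int,
    k: int,
) -> Tuple[Dict[str, float], float]:
    """Return the parameter setting with the best mean score over k folds."""
    best_params: Dict[str, float] = {}
    best_score = float("-inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = [
            score_fn(params, tr, te) for tr, te in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy 1-D data: a hypothetical rule predicts class 1 when x > threshold.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.35, 0.65]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]


def accuracy(params, train_idx, test_idx):
    # A fixed threshold rule needs no fitting, so train_idx is unused here.
    hits = sum((xs[i] > params["threshold"]) == bool(ys[i]) for i in test_idx)
    return hits / len(test_idx)


best, score = grid_search(accuracy, {"threshold": [0.25, 0.5, 0.75]}, len(xs), k=5)
```

In practice the folds would normally be shuffled or stratified, and the scoring function would fit a model on the training indices before evaluating it on the held-out fold.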
Affiliation(s)
- Mohamed Hosni
  - Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- Ibtissam Abnane
  - Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- Ali Idri
  - Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- Juan M Carrillo de Gea
  - Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Spain.
7
Dereli O, Oğuz C, Gönen M. Path2Surv: Pathway/gene set-based survival analysis using multiple kernel learning. Bioinformatics 2019; 35:5137-5145. [DOI: 10.1093/bioinformatics/btz446] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/02/2018] [Revised: 05/17/2019] [Accepted: 05/25/2019] [Indexed: 12/18/2022] Open
Abstract
Motivation
Survival analysis methods that integrate pathways/gene sets into their learning model could identify molecular mechanisms that determine survival characteristics of patients. Rather than first picking the predictive pathways/gene sets from a given collection and then training a predictive model on the subset of genomic features mapped to these selected pathways/gene sets, we developed a novel machine learning algorithm (Path2Surv) that conjointly performs these two steps using multiple kernel learning.
Results
We extensively tested our Path2Surv algorithm on 7655 patients from 20 cancer types using cancer-specific pathway/gene set collections and gene expression profiles of these patients. Path2Surv statistically significantly outperformed survival random forest (RF) on 12 out of 20 datasets and obtained comparable predictive performance against survival support vector machine (SVM) using significantly fewer gene expression features (i.e. less than 10% of what survival RF and survival SVM used).
Availability and implementation
Our implementations of survival SVM and Path2Surv algorithms in R are available at https://github.com/mehmetgonen/path2surv together with the scripts that replicate the reported experiments.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Onur Dereli
  - Graduate School of Sciences and Engineering, İstanbul 34450, Turkey
- Ceyda Oğuz
  - Department of Industrial Engineering, College of Engineering, İstanbul 34450, Turkey
- Mehmet Gönen
  - Department of Industrial Engineering, College of Engineering, İstanbul 34450, Turkey
  - School of Medicine, Koç University, İstanbul 34450, Turkey
  - Department of Biomedical Engineering, School of Medicine, Oregon Health & Science University, Portland, OR 97239, USA
8
Sun J, Herazo-Maya JD, Wang JL, Kaminski N, Zhao H. LCox: a tool for selecting genes related to survival outcomes using longitudinal gene expression data. Stat Appl Genet Mol Biol 2019; 18:/j/sagmb.ahead-of-print/sagmb-2017-0060/sagmb-2017-0060.xml. [PMID: 30759070 DOI: 10.1515/sagmb-2017-0060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Longitudinal genomics data and survival outcomes are common in biomedical studies, where the genomics data are often of high dimension. It is of great interest to select informative longitudinal biomarkers (e.g. genes) related to the survival outcome. In this paper, we develop a computationally efficient tool, LCox, for selecting informative biomarkers related to the survival outcome using the longitudinal genomics data. LCox can detect different forms of dependence between the longitudinal biomarkers and the survival outcome. We show through extensive simulation studies that LCox has improved performance compared to existing methods. In addition, by applying LCox to a dataset of patients with idiopathic pulmonary fibrosis, we are able to identify biologically meaningful genes while all other methods fail to make any discovery. An R package to perform LCox is freely available at https://CRAN.R-project.org/package=LCox.
Affiliation(s)
- Jiehuan Sun
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Jose D Herazo-Maya
  - Internal Medicine: Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT, USA
- Jane-Ling Wang
  - Department of Statistics, University of California, Davis, CA, USA
- Naftali Kaminski
  - Internal Medicine: Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT, USA
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, 60 College Street, New Haven, CT 06510, USA
9
Wu Q, Wang H, Yan X, Liu X. MapReduce-based adaptive random forest algorithm for multi-label classification. Neural Comput Appl 2018. [DOI: 10.1007/s00521-018-3900-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/28/2022]
10
Zhang L, Kim I. Semiparametric Bayesian kernel survival model for evaluating pathway effects. Stat Methods Med Res 2018; 28:3301-3317. [PMID: 30289021 DOI: 10.1177/0962280218797360] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/07/2023]
Abstract
Massive amounts of high-dimensional data have been accumulated over the past two decades, which has fostered increasing interest in identifying gene pathways related to certain biological processes. In particular, since pathway-based analysis has the ability to detect subtle changes of differentially expressed genes that could be missed when using gene-based analysis, detecting the gene pathways that regulate certain diseases can provide new strategies for medical procedures and new targets for drug discovery. Limited work has been carried out, primarily in regression settings, to study the effects of pathways on survival outcomes. Motivated by a breast cancer gene-pathway data set, which exhibits the "small n, large p" characteristic, we propose a semiparametric Bayesian kernel survival model (s-BKSurv) to study the effects of both clinical covariates and gene expression levels within a pathway on survival time. We model the unknown high-dimensional functions of pathways via a Gaussian kernel machine to allow for the possibility that genes within the same pathway interact with each other. To address the multiple comparisons problem under a full Bayesian setting, we propose a similarity-dependent procedure based on the Bayes factor to control the family-wise error rate. We demonstrate the superior performance of our approach under various simulation settings and on pathway data.
Affiliation(s)
- Lin Zhang
  - Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
- Inyoung Kim
  - Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
11
Wang W, Liu W. Integration of gene interaction information into a reweighted random survival forest approach for accurate survival prediction and survival biomarker discovery. Sci Rep 2018; 8:13202. [PMID: 30181543 PMCID: PMC6123437 DOI: 10.1038/s41598-018-31497-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 11/10/2017] [Accepted: 08/20/2018] [Indexed: 02/05/2023] Open
Abstract
Accurately predicting patient risk and identifying survival biomarkers are two important tasks in survival analysis. For the emerging high-throughput gene expression data, random survival forest (RSF) is attracting more and more attention as it not only shows excellent performance on survival prediction problems with high-dimensional variables, but also is capable of identifying important variables according to variable importance automatically calculated within the algorithm. However, RSF still suffers from some problems such as limited predictive accuracy on independent datasets and limited biological interpretation of survival biomarkers. In this study, we integrated gene interaction information into a Reweighted RSF model (RRSF) to improve predictive accuracy and identify biologically meaningful survival markers. We applied RRSF to the prediction of patients with glioblastoma multiforme (GBM) and esophageal squamous cell carcinoma (ESCC). With a reconstructed global pathway network and an mRNA-lncRNA co-expression network as the prior gene interaction information, RRSF showed better overall predictive performance than RSF on three GBM and two ESCC datasets. In addition, RRSF identified a two-gene and three-lncRNA signature, which showed robust prognostic values and had high biological relevance to the development of GBM and ESCC, respectively.
Affiliation(s)
- Wei Wang
  - Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
- Wei Liu
  - Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
  - The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, 515041, China
12
Huang Z, Huang C, Xie J, Ma J, Cao G, Huang Q, Shen B, Byers Kraus V, Pei F. Analysis of a large data set to identify predictors of blood transfusion in primary total hip and knee arthroplasty. Transfusion 2018; 58:1855-1862. [PMID: 30145838 DOI: 10.1111/trf.14783] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Received: 10/16/2017] [Revised: 03/05/2018] [Accepted: 03/05/2018] [Indexed: 02/05/2023]
Abstract
BACKGROUND The aim of this study was to identify the predictors of need for allogenic blood transfusion (ALBT) in primary lower limb total joint arthroplasty (TJA). STUDY DESIGN AND METHODS This study utilized a large dataset of 15,187 patients undergoing primary unilateral TJA. Risk factors and demographic information were extracted from the electronic health record. A predictive model was developed by both a random forest (RF) algorithm and logistic regression (LR). The area under the receiver operating characteristic curve (AUC-ROC) was used to compare the accuracy of the two methods. RESULTS The rate of ALBT was 18.9% in total. Patient-related factors associated with higher risk of an ALBT included female sex, American Society of Anesthesiologists (ASA) II, ASA III, and ASA IV. Surgery-related risk factors for ALBT were operative time, drain use, and amount of intraoperative blood loss. Higher preoperative hemoglobin and tranexamic acid use were associated with decreased risk for ALBT. The RF model had a better predictive accuracy (area under the curve [AUC] 0.84) than the LR model (AUC, 0.77; p < 0.001). CONCLUSION The risk factors identified in the current study can provide specific, personalized perioperative ALBT risk assessment for a patient considering lower limb TJA. Furthermore, the predictive accuracy of the RF algorithm was significantly higher than that of LR, making it a potential tool for future personalized preoperative prediction of risk for perioperative ALBT.
Affiliation(s)
- ZeYu Huang
  - Department of Orthopedic Surgery, West China Hospital, West China Medical School, Sichuan University
- Cheng Huang
  - College of Cybersecurity, Chengdu, Sichuan Province, People's Republic of China
- JinWei Xie
  - Department of Orthopedic Surgery, West China Hospital, West China Medical School, Sichuan University
- Jun Ma
  - Department of Orthopedic Surgery, West China Hospital, West China Medical School, Sichuan University
- GuoRui Cao
  - Department of Orthopedic Surgery, West China Hospital, West China Medical School, Sichuan University
- Qiang Huang
  - Department of Orthopedic Surgery, West China Hospital, West China Medical School, Sichuan University
- Bin Shen
  - Department of Orthopedic Surgery, West China Hospital, West China Medical School, Sichuan University
- Virginia Byers Kraus
  - Duke Molecular Physiology Institute, Durham, North Carolina; Division of Rheumatology, Department of Medicine, Duke University School of Medicine, Duke University, Durham, North Carolina
- FuXing Pei
  - Department of Orthopedic Surgery, West China Hospital, West China Medical School, Sichuan University
13
Abstract
In modeling censored data, survival forest models are a competitive nonparametric alternative to traditional parametric or semiparametric models when the functional forms are possibly misspecified or the underlying assumptions are violated. In this work, we propose a survival forest approach with trees constructed using novel pseudo-R² splitting rules. By studying well-known benchmark data sets, we find that the proposed model generally outperforms popular survival models such as random survival forest with different splitting rules, the Cox proportional hazards model, and the generalized boosted model in terms of the C-index metric.
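For readers unfamiliar with the C-index used as the comparison metric here, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is a simplified illustration (pairs with tied times are skipped), and the function name and the toy times, event flags, and risk scores are hypothetical.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when the subject with the shorter time had
    an observed event. It is concordant when that subject also has the
    higher predicted risk; tied risks count as one half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times, event flags (1 = event, 0 = censored), risk scores.
toy_times = [2.0, 4.0, 6.0, 8.0]
toy_events = [1, 0, 1, 1]
toy_risks = [0.9, 0.3, 0.1, 0.5]
c_index = concordance_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of risks against observed event order; the censored subject in the toy data contributes only as the longer-surviving member of a pair.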
Affiliation(s)
- Hong Wang
  - School of Mathematics and Statistics, Central South University, Changsha, China
- Xiaolin Chen
  - School of Statistics, Qufu Normal University, Qufu, China
- Gang Li
  - Department of Biostatistics, School of Public Health, University of California at Los Angeles, Los Angeles, California
14
|
Gong X, Hu M, Zhao L. Big Data Toolsets to Pharmacometrics: Application of Machine Learning for Time-to-Event Analysis. Clin Transl Sci 2018. [PMID: 29536640 PMCID: PMC5944589 DOI: 10.1111/cts.12541] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 02/01/2023]
Abstract
Additional value can be potentially created by applying big data tools to address pharmacometric problems. The performances of machine learning (ML) methods and the Cox regression model were evaluated based on simulated time‐to‐event data synthesized under various preset scenarios, i.e., with linear vs. nonlinear and dependent vs. independent predictors in the proportional hazard function, or with high‐dimensional data featured by a large number of predictor variables. Our results showed that ML‐based methods outperformed the Cox model in prediction performance as assessed by concordance index and in identifying the preset influential variables for high‐dimensional data. The prediction performances of ML‐based methods are also less sensitive to data size and censoring rates than the Cox regression model. In conclusion, ML‐based methods provide a powerful tool for time‐to‐event analysis, with a built‐in capacity for high‐dimensional data and better performance when the predictor variables assume nonlinear relationships in the hazard function.
Affiliation(s)
- Xiajing Gong
  - Division of Quantitative Methods and Modeling, Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, Food and Drug Administration, Silver Spring, Maryland, USA
- Meng Hu
  - Division of Quantitative Methods and Modeling, Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, Food and Drug Administration, Silver Spring, Maryland, USA
- Liang Zhao
  - Division of Quantitative Methods and Modeling, Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, Food and Drug Administration, Silver Spring, Maryland, USA
|
15
|
Ow GS, Tang Z, Kuznetsov VA. Big data and computational biology strategy for personalized prognosis. Oncotarget 2018; 7:40200-40220. [PMID: 27229533 PMCID: PMC5130003 DOI: 10.18632/oncotarget.9571] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 07/28/2015] [Accepted: 05/01/2016] [Indexed: 01/05/2023]
Abstract
The era of big data and precision medicine has led to the accumulation of massive datasets of gene expression data and clinical information of patients. For a new patient, we propose that identification of a highly similar reference patient from an existing patient database via similarity matching of both clinical and expression data could be useful for predicting the prognostic risk or therapeutic efficacy. Here, we propose a novel methodology to predict disease/treatment outcome via analysis of the similarity between any pair of patients who are each characterized by a certain set of pre-defined biological variables (biomarkers or clinical features) represented initially as a prognostic binary variable vector (PBVV) and subsequently transformed to a prognostic signature vector (PSV). Our analyses revealed that Euclidean distance, rather than correlation distance, was effective in defining an unbiased similarity measure calculated between two PSVs. We applied our methods to high-grade serous ovarian cancer (HGSC) based on a 36-mRNA predictor that was previously shown to stratify patients into 3 distinct prognostic subgroups. We found that a patient's age, when converted into a binary variable, was positively correlated with the overall risk of succumbing to the disease. When applied to an independent testing dataset, the inclusion of age in the molecular predictor provided a more robust personalized prognosis of overall survival correlated with the therapeutic response of HGSC and provided benefit for treatment targeting of the tumors in HGSC patients. Finally, our method can be generalized and implemented in many other diseases to accurately predict personalized patients’ outcomes.
Affiliation(s)
- Vladimir A Kuznetsov
  - Bioinformatics Institute, Singapore 138671
  - School of Computer Engineering, Nanyang Technological University, Singapore 639798
|
16
|
Pang H, Wang X. Statistical aspect of translational and correlative studies in clinical trials. Chin Clin Oncol 2017; 5:11. [PMID: 26932435 DOI: 10.3978/j.issn.2304-3865.2014.07.04] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/05/2014] [Accepted: 06/18/2014] [Indexed: 01/07/2023]
Abstract
In this article, we describe statistical issues related to the conduct of translational and correlative studies in cancer clinical trials. In the era of personalized medicine, proper biomarker discovery and validation is crucial for producing groundbreaking research. In order to carry out the framework outlined in this article, a team effort between oncologists and statisticians is the key for success.
Affiliation(s)
- Herbert Pang
  - School of Public Health, Li Ka Shing Faculty of Medicine, Pok Fu Lam, Hong Kong SAR, China.
- Xiaofei Wang
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA.
|
17
|
Pang H, Kim I, Zhao H. Random Effects Model for Multiple Pathway Analysis with Applications to Type II Diabetes Microarray Data. STATISTICS IN BIOSCIENCES 2015; 7:167-186. [PMID: 26640601 PMCID: PMC4666561 DOI: 10.1007/s12561-014-9109-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 01/21/2023]
Abstract
Close to three percent of the world's population suffer from diabetes. Despite the range of treatment options available for diabetes patients, not all patients benefit from them. Investigating how different pathways correlate with the phenotype of interest may help unravel novel drug targets and discover a possible cure. Many pathway-based methods have been developed to incorporate biological knowledge into the study of microarray data. Most of these methods can only analyze individual pathways but cannot deal with two or more pathways in a model-based framework. This represents a serious limitation because, like genes, individual pathways do not work in isolation, and joint modeling may enable researchers to uncover patterns not seen in individual pathway-based analysis. In this paper, we propose a random effects model to analyze two or more pathways. We also derive score test statistics for significance of pathway effects. We apply our method to a microarray study of Type II diabetes. Our method may elucidate how pathways crosstalk with each other and facilitate the investigation of pathway crosstalk. Further hypotheses on the biological mechanisms underlying the disease and traits of interest may be generated and tested based on this method.
Affiliation(s)
- Herbert Pang
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina 27705, U.S.A. Tel.: +919-681-5011, Fax: +919-668-5888
- Inyoung Kim
  - Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, Virginia 24061, U.S.A. Tel.: +540-231-5366, Fax: +540-231-3863
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, and Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520, U.S.A. Tel.: +203-785-6271, Fax: +203-785-6912
|
18
|
Jing GJ, Zhang Z, Wang HQ, Zheng HM. Mining gene link information for survival pathway hunting. IET Syst Biol 2015; 9:147-54. [PMID: 26243831 DOI: 10.1049/iet-syb.2014.0048] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/20/2022]
Abstract
This study proposes a gene link-based method for survival time-related pathway hunting. In this method, the authors incorporate gene link information to estimate how a pathway is associated with cancer patient's survival time. Specifically, a gene link-based Cox proportional hazard model (Link-Cox) is established, in which two linked genes are considered together to represent a link variable and the association of the link with survival time is assessed using Cox proportional hazard model. On the basis of the Link-Cox model, the authors formulate a new statistic for measuring the association of a pathway with survival time of cancer patients, referred to as pathway survival score (PSS), by summarising survival significance over all the gene links in the pathway, and devise a permutation test to test the significance of an observed PSS. To evaluate the proposed method, the authors applied it to simulation data and two publicly available real-world gene expression data sets. Extensive comparisons with previous methods show the effectiveness and efficiency of the proposed method for survival pathway hunting.
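The permutation test mentioned above can be sketched generically in Python. This is an illustrative outline only, not the Link-Cox implementation: `permutation_p_value` and the `statistic` callable are hypothetical names, and the statistic stands in for any pathway-level score (such as a PSS) computed from features and per-patient outcomes.

```python
import random


def permutation_p_value(statistic, features, outcomes, n_perm=1000, seed=0):
    """One-sided permutation p-value for an association statistic.

    statistic -- callable(features, outcomes) -> float, larger meaning a
                 stronger association (hypothetical stand-in for a score
                 such as a pathway survival score)
    features  -- per-patient predictors, kept in their original order
    outcomes  -- per-patient outcomes, e.g. (time, event) tuples; each
                 patient's outcome is shuffled as a single unit
    """
    rng = random.Random(seed)
    observed = statistic(features, outcomes)
    shuffled = list(outcomes)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Shuffling outcomes breaks the feature-outcome link and so
        # simulates the null hypothesis of no association.
        rng.shuffle(shuffled)
        if statistic(features, shuffled) >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed arrangement itself, so the
    # estimated p-value can never be exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)
```

Because only the outcome labels are permuted, the correlation structure among the genes in a pathway is preserved in every permuted data set.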
Affiliation(s)
- Gao-Jian Jing
  - School of Mechanical and Automotive Engineering, Hefei University of Technology, Hefei, People's Republic of China
- Zirui Zhang
  - School of Mechanical and Automotive Engineering, Hefei University of Technology, Hefei, People's Republic of China
- Hong-Qiang Wang
  - Machine Intelligence & Computational Biology Lab, Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui 230031, People's Republic of China
- Hong-Mei Zheng
  - School of Mechanical and Automotive Engineering, Hefei University of Technology, Hefei, People's Republic of China.
|
19
|
Ye S, Dawson JA, Kendziorski C. Extending information retrieval methods to personalized genomic-based studies of disease. Cancer Inform 2015; 13:85-95. [PMID: 25733795 PMCID: PMC4332045 DOI: 10.4137/cin.s16354] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/07/2014] [Revised: 10/22/2014] [Accepted: 10/23/2014] [Indexed: 01/30/2023]
Abstract
Genomic-based studies of disease now involve diverse types of data collected on large groups of patients. A major challenge facing statistical scientists is how best to combine the data, extract important features, and comprehensively characterize the ways in which they affect an individual’s disease course and likelihood of response to treatment. We have developed a survival-supervised latent Dirichlet allocation (survLDA) modeling framework to address these challenges. Latent Dirichlet allocation (LDA) models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by considering each patient as a “document” with “text” detailing his/her clinical events and genomic state. We then further extend the framework to allow for supervision by a time-to-event response. The model enables the efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas ovarian project identifies informative patient subgroups showing differential response to treatment, and validation in an independent cohort demonstrates the potential for patient-specific inference.
Affiliation(s)
- Shuyun Ye
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
- John A Dawson
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
- Christina Kendziorski
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
|
20
|
Pang H, Zhao H. Stratified pathway analysis to identify gene sets associated with oral contraceptive use and breast cancer. Cancer Inform 2014; 13:73-8. [PMID: 25574128 PMCID: PMC4263464 DOI: 10.4137/cin.s13973] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/15/2014] [Revised: 08/15/2014] [Accepted: 08/19/2014] [Indexed: 01/02/2023]
Abstract
Cancer biomarker discovery can facilitate drug development, improve staging of patients, and predict patient prognosis. Because cancer is the result of many interacting genes, analysis based on a set of genes with related biological functions or pathways may be more informative than single gene-based analysis for cancer biomarker discovery. The relevant pathways thus identified may help characterize different aspects of molecular phenotypes related to the tumor. Although it is well known that cancer patients may respond to the same treatment differently because of clinical variables and variation of molecular phenotypes, this patient heterogeneity has not been explicitly considered in pathway analysis in the literature. We hypothesize that combining pathway and patient clinical information can more effectively identify relevant pathways pertinent to specific patient subgroups, leading to better diagnosis and treatment. In this article, we propose to perform stratified pathway analysis based on clinical information from patients. In contrast to analysis using all the patients, this more focused analysis has the potential to reveal subgroup-specific pathways that may lead to more biological insights into disease etiology and treatment response. As an illustration, the power of our approach is demonstrated through its application to a breast cancer dataset in which the patients are stratified according to their oral contraceptive use.
Affiliation(s)
- Herbert Pang
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
  - School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
|
21
|
Dellinger AE, Nixon AB, Pang H. Integrative Pathway Analysis Using Graph-Based Learning with Applications to TCGA Colon and Ovarian Data. Cancer Inform 2014; 13:1-9. [PMID: 25125969 PMCID: PMC4125381 DOI: 10.4137/cin.s13634] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 11/12/2013] [Revised: 03/17/2014] [Accepted: 03/18/2014] [Indexed: 12/15/2022]
Abstract
Recent method development has included multi-dimensional genomic data algorithms because such methods have more accurately predicted clinical phenotypes related to disease. This study is the first to conduct an integrative genomic pathway-based analysis with a graph-based learning algorithm. The methodology of this analysis, graph-based semi-supervised learning, detects pathways that improve prediction of a dichotomous variable, which in this study is cancer stage. This analysis integrates genome-level gene expression, methylation, and single nucleotide polymorphism (SNP) data in serous cystadenocarcinoma (OV) and colon adenocarcinoma (COAD). The top 10 ranked predictive pathways in COAD and OV were biologically relevant to their respective cancer stages and significantly enhanced prediction accuracy and area under the ROC curve (AUC) when compared to single data-type analyses. This method is an effective way to simultaneously predict binary clinical phenotypes and discover their biological mechanisms.
Affiliation(s)
- Andrew E Dellinger
  - Department of Mathematics and Statistics, Elon University, Elon, NC, USA
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
- Andrew B Nixon
  - Department of Medicine, Division of Medical Oncology, Duke University School of Medicine, Durham, NC, USA
- Herbert Pang
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
  - School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
|
22
|
Pang H, Jung SH. Sample size considerations of prediction-validation methods in high-dimensional data for survival outcomes. Genet Epidemiol 2013; 37:276-82. [PMID: 23471879 DOI: 10.1002/gepi.21721] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 09/24/2012] [Revised: 01/21/2013] [Accepted: 02/09/2013] [Indexed: 11/09/2022]
Abstract
A variety of prediction methods are used to relate high-dimensional genome data with a clinical outcome using a prediction model. Once a prediction model is developed from a data set, it should be validated using a resampling method or an independent data set. Although the existing prediction methods have been intensively evaluated by many investigators, there has not been a comprehensive study investigating the performance of the validation methods, especially with a survival clinical outcome. Understanding the properties of the various validation methods can allow researchers to perform more powerful validations while controlling for type I error. In addition, sample size calculation strategy based on these validation methods is lacking. We conduct extensive simulations to examine the statistical properties of these validation strategies. In both simulations and a real data example, we have found that 10-fold cross-validation with permutation gave the best power while controlling type I error close to the nominal level. Based on this, we have also developed a sample size calculation method that will be used to design a validation study with a user-chosen combination of prediction. Microarray and genome-wide association studies data are used as illustrations. The power calculation method in this presentation can be used for the design of any biomedical studies involving high-dimensional data and survival outcomes.
Affiliation(s)
- Herbert Pang
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina, USA
|
23
|
Chen X, Ishwaran H. Pathway hunting by random survival forests. Bioinformatics 2013; 29:99-105. [PMID: 23129299 PMCID: PMC3530909 DOI: 10.1093/bioinformatics/bts643] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 03/20/2012] [Revised: 07/18/2012] [Accepted: 10/17/2012] [Indexed: 01/22/2023]
Abstract
MOTIVATION Pathway or gene set analysis has been widely applied to genomic data. Many current pathway testing methods use univariate test statistics calculated from individual genomic markers, which ignores the correlations and interactions between candidate markers. Random forests-based pathway analysis is a promising approach for incorporating complex correlation and interaction patterns, but one limitation of previous approaches is that pathways have been considered separately, thus pathway cross-talk information was not considered. RESULTS In this article, we develop a new pathway hunting algorithm for survival outcomes using random survival forests, which prioritize important pathways by accounting for gene correlation and genomic interactions. We show that the proposed method performs favourably compared with five popular pathway testing methods using both synthetic and real data. We find that the proposed methodology provides an efficient and powerful pathway modelling framework for high-dimensional genomic data. AVAILABILITY The R code for the analysis used in this article is available upon request.
Affiliation(s)
- Xi Chen
  - Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA.
|
24
|
Pang H, George SL, Hui K, Tong T. Gene selection using iterative feature elimination random forests for survival outcomes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:1422-31. [PMID: 22547432 PMCID: PMC3495190 DOI: 10.1109/tcbb.2012.63] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 05/20/2023]
Abstract
Although many feature selection methods for classification have been developed, there is a need to identify genes in high-dimensional data with censored survival outcomes. Traditional methods for gene selection in classification problems have several drawbacks. First, the majority of the gene selection approaches for classification are single-gene based. Second, many of the gene selection procedures are not embedded within the algorithm itself. The technique of random forests has been found to perform well in high-dimensional data settings with survival outcomes. It also has an embedded feature to identify variables of importance. Therefore, it is an ideal candidate for gene selection in high-dimensional data with survival outcomes. In this paper, we develop a novel method based on random forests to identify a set of prognostic genes. We compare our method with several machine learning methods and various node split criteria using several real data sets. Our method performed well in both simulations and real data analysis. Additionally, we have shown the advantages of our approach over single-gene-based approaches. Our method incorporates multivariate correlations in microarray data for survival outcomes. The described method allows us to better utilize the information available from microarray data with survival outcomes.
Affiliation(s)
- Herbert Pang
  - Biostatistics and Bioinformatics Department, Duke University School of Medicine, Durham, NC 27705.
- Stephen L. George
  - Biostatistics and Bioinformatics Department, Duke University School of Medicine, Durham, NC 27705.
- Ken Hui
  - School of Medicine, Yale University, New Haven, CT 06510.
- Tiejun Tong
  - Mathematics Department, Hong Kong Baptist University, Hong Kong SAR, China.
|
25
|
Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics 2012; 99:323-9. [PMID: 22546560 PMCID: PMC3387489 DOI: 10.1016/j.ygeno.2012.04.003] [Citation(s) in RCA: 380] [Impact Index Per Article: 31.7] [Received: 02/09/2012] [Revised: 04/11/2012] [Accepted: 04/14/2012] [Indexed: 11/25/2022]
Abstract
Random forests (RF) is a popular tree-based ensemble machine learning tool that is highly data adaptive, applies to "large p, small n" problems, and is able to account for correlation as well as interactions among features. This makes RF particularly appealing for high-dimensional genomic data analysis. In this article, we systematically review the applications and recent progresses of RF for genomic data, including prediction and classification, variable selection, pathway analysis, genetic association and epistasis detection, and unsupervised learning.
Affiliation(s)
- Xi Chen
  - Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA.
|
26
|
Development and validation of a quantitative real-time polymerase chain reaction classifier for lung cancer prognosis. J Thorac Oncol 2011; 6:1481-7. [PMID: 21792073 DOI: 10.1097/jto.0b013e31822918bd] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Indexed: 12/28/2022]
Abstract
INTRODUCTION This prospective study aimed to develop a robust and clinically applicable method to identify patients with high-risk early-stage lung cancer and then to validate this method for use in future translational studies. METHODS Three published Affymetrix microarray data sets representing 680 primary tumors were used in the survival-related gene selection procedure using clustering, Cox model, and random survival forest analysis. A final set of 91 genes was selected and tested as a predictor of survival using a quantitative real-time polymerase chain reaction-based assay using an independent cohort of 101 lung adenocarcinomas. RESULTS The random survival forest model built from 91 genes in the training set predicted patient survival in an independent cohort of 101 lung adenocarcinomas, with a prediction error rate of 26.6%. The mortality risk index was significantly related to survival (Cox model p < 0.00001) and separated all patients into low-, medium-, and high-risk groups (hazard ratio = 1.00, 2.82, 4.42). The mortality risk index was also related to survival in stage 1 patients (Cox model p = 0.001), separating patients into low-, medium-, and high-risk groups (hazard ratio = 1.00, 3.29, 3.77). CONCLUSIONS The development and validation of this robust quantitative real-time polymerase chain reaction platform allows prediction of patient survival with early-stage lung cancer. Utilization will now allow investigators to evaluate it prospectively by incorporation into new clinical trials with the goal of personalized treatment of patients with lung cancer and improving patient survival.
|
27
|
Wood DJ, Buttar D, Cumming JG, Davis AM, Norinder U, Rodgers SL. Automated QSAR with a Hierarchy of Global and Local Models. Mol Inform 2011; 30:960-72. [DOI: 10.1002/minf.201100107] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Received: 07/12/2011] [Accepted: 10/13/2011] [Indexed: 11/06/2022]
|
28
|
Porzelius C, Johannes M, Binder H, Beißbarth T. Leveraging external knowledge on molecular interactions in classification methods for risk prediction of patients. Biom J 2011; 53:190-201. [DOI: 10.1002/bimj.201000155] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 07/30/2010] [Revised: 10/22/2010] [Accepted: 10/29/2010] [Indexed: 12/17/2022]
|
29
|
Abstract
In recent years, several association analysis methods for case-control studies have been developed. However, as we turn towards the identification of single nucleotide polymorphisms (SNPs) for prognosis, there is a need to develop methods for the identification of SNPs in high dimensional data with survival outcomes. Traditional methods for the identification of SNPs have some drawbacks. First, the majority of the approaches for case-control studies are based on single SNPs. Second, SNPs that are identified without incorporating biological knowledge are more difficult to interpret. Random forests has been found to perform well in gene expression analysis with survival outcomes. In this paper we present the first pathway-based method to correlate SNP with survival outcomes using a machine learning algorithm. We illustrate the application of pathway-based analysis of SNPs predictive of survival with a data set of 192 multiple myeloma patients genotyped for 500,000 SNPs. We also present simulation studies that show that the random forests technique with log-rank score split criterion outperforms several other machine learning algorithms. Thus, pathway-based survival analysis using machine learning tools represents a promising approach for the identification of biologically meaningful SNPs associated with disease.
|