1
Yang Y, McMahan CS, Wang YB, Ouyang Y. Estimation of l0 Norm Penalized Models: A Statistical Treatment. Comput Stat Data Anal 2024; 192:107902. [PMID: 38222104 PMCID: PMC10785287 DOI: 10.1016/j.csda.2023.107902]
Abstract
Fitting penalized models for the purpose of merging the estimation and model selection problems has become commonplace in statistical practice. Of the various regularization strategies that can be leveraged to this end, the use of the l0 norm to penalize parameter estimation poses the most daunting model fitting task. In fact, this particular strategy requires an end user to solve a non-convex NP-hard optimization problem regardless of the underlying data model. For this reason, the use of the l0 norm as a regularization strategy has been woefully underutilized. To obviate this difficulty, a strategy to solve such problems that is generally accessible to the statistical community is developed. The approach can be adopted to solve l0 norm penalized problems across a very broad class of models, can be implemented using existing software, and is computationally efficient. The performance of the method is demonstrated through in-depth numerical experiments and through using it to analyze several prototypical data sets.
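For orientation, the generic form of an l0 norm penalized estimation problem can be written as below. This is the standard textbook formulation, not notation taken from the paper; the loss $L$, the tuning parameter $\lambda$, and the dimension $p$ are symbols assumed here for illustration.

```latex
\hat{\beta}_{\lambda} \in \arg\min_{\beta \in \mathbb{R}^{p}}
  \bigl\{ L(\beta) + \lambda \,\lVert \beta \rVert_{0} \bigr\},
\qquad
\lVert \beta \rVert_{0} = \sum_{j=1}^{p} \mathbf{1}\{\beta_{j} \neq 0\}.
```

Because the penalty counts nonzero coefficients, the objective is discontinuous and non-convex in $\beta$, which is the source of the computational difficulty the abstract describes.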
Affiliation(s)
- Yuan Yang
- School of Mathematical and Statistical Sciences, Clemson University, Clemson, 29634, SC, U.S.A
- Christopher S McMahan
- School of Mathematical and Statistical Sciences, Clemson University, Clemson, 29634, SC, U.S.A
- Yu-Bo Wang
- School of Mathematical and Statistical Sciences, Clemson University, Clemson, 29634, SC, U.S.A
- Yuyuan Ouyang
- School of Mathematical and Statistical Sciences, Clemson University, Clemson, 29634, SC, U.S.A
2
Huang TJ, Luedtke A, McKeague IW. Efficient estimation of the maximal association between multiple predictors and a survival outcome. Ann Stat 2023; 51:1965-1988. [PMID: 38405375 PMCID: PMC10888526 DOI: 10.1214/23-aos2313]
Abstract
This paper develops a new approach to post-selection inference for screening high-dimensional predictors of survival outcomes. Post-selection inference for right-censored outcome data has been investigated in the literature, but much remains to be done to make the methods both reliable and computationally scalable in high dimensions. Machine learning tools are commonly used to provide predictions of survival outcomes, but the estimated effect of a selected predictor suffers from confirmation bias unless the selection is taken into account. The new approach involves the construction of semi-parametrically efficient estimators of the linear association between the predictors and the survival outcome, which are used to build a test statistic for detecting the presence of an association between any of the predictors and the outcome. Further, a stabilization technique reminiscent of bagging allows a normal calibration for the resulting test statistic, which enables the construction of confidence intervals for the maximal association between predictors and the outcome and also greatly reduces computational cost. Theoretical results show that this testing procedure is valid even when the number of predictors grows superpolynomially with sample size, and our simulations support this asymptotic guarantee at moderate sample sizes. The new approach is applied to the problem of identifying patterns in viral gene expression associated with the potency of an antiviral drug.
Affiliation(s)
- Tzu-Jung Huang
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center
- Alex Luedtke
- Department of Statistics, University of Washington
3
Kim T, Kim SJ, Lee BY, Cho HJ, Sa BG, Ryu IH, Kim JK, Lee IS, Han E, Kim H, Yoo TK. Development of an implantable collamer lens sizing model: a retrospective study using ANTERION swept-source optical coherence tomography and a literature review. BMC Ophthalmol 2023; 23:59. [PMID: 36765328 PMCID: PMC9921691 DOI: 10.1186/s12886-023-02814-7] [Received: 07/28/2022] [Accepted: 02/09/2023]
Abstract
BACKGROUND Optimal sizing for phakic intraocular lens (EVO-ICL with KS-AquaPort) implantation plays an important role in preventing postoperative complications. We aimed to formulate optimal lens sizing using ocular biometric parameters measured with a Heidelberg anterior segment optical coherence tomography (AS-OCT) device. METHODS We retrospectively analyzed 892 eyes of 471 healthy subjects treated with an intraocular collamer lens (ICL) and assigned them to either the development (80%) or validation (20%) set. We built vault prediction models using the development set via classic linear regression methods as well as partial least squares and least absolute shrinkage and selection operator (LASSO) regression techniques. We evaluated prediction abilities based on the Bayesian information criterion (BIC) to select the best prediction model. The performance was measured using Pearson's correlation coefficient and the mean absolute error (MAE) between the achieved and predicted results. RESULTS Measurements of aqueous depth (AQD), anterior chamber volume, anterior chamber angle (ACA) distance, spur-to-spur distance, crystalline lens thickness (LT), and white-to-white distance from ANTERION were highly associated with the ICL vault. The LASSO model using the AQD, ACA distance, and LT showed the best BIC results for postoperative ICL vault prediction. In the validation dataset, the LASSO model showed the strongest correlation (r = 0.582, P < 0.001) and the lowest MAE (104.7 μm). CONCLUSION This is the first study to develop a postoperative ICL vault prediction and lens-sizing model based on the ANTERION. As the measurements from ANTERION and other AS-OCT devices are not interchangeable, ANTERION may be used for optimal ICL sizing using our formula. Because our model was developed based on the East Asian population, further studies are needed to explore the role of this prediction model in different populations.
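The two validation metrics named in the abstract, Pearson's correlation coefficient and the MAE (read here as mean absolute error, which matches the micrometre unit reported), can be sketched as follows. The function names and the vault values are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
import math


def mean_absolute_error(achieved, predicted):
    """Average absolute gap between achieved and predicted values."""
    return sum(abs(a - p) for a, p in zip(achieved, predicted)) / len(achieved)


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical postoperative vault values in micrometres (illustration only).
achieved = [500.0, 620.0, 410.0, 700.0]
predicted = [450.0, 600.0, 500.0, 650.0]

mae = mean_absolute_error(achieved, predicted)
r = pearson_r(achieved, predicted)
print(f"MAE = {mae:.1f} um, r = {r:.3f}")
```

The MAE stays in the unit of the outcome (micrometres), whereas a mean squared error would be in squared micrometres.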
Affiliation(s)
- Ik Hee Ryu
- VISUWORKS, Seoul, South Korea; Department of Refractive Surgery, B&VIIT Eye Center, 1317-23 Seocho-Dong, Seocho-Gu, Seoul, South Korea
- Jin Kuk Kim
- VISUWORKS, Seoul, South Korea; Department of Refractive Surgery, B&VIIT Eye Center, 1317-23 Seocho-Dong, Seocho-Gu, Seoul, South Korea
- In Sik Lee
- Department of Refractive Surgery, B&VIIT Eye Center, 1317-23 Seocho-Dong, Seocho-Gu, Seoul, South Korea
- Eoksoo Han
- Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea
- Tae Keun Yoo
- VISUWORKS, Seoul, South Korea; Department of Refractive Surgery, B&VIIT Eye Center, 1317-23 Seocho-Dong, Seocho-Gu, Seoul, South Korea
4
Wang H, Li Q, Liu Y. Regularized Buckley-James method for right-censored outcomes with block-missing multimodal covariates. Stat (Int Stat Inst) 2022; 11:e515. [PMID: 37854542 PMCID: PMC10583730 DOI: 10.1002/sta4.515] [Received: 07/06/2022] [Accepted: 10/10/2022]
Abstract
High-dimensional data with censored outcomes of interest are prevalent in medical research. To analyze such data, the regularized Buckley-James estimator has been successfully applied to build accurate predictive models and conduct variable selection. In this paper, we consider the problem of parameter estimation and variable selection for the semiparametric accelerated failure time model for high-dimensional block-missing multimodal neuroimaging data with censored outcomes. We propose a penalized Buckley-James method that can simultaneously handle block-wise missing covariates and censored outcomes. This method can also perform variable selection. The proposed method is evaluated by simulations and applied to a multimodal neuroimaging dataset and obtains meaningful results.
Affiliation(s)
- Haodong Wang
- Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill, Chapel Hill, 27599, North Carolina, USA
- Quefeng Li
- Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, 27516, North Carolina, USA
- Yufeng Liu
- Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill, Chapel Hill, 27599, North Carolina, USA
- Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, 27516, North Carolina, USA
- Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, 27599-7264, North Carolina, USA
- Carolina Center for Genome Sciences, The University of North Carolina at Chapel Hill, Chapel Hill, 27514, North Carolina, USA
- Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill, Chapel Hill, 27514, North Carolina, USA
5
Guo Y, Yang Y, Cao F, Liu Y, Li W, Yang C, Feng M, Luo Y, Cheng L, Li Q, Zeng X, Miao X, Li L, Qiu W, Kang Y. Radiomics features of DSC-PWI in time dimension may provide a new chance to identify ischemic stroke. Front Neurol 2022; 13:889090. [DOI: 10.3389/fneur.2022.889090] [Received: 03/03/2022] [Accepted: 08/25/2022]
Abstract
Ischemic stroke has become a severe disease endangering human life. However, few studies have analyzed the radiomics features that are of great clinical significance for the diagnosis, treatment, and prognosis of patients with ischemic stroke. Because dynamic susceptibility contrast perfusion-weighted imaging (DSC-PWI) images carry rich cerebral blood flow information, this study aims to find the critical features hidden in DSC-PWI images to characterize hypoperfusion areas (HA) and normal areas (NA). This study retrospectively analyzed 80 DSC-PWI datasets from 56 patients with ischemic stroke from 2013 to 2016. To explore features in HA and NA, 13 feature sets (Fmethod) were obtained from different feature selection algorithms. Furthermore, these 13 Fmethod were validated in identifying HA and NA and in distinguishing the proportion of ischemic lesions in brain tissue. In identifying HA and NA, the composite score (CS) of the 13 Fmethod ranged from 0.624 to 0.925. FLasso achieved the best performance of the 13 Fmethod, with mAcc of 0.958, mPre of 0.96, mAuc of 0.982, mF1 of 0.959, and mRecall of 0.96. In classifying the proportion of the ischemic region, the best CS was 0.786, with Acc of 0.888 and Pre of 0.863. The classification ability was relatively stable when the reference threshold (RT) was <0.25. When RT was >0.25, the performance gradually decreased as RT increased. These results showed that radiomics features extracted with the Lasso algorithm could accurately reflect cerebral blood flow changes and classify HA and NA. In addition, in the event of ischemic stroke, the ability of radiomics features to distinguish the proportion of ischemic areas needs to be improved. Further research should be conducted on feature engineering, model optimization, and the universality of the algorithms in the future.
6
Wang JY, Yin YH, Zheng JY, Liu LF, Yao ZP, Xin GZ. Least absolute shrinkage and selection operator-based prediction of collision cross section values for ion mobility mass spectrometric analysis of lipids. Analyst 2022; 147:1236-1244. [DOI: 10.1039/d1an02161c]
Abstract
A least absolute shrinkage and selection operator (LASSO)-based prediction method was developed for the prediction of lipids’ CCS values.
Affiliation(s)
- Jian-Ying Wang
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, China
- State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation) and Shenzhen Key Laboratory of Food Biological Safety Control, Shenzhen Research Institute of Hong Kong Polytechnic University, Shenzhen 518057, China
- State Key Laboratory of Chemical Biology and Drug Discovery, Food Safety and Technology Research Centre and Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China
- Ying-Hao Yin
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, China
- Jia-Yi Zheng
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, China
- Li-Fang Liu
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, China
- Zhong-Ping Yao
- State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation) and Shenzhen Key Laboratory of Food Biological Safety Control, Shenzhen Research Institute of Hong Kong Polytechnic University, Shenzhen 518057, China
- State Key Laboratory of Chemical Biology and Drug Discovery, Food Safety and Technology Research Centre and Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China
- Gui-Zhong Xin
- State Key Laboratory of Natural Medicines, Department of Chinese Medicines Analysis, School of Traditional Chinese Pharmacy, China Pharmaceutical University, No. 24 Tongjia Lane, Nanjing, China
7
Xiong J, He W. Identification of survival relevant genes with measurement error in gene expression incorporated. Commun Stat Theory Methods 2021. [DOI: 10.1080/03610926.2021.2004424]
Affiliation(s)
- Juan Xiong
- Health Science Center, Shenzhen University, Shenzhen, Guangdong, P. R. China
- Wenqing He
- University of Western Ontario, London, Ontario, Canada
8
Affiliation(s)
- Rahim Alhamzawi
- Department of Statistics, University of Al-Qadisiyah, Al Diwaniyah, Iraq
9
Sun Z, Liu Y, Chen K, Li G. Broken adaptive ridge regression for right-censored survival data. Ann Inst Stat Math 2021. [DOI: 10.1007/s10463-021-00794-3]
10
Min KS, Sheridan B, Waryasz GR, Joeris A, Warner JJP, Ring D, Chen N. Predicting reoperation after operative treatment of proximal humerus fractures. Eur J Orthop Surg Traumatol 2021; 31:1105-1112. [PMID: 33394141 DOI: 10.1007/s00590-020-02841-w] [Received: 09/11/2020] [Accepted: 11/18/2020]
Abstract
PURPOSE The current understanding of the factors associated with a second surgery or loss of alignment after operative treatment of a proximal humerus fracture has relied on small sample studies with stepwise regression analysis. In this study, we used a powerful regression analysis over a large sample and with many variables to test the null hypothesis that there are no factors associated with revision surgery or loss of alignment after operative treatment of proximal humerus fractures. METHODS A retrospective review of all surgically treated proximal humerus fractures from January 1, 2000, to December 31, 2015, was performed at a tertiary level hospital. We extracted longitudinal medical records for all patients, and the data were organized into two categories of predictors: fracture/operative characteristics and patient characteristics. RESULTS During the study period, 423 patients met the inclusion criteria. Three hundred and fourteen of the fractures underwent open reduction internal fixation (ORIF) and 109 underwent hemiarthroplasty. Thirty-three patients underwent revision surgery (8%). Seventy-nine patients treated with ORIF had loss of alignment (25%). Across the entire cohort, the least absolute shrinkage and selection operator (LASSO) analysis found that patients between 40 and 60 years of age had higher odds of revision surgery (OR = 1.6). In patients treated with ORIF, the LASSO regression found an unreduced calcar to be the strongest predictor of loss of alignment (OR = 5.5), followed by osteoporosis (OR = 1.3), prior radiation treatment (OR = 1.3), unreduced greater tuberosity (OR = 1.2) and age over 80 years (OR = 1.2). CONCLUSION Reoperation after proximal humerus surgery is infrequent even though loss of alignment is common. In our cohort, not all patients who had a loss of alignment underwent revision surgery; consequently, obtaining the best possible reduction at the index surgery is paramount.
Affiliation(s)
- Kyong S Min
- Department of Orthopaedic Surgery, Tripler Army Medical Center, 1 Jarrett White Road, 4F, Honolulu, HI, 96859, USA.
- Alexander Joeris
- AO Clinical Investigation and Documentation, AO Foundation, Duebendorf, Switzerland
- David Ring
- Dell Medical School-The University of Texas at Austin, Austin, TX, USA
- Neal Chen
- Massachusetts General Hospital, Boston, MA, USA
11
Su Y, Chen Y, Tian Z, Lu C, Chen L, Ma X. lncRNAs classifier to accurately predict the recurrence of thymic epithelial tumors. Thorac Cancer 2020; 11:1773-1783. [PMID: 32374079 PMCID: PMC7327696 DOI: 10.1111/1759-7714.13439] [Received: 01/30/2020] [Revised: 03/28/2020] [Accepted: 03/30/2020]
Abstract
Background Long non‐coding RNAs (lncRNAs), which have little or no ability to encode proteins, have attracted special attention due to their potential role in cancer. In this study we aimed to establish a lncRNAs classifier to improve the accuracy of recurrence prediction for thymic epithelial tumors (TETs). Methods The TETs RNA sequencing (RNA‐seq) data set and the matched clinicopathologic information were downloaded from The Cancer Genome Atlas. Using univariate Cox regression and least absolute shrinkage and selection operator (LASSO) analysis, we developed a lncRNAs classifier related to recurrence. Functional analysis was conducted to investigate the potential biological processes of the lncRNAs target genes. The independent prognostic factors were identified by a Cox regression model. Additionally, the predictive ability and clinical application of the lncRNAs classifier were assessed and compared with the Masaoka staging system by receiver operating characteristic (ROC) analysis and decision curve analysis (DCA). Results Four recurrence‐free survival (RFS)‐related lncRNAs were identified, and the classifier consisting of the identified four lncRNAs was able to effectively divide the patients into high and low risk subgroups, with an area under the curve (AUC) of 0.796 (three‐year RFS) and 0.788 (five‐year RFS), respectively. Multivariate analysis indicated that the lncRNAs classifier was an independent recurrence risk factor. The AUC of the lncRNAs classifier in predicting RFS was significantly higher than that of the Masaoka staging system. Decision curve analysis further demonstrated that the lncRNAs classifier had a larger net benefit than the Masaoka staging system. Conclusions A lncRNAs classifier for patients with TETs was an independent risk factor for RFS, independent of other clinicopathologic variables. It generated more accurate estimations of the recurrence probability when compared to the Masaoka staging system, but additional data are required before it can be used in clinical practice.
Affiliation(s)
- Yongchao Su
- Department of Thoracic Surgery, Sanya Central Hospital, Sanya, China
- Yongbing Chen
- Department of Thoracic Surgery, The Second Affiliated Hospital of Soochow University, Suzhou, China
- Zuochun Tian
- Department of Thoracic Surgery, Sanya Central Hospital, Sanya, China
- Chuangang Lu
- Department of Thoracic Surgery, Sanya Central Hospital, Sanya, China
- Liang Chen
- Department of Respiratory Medicine, Sanya Central Hospital, Sanya, China
- Ximiao Ma
- Department of Thoracic Surgery, Haikou People's Hospital, Haikou, China
12
Wang M, Long Q, Chen C, Zhang L. Assessing predictive accuracy of survival regressions subject to nonindependent censoring. Stat Med 2020; 39:469-480. [PMID: 31814158 DOI: 10.1002/sim.8420] [Received: 01/09/2019] [Revised: 08/28/2019] [Accepted: 10/13/2019]
Abstract
Survival regression is commonly applied in biomedical studies and clinical trials, and evaluating its predictive performance plays an essential role in model diagnosis and selection. The presence of censored data, particularly if informative, may pose more challenges for the assessment of predictive accuracy. Existing literature mainly focuses on prediction of survival probabilities, with limited work on survival time. In this work, we focus on accuracy measures of predicted survival times adjusted for a potentially informative censoring mechanism (ie, coarsening at random (CAR); non-CAR) by adopting the technique of inverse probability of censoring weighting. Our proposed predictive metric can be adapted to various survival regression frameworks including but not limited to accelerated failure time models and proportional hazards models. Moreover, we provide the asymptotic properties of the inverse probability of censoring weighting estimators under CAR. We consider the settings of high-dimensional data under CAR or non-CAR for extensions. The performance of the proposed method is evaluated through extensive simulation studies and analysis of real data from the Critical Assessment of Microarray Data Analysis.
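The general idea of inverse probability of censoring weighting (IPCW) can be sketched in a simplified setting. The sketch below is not the authors' estimator: it assumes censoring is independent of covariates, estimates the censoring survival function G with a Kaplan-Meier curve, assumes no tied observation times, and uses hypothetical function names and toy data. Each uncensored subject's absolute prediction error is weighted by 1 / G(T-), so that observed events stand in for comparable censored subjects.

```python
def censoring_survival_left_limit(times, events):
    """Kaplan-Meier estimate of the censoring survival function.

    Returns a function t -> G(t-), the estimated probability of not being
    censored before t. Censoring (event == 0) is treated as the "event".
    Assumes no tied observation times.
    """
    data = sorted(zip(times, events))
    steps = []           # (time, value of G just after that time)
    surv = 1.0
    at_risk = len(data)
    for t, event in data:
        if event == 0:   # a censoring time: G drops here
            surv *= 1.0 - 1.0 / at_risk
            steps.append((t, surv))
        at_risk -= 1

    def g_minus(t):
        value = 1.0
        for step_time, step_value in steps:
            if step_time < t:
                value = step_value
            else:
                break
        return value

    return g_minus


def ipcw_mae(times, events, predicted):
    """IPCW mean absolute error of predicted survival times."""
    g_minus = censoring_survival_left_limit(times, events)
    total = 0.0
    for t, event, p in zip(times, events, predicted):
        if event == 1:   # only uncensored subjects contribute, up-weighted
            total += abs(t - p) / g_minus(t)
    return total / len(times)


# Toy data (illustration only): observed time, event indicator, prediction.
times = [2.0, 3.0, 5.0, 7.0, 8.0]
events = [1, 0, 1, 0, 1]
predicted = [3.0, 4.0, 4.0, 6.0, 10.0]

print(ipcw_mae(times, events, predicted))
```

Handling informative (non-CAR) censoring, as the paper does, would require modeling G as a function of covariates or other information rather than a single Kaplan-Meier curve.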
Affiliation(s)
- Ming Wang
- Division of Biostatistics and Bioinformatics, Department of Public Health Sciences, Penn State University, Hershey, Pennsylvania
- Qi Long
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
- Chixiang Chen
- Division of Biostatistics and Bioinformatics, Department of Public Health Sciences, Penn State University, Hershey, Pennsylvania
- Lijun Zhang
- Institute for Personalized Medicine, Penn State University, Hershey, Pennsylvania
13
Xiong TF, Pan FQ, Liang Q, Luo R, Li D, Mo H, Zhou X. Prognostic value of the expression of chemokines and their receptors in regional lymph nodes of melanoma patients. J Cell Mol Med 2020; 24:3407-3418. [PMID: 31983065 PMCID: PMC7131952 DOI: 10.1111/jcmm.15015] [Received: 06/04/2019] [Revised: 12/10/2019] [Accepted: 12/21/2019]
Abstract
Chemokines and their receptors have been reported to drive immune cells into tumours or to be directly involved in the promotion or inhibition of the development of tumours. However, their expression in regional lymph node (LN) tissues in melanoma patients remains unknown. The present study investigated the relationship between the expression of mRNA of chemokines and their receptors and clinicopathology of the regional LN tissues of skin cutaneous melanoma (SKCM) patients available in The Cancer Genome Atlas. The relationship between chemokines and their receptors and the composition of immune cells within the tumour was analysed. In SKCM regional LN tissues, the high expression of 32 types of chemokines and receptors, namely CCL2, 4‐5, 7‐8, 13, 22‐25, CCR1‐9, CXCL9‐13, 16, CXCR3, 5, 6, XCL1‐2 and XCR1 in LN was associated with favourable patient prognosis. Conversely, high expression of CXCL17 was an indicator of poor prognosis. The expression of mRNA for CXCL9‐11, 13, CXCR3, 6, CCL2, 4, 5, 7, 8, 25, CCR1, 2, 5, and XCL1, 2 in regional LN tissues was positively correlated with the fraction of CD8‐positive T cells and M1 macrophages, and was negatively correlated with M0 macrophages. CCR4, 6‐9, CCL13, 22, 23 and XCR1 were positively correlated with the fraction of memory B cells and naive T cells, and negatively correlated with M0 macrophages and resting mast cells, suggesting that chemokines and their receptors may affect the prognosis of patients by guiding immune cells into the tumour microenvironment to eliminate tumour cells.
Affiliation(s)
- Ting-Feng Xiong
- Department of Medical Treatment Cosmetology, The Second Affiliated Hospital of Guangxi Medical University, Nanning, China
- Fu-Qiang Pan
- Department of Medical Treatment Cosmetology, The Second Affiliated Hospital of Guangxi Medical University, Nanning, China
- Qian Liang
- Department of Medical Treatment Cosmetology, The Second Affiliated Hospital of Guangxi Medical University, Nanning, China
- Ruijin Luo
- Medical Department, The Second Affiliated Hospital of Guangxi Medical University, Nanning, China
- Dong Li
- Department of Medical Treatment Cosmetology, The Second Affiliated Hospital of Guangxi Medical University, Nanning, China
- Haiyan Mo
- Department of Medical Treatment Cosmetology, The Second Affiliated Hospital of Guangxi Medical University, Nanning, China
- Xiang Zhou
- Department of Medical Treatment Cosmetology, The Second Affiliated Hospital of Guangxi Medical University, Nanning, China
14
Huang TJ, McKeague IW, Qian M. Marginal screening for high-dimensional predictors of survival outcomes. Stat Sin 2019; 29:2105-2139. [PMID: 31938013 PMCID: PMC6959482 DOI: 10.5705/ss.202017.0298]
Abstract
This study develops a marginal screening test to detect the presence of significant predictors for a right-censored time-to-event outcome under a high-dimensional accelerated failure time (AFT) model. Establishing a rigorous screening test in this setting is challenging, because of the right censoring and the post-selection inference. In the latter case, an implicit variable selection step needs to be included to avoid inflating the Type-I error. A prior study solved this problem by constructing an adaptive resampling test under an ordinary linear regression. To accommodate right censoring, we develop a new approach based on a maximally selected Koul-Susarla-Van Ryzin estimator from a marginal AFT working model. A regularized bootstrap method is used to calibrate the test. Our test is more powerful and less conservative than both a Bonferroni correction of the marginal tests and other competing methods. The proposed method is evaluated in simulation studies and applied to two real data sets.
Affiliation(s)
- Min Qian
- Department of Biostatistics, Columbia University
15
Liu F, Zhang H, Xue L, Yang Q, Yan W. Molecular profiling of transcription factors pinpoints MYC-estrogen related receptor α-regulatory factor X5 panel for characterizing the immune microenvironment and predicting the efficacy of immune checkpoint inhibitors in renal cell carcinoma. Oncol Lett 2019; 18:1895-1903. [PMID: 31423259 PMCID: PMC6614680 DOI: 10.3892/ol.2019.10523] [Received: 10/09/2018] [Accepted: 05/22/2019]
Abstract
Transcription factors (TFs) play key roles in biological processes, and previous studies revealed that they can control oncogenic processes. However, the functional impact of TFs on the prognosis of patients with cancer has not been extensively elucidated. In the context of The Cancer Genome Atlas, few studies have focused on the roles of TFs in tumorigenesis. In the present study, a TF-based robust MYC-estrogen related receptor α-regulatory factor X5 (MYC-ESRRA-RFX5) signature was developed for predicting the survival of patients with renal cell carcinoma. Functional enrichment analysis of this signature revealed that it was associated with the immune system of these patients. Further analysis demonstrated that this panel could characterize the immune microenvironment and potentially predict the effectiveness of immune checkpoint inhibitors. Therefore, the present study recommends future exploration of TF-based biomarkers for their potential as prognostic predictors. Overall, the highlights of this study are: i) this novel study pinpoints a TF panel for the robust prediction of renal cell carcinoma prognosis, and ii) the MYC-ESRRA-RFX5 panel is proposed as a signature for characterizing the immune microenvironment and for potentially predicting the effectiveness of immune checkpoint inhibitors.
Affiliation(s)
- Fei Liu
- Department of Nephrology, The 940th Hospital of Joint Logistics Support Force of Chinese People's Liberation Army, Lanzhou, Gansu 730000, P.R. China
- Hongxia Zhang
- Department of Emergency Medicine, The First Hospital of The Chinese People's Liberation Army, Lanzhou, Gansu 730000, P.R. China
- Lihua Xue
- Department of Obstetrics and Gynecology, The Family Planning Service Center for Maternal and Child Health in Zhouqu County, Lanzhou, Gansu 730000, P.R. China
- Qiankun Yang
- Department of Bone and Soft Tissue, Cancer Hospital of China Medical University, Liaoning Cancer Hospital & Institute, Shenyang, Liaoning 110042, P.R. China
- Wanchun Yan
- Department of Geriatrics, The First Hospital of The Chinese People's Liberation Army, Lanzhou, Gansu 730000, P.R. China
16
Molstad AJ, Hsu L, Sun W. Gaussian process regression for survival time prediction with genome-wide gene expression. Biostatistics 2019; 22:164-180. [PMID: 31292609 DOI: 10.1093/biostatistics/kxz023] [Received: 08/24/2018] [Revised: 04/07/2019] [Accepted: 05/13/2019]
Abstract
Predicting the survival time of a cancer patient based on his/her genome-wide gene expression remains a challenging problem. For certain types of cancer, the effects of gene expression on survival are both weak and abundant, so identifying non-zero effects with reasonable accuracy is difficult. As an alternative to methods that use variable selection, we propose a Gaussian process accelerated failure time model to predict survival time using genome-wide or pathway-wide gene expression data. Using a Monte Carlo expectation-maximization algorithm, we jointly impute censored log-survival time and estimate model parameters. We demonstrate the performance of our method and its advantage over existing methods in both simulations and real data analysis. The real data that we analyze were collected from 513 patients with kidney renal clear cell carcinoma and include survival time, demographic/clinical variables, and expression of more than 20 000 genes. In addition to the right-censored survival time, our method can also accommodate left-censored or interval-censored outcomes; and it provides a natural way to combine multiple types of high-dimensional -omics data. An R package implementing our method is available in the Supplementary material available at Biostatistics online.
Affiliation(s)
- Aaron J Molstad
  - Biostatistics Program, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA, USA
- Li Hsu
  - Biostatistics Program, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA, USA and Department of Biostatistics, University of Washington, 1705 NE Pacific St, Seattle, WA 98195, USA
- Wei Sun
  - Biostatistics Program, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA, USA and Department of Biostatistics, University of Washington, 1705 NE Pacific St, Seattle, WA 98195, USA and Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Dr, Chapel Hill, NC 27599, USA
|
17
|
Wang S, Zhang H, Chai H, Liang Y. A novel Log penalty in a path seeking scheme for biomarker selection. Technol Health Care 2019; 27:85-93. [PMID: 31045529 PMCID: PMC6598102 DOI: 10.3233/thc-199009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/24/2022]
Abstract
Biomarker selection or feature selection from survival data is a topic of considerable interest. Recently, various survival analysis approaches for biomarker selection have been developed; however, current methods face growing challenges in handling high-dimensional, low-sample-size problems. We propose a novel Log-sum regularization estimator within the accelerated failure time (AFT) model for predicting cancer patient survival time with a few biomarkers. This approach is implemented in a path seeking algorithm to speed up solving the Log-sum penalty. Additionally, the control parameter of the Log-sum penalty is modified by the Bayesian information criterion (BIC). The results indicate that our proposed approach is able to achieve good performance in both simulated and real datasets compared with other ℓ1 type regularization methods for biomarker selection.
Affiliation(s)
- Sai Wang
  - Faculty of Information Technology, Macau University of Science and Technology, Macau, China
- Hui Zhang
  - Faculty of Information Technology, Macau University of Science and Technology, Macau, China
- Hua Chai
  - Faculty of Information Technology, Macau University of Science and Technology, Macau, China
- Yong Liang
  - Faculty of Information Technology, Macau University of Science and Technology, Macau, China
  - State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Macau, China
|
18
|
Wang S, Shen HW, Chai H, Liang Y. Complex harmonic regularization with differential evolution in a memetic framework for biomarker selection. PLoS One 2019; 14:e0210786. [PMID: 30763332 PMCID: PMC6375558 DOI: 10.1371/journal.pone.0210786] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/08/2018] [Accepted: 01/02/2019] [Indexed: 01/23/2023] Open
Abstract
For studying cancer and genetic diseases, identifying highly correlated genes from high-dimensional data is an important problem. It is a great challenge to select relevant biomarkers from gene expression data that contain important correlation structures, where some of the genes can be divided into different groups with a common biological function, chromosomal location or regulation. In this paper, we propose a penalized accelerated failure time model, CHR-DE, using a non-convex regularization (local search) with differential evolution (global search) in a wrapper-embedded memetic framework. The complex harmonic regularization (CHR) can approximate the combination of ℓp (1/2 ≤ p < 1) and ℓq (1 ≤ q < 2) for selecting biomarkers in groups. Differential evolution (DE) is utilized to globally optimize the CHR's hyperparameters, which gives CHR-DE a strong capability of selecting groups of genes in high-dimensional biological data. We also developed an efficient path seeking algorithm to optimize this penalized model. The proposed method is evaluated on synthetic data and three gene expression datasets: breast cancer, hepatocellular carcinoma and colorectal cancer. The experimental results demonstrate that CHR-DE is a more effective tool for feature selection and learning prediction.
Affiliation(s)
- Sai Wang
  - Faculty of Information Technology, Macau University of Science and Technology, Taipa, Macau
- Hai-Wei Shen
  - Faculty of Information Technology, Macau University of Science and Technology, Taipa, Macau
- Hua Chai
  - Faculty of Information Technology, Macau University of Science and Technology, Taipa, Macau
- Yong Liang
  - Faculty of Information Technology, Macau University of Science and Technology, Taipa, Macau
  - State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, Macau
|
19
|
Soret P, Avalos M, Wittkop L, Commenges D, Thiébaut R. Lasso regularization for left-censored Gaussian outcome and high-dimensional predictors. BMC Med Res Methodol 2018; 18:159. [PMID: 30514234 PMCID: PMC6280495 DOI: 10.1186/s12874-018-0609-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 10/04/2017] [Accepted: 11/02/2018] [Indexed: 12/14/2022] Open
Abstract
Background Biological assays for the quantification of markers may suffer from a lack of sensitivity and thus from an analytical detection limit. This is the case for human immunodeficiency virus (HIV) viral load. Below this threshold the exact value is unknown and values are consequently left-censored. Statistical methods have been proposed to deal with left-censoring but few are adapted to the context of high-dimensional data. Methods We propose to reverse the Buckley-James least squares algorithm to handle left-censored data, enhanced with a Lasso regularization to accommodate high-dimensional predictors. We present a Lasso-regularized Buckley-James least squares method with both non-parametric imputation using Kaplan-Meier and parametric imputation based on the Gaussian distribution, which is typically assumed for HIV viral load data after logarithmic transformation. Cross-validation for parameter-tuning is based on an appropriate loss function that takes into account the different contributions of censored and uncensored observations. We specify how these techniques can be easily implemented using available R packages. The Lasso-regularized Buckley-James least squares method was compared to simple imputation strategies to predict the response to antiretroviral therapy measured by HIV viral load according to the HIV genotypic mutations. We used a dataset composed of several clinical trials and cohorts from the Forum for Collaborative HIV Research (HIV Med. 2008;7:27-40). The proposed methods were also assessed on simulated data mimicking the observed data. Results Approaches accounting for left-censoring outperformed simple imputation methods in a high-dimensional setting. The Gaussian Buckley-James method with cross-validation based on the appropriate loss function showed the lowest prediction error on simulated data and, using real data, the most valid results according to the current literature on HIV mutations.
Conclusions The proposed approach deals with high-dimensional predictors and left-censored outcomes and has proved useful for predicting HIV viral load according to HIV mutations.
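To make the idea of parametric imputation under a Gaussian assumption concrete, the following is a minimal sketch, not the authors' implementation. It assumes a left-censored value is replaced by its conditional mean below the detection limit, E[Y | Y < c] = μ − σ·φ(α)/Φ(α) with α = (c − μ)/σ, which is the standard mean of a normal distribution truncated from above. The function name `left_censored_mean` and its parameters are hypothetical, and the regression step that would supply μ and σ in a Buckley-James iteration is omitted.

```python
import math


def left_censored_mean(mu, sigma, limit):
    """Mean of a N(mu, sigma^2) variable conditional on falling below `limit`.

    Hypothetical helper for illustration: `mu` and `sigma` stand for the
    current model-based mean and residual standard deviation, and `limit`
    for the assay detection limit (all on the log scale).
    """
    alpha = (limit - mu) / sigma
    # Standard normal density and distribution function at alpha.
    pdf = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))
    # Truncated-normal mean; cdf underflows to 0 for very negative alpha,
    # so callers should guard against limits far below mu.
    return mu - sigma * pdf / cdf
```

The imputed value always lies below the detection limit, which is the property that distinguishes this from simple substitution of the limit itself or of half the limit.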
Affiliation(s)
- Perrine Soret
  - Univ. Bordeaux, Inserm, Bordeaux Population Health Research Center, UMR 1219, Bordeaux, F-33000, France
  - Inria SISTM Team, Talence, F-33405, France
  - Vaccine Research Institute (VRI), Créteil, F-94000, France
- Marta Avalos
  - Univ. Bordeaux, Inserm, Bordeaux Population Health Research Center, UMR 1219, Bordeaux, F-33000, France
  - Inria SISTM Team, Talence, F-33405, France
- Linda Wittkop
  - Univ. Bordeaux, Inserm, Bordeaux Population Health Research Center, UMR 1219, Bordeaux, F-33000, France
  - Inria SISTM Team, Talence, F-33405, France
  - CHU Bordeaux, Department of Public Health, Bordeaux, F-33000, France
- Daniel Commenges
  - Univ. Bordeaux, Inserm, Bordeaux Population Health Research Center, UMR 1219, Bordeaux, F-33000, France
  - Inria SISTM Team, Talence, F-33405, France
- Rodolphe Thiébaut
  - Univ. Bordeaux, Inserm, Bordeaux Population Health Research Center, UMR 1219, Bordeaux, F-33000, France
  - Inria SISTM Team, Talence, F-33405, France
  - Vaccine Research Institute (VRI), Créteil, F-94000, France
  - CHU Bordeaux, Department of Public Health, Bordeaux, F-33000, France
|
20
|
Khan MHR. On the performance of adaptive preprocessing technique in analyzing high-dimensional censored data. Biom J 2018; 60:687-702. [PMID: 29603360 DOI: 10.1002/bimj.201600256] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 12/15/2016] [Revised: 09/05/2017] [Accepted: 10/20/2017] [Indexed: 11/09/2022]
Abstract
Preprocessing for high-dimensional censored datasets, such as microarray data, is generally considered an important technique for gaining further stability by reducing potential noise in the data. When variable selection including inference is carried out with high-dimensional censored data, the objective is to obtain a smaller subset of variables and then perform the inferential analysis using model estimates based on the selected subset of variables. This two-stage inferential analysis is prone to circularity bias because of the noise that might still remain in the dataset. In this work, I propose an adaptive preprocessing technique that uses the sure independence screening (SIS) idea to accomplish variable selection and reduces the circularity bias by means of several well-known refined high-dimensional methods, namely the elastic net, adaptive elastic net, weighted elastic net, elastic net-AFT, and two greedy variable selection methods known as TCS and PC-simple, all implemented with accelerated lifetime models. The proposed technique addresses several features, including collinearity between important and some unimportant covariates, which is often the case in high-dimensional settings under the variable selection framework, and different levels of censoring. Simulation studies, along with an empirical analysis of a real microarray dataset on mantle cell lymphoma, are carried out to demonstrate the performance of the adaptive preprocessing technique.
Affiliation(s)
- Md Hasinur Rahaman Khan
  - Applied Statistics, Institute of Statistical Research and Training, University of Dhaka, Dhaka, 1000, Bangladesh
|
21
|
Yoo JK. Fused sliced inverse regression in survival analysis. Communications for Statistical Applications and Methods 2017. [DOI: 10.5351/csam.2017.24.5.533] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/11/2022]
Affiliation(s)
- Jae Keun Yoo
  - Department of Statistics, Ewha Womans University, Korea
|
22
|
Khan MHR, Shaw JEH. On dealing with censored largest observations under weighted least squares. J Stat Comput Sim 2016. [DOI: 10.1080/00949655.2016.1185794] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/21/2022]
|
23
|
Yoo JK, Kim SJ, Seo BS, Shina H, Sim SA. Dimension reduction for right-censored survival regression: transformation approach. Communications for Statistical Applications and Methods 2016. [DOI: 10.5351/csam.2016.23.3.259] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
24
|
Laimighofer M, Krumsiek J, Buettner F, Theis FJ. Unbiased Prediction and Feature Selection in High-Dimensional Survival Regression. J Comput Biol 2016; 23:279-90. [PMID: 26894327 PMCID: PMC4827277 DOI: 10.1089/cmb.2015.0192] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Indexed: 12/31/2022] Open
Abstract
With widespread availability of omics profiling techniques, the analysis and interpretation of high-dimensional omics data, for example, for biomarkers, is becoming an increasingly important part of clinical medicine because such datasets constitute a promising resource for predicting survival outcomes. However, early experience has shown that biomarkers often generalize poorly. Thus, it is crucial that models are not overfitted and give accurate results with new data. In addition, reliable detection of multivariate biomarkers with high predictive power (feature selection) is of particular interest in clinical settings. We present an approach that addresses both aspects in high-dimensional survival models. Within a nested cross-validation (CV), we fit a survival model, evaluate a dataset in an unbiased fashion, and select features with the best predictive power by applying a weighted combination of CV runs. We evaluate our approach using simulated toy data, as well as three breast cancer datasets, to predict the survival of breast cancer patients after treatment. In all datasets, we achieve more reliable estimation of predictive power for unseen cases and better predictive performance compared to the standard CoxLasso model. Taken together, we present a comprehensive and flexible framework for survival models, including performance estimation, final feature selection, and final model construction. The proposed algorithm is implemented in an open source R package (SurvRank) available on CRAN.
Affiliation(s)
- Michael Laimighofer
  - Institute of Computational Biology, Helmholtz-Zentrum München, Neuherberg, Germany
  - Department of Mathematics, TU München, Garching, Germany
- Jan Krumsiek
  - Institute of Computational Biology, Helmholtz-Zentrum München, Neuherberg, Germany
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany
- Florian Buettner
  - Institute of Computational Biology, Helmholtz-Zentrum München, Neuherberg, Germany
  - European Bioinformatics Institute, European Molecular Biology Laboratory Hinxton, Cambridge, United Kingdom
- Fabian J Theis
  - Institute of Computational Biology, Helmholtz-Zentrum München, Neuherberg, Germany
  - Department of Mathematics, TU München, Garching, Germany
|
25
|
Park J. Quantile Regression with Left-Truncated and Right-Censored Data in a Reproducing Kernel Hilbert Space. Commun Stat-Theor M 2015. [DOI: 10.1080/03610926.2013.777741] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/23/2022]
|
26
|
Tan H, Zhang H, Xie J, Chen B, Wen C, Guo X, Zhao Q, Wu Z, Shen J, Wu J, Xu X, Li E, Xu L, Wang X. A novel staging model to classify oesophageal squamous cell carcinoma patients in China. Br J Cancer 2014; 110:2109-15. [PMID: 24569468 PMCID: PMC3992487 DOI: 10.1038/bjc.2014.101] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 11/07/2013] [Revised: 01/03/2014] [Accepted: 01/29/2014] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND Oesophageal squamous cell carcinoma (ESCC) is the predominant subtype of oesophageal carcinoma in China, with an overall 5-year survival rate of <10%. The current tumour-node-metastasis (TNM) staging system has become so complex that it is not easy to use in life expectancy assessment. We aim to combine clinical variables and biomarkers to develop and validate a relatively simple and reliable model, named the FENSAM, for ESCC prognosis. METHODS To build the FENSAM, we analysed 22 potential prognostic factors from 461 patients, including 9 biomarkers (Ezrin, Fascin, desmocollin 2 (DSC2), pFascin, activating transcription factor 3 (ATF3), connective-tissue growth factor (CTGF), neutrophil gelatinase-associated lipocalin (NGAL), NGAL receptor (NGALR), and cysteine-rich angiogenic protein 61 (CYR61)) and 13 other clinical variables. We selected significant factors associated with survival of ESCC patients, and used them to build our FENSAM model. We then obtained the hazard risk score of the model to classify ESCC patients. In addition, we validated the model in an independent cohort of 290 patients from the same hospital. The predictive performance of the model was assessed by the Area under the Receiver Operating Characteristic Curve (AUC) and Kaplan-Meier survival analysis. RESULTS We found six markers significantly associated with survival of ESCC patients (Ezrin, Fascin, ATF3, surgery extent, N-stage, and M-stage). They were combined to create a novel four-stage FENSAM model for patients' classification. FENSAM possessed a high classification precision similar to the TNM staging system, but with a much simpler model. The efficiency of FENSAM was evaluated by different quantiles of AUC and the results of survival analysis. The validation result demonstrated the potential of the FENSAM model to improve classification accuracy for ESCC patients.
CONCLUSIONS FENSAM provides an alternative classifier for ESCC patients with a high classification precision using a simple model.
Affiliation(s)
- H Tan
  - Department of Biomedical Engineering, Zhongshan School of Medicine, Sun Yat-Sen University, 135 Xin Gang W. Road, Guangzhou, China
  - The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
  - Southern China Research Center of Statistical Science, Sun Yat-Sen University, Guangzhou 510275, China
- H Zhang
  - Southern China Research Center of Statistical Science, Sun Yat-Sen University, Guangzhou 510275, China
  - Yale University School of Public Health, New Haven, CT, USA
  - Department of Statistical Science, School of Mathematics and Computational Science, Sun Yat-Sen University, Guangzhou, China
- J Xie
  - The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- B Chen
  - The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- C Wen
  - Southern China Research Center of Statistical Science, Sun Yat-Sen University, Guangzhou 510275, China
  - Department of Statistical Science, School of Mathematics and Computational Science, Sun Yat-Sen University, Guangzhou, China
- X Guo
  - Southern China Research Center of Statistical Science, Sun Yat-Sen University, Guangzhou 510275, China
  - Department of Statistical Science, School of Mathematics and Computational Science, Sun Yat-Sen University, Guangzhou, China
- Q Zhao
  - Department of Pathology, Shantou Central Hospital, Affiliated Shantou Hospital of Sun Yat-Sen University, Shantou, China
- Z Wu
  - Department of Pathology, Shantou Central Hospital, Affiliated Shantou Hospital of Sun Yat-Sen University, Shantou, China
- J Shen
  - Department of Pathology, Shantou Central Hospital, Affiliated Shantou Hospital of Sun Yat-Sen University, Shantou, China
- J Wu
  - The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- X Xu
  - The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- E Li
  - The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- L Xu
  - The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, China
- X Wang
  - Department of Biomedical Engineering, Zhongshan School of Medicine, Sun Yat-Sen University, 135 Xin Gang W. Road, Guangzhou, China
  - Southern China Research Center of Statistical Science, Sun Yat-Sen University, Guangzhou 510275, China
  - Department of Statistical Science, School of Mathematics and Computational Science, Sun Yat-Sen University, Guangzhou, China
|
27
|
Mostajabi F, Datta S, Datta S. Predicting Patient Survival from Proteomic Profile using Mass Spectrometry Data: An Empirical Study. Commun Stat-Simul C 2013. [DOI: 10.1080/03610918.2011.636165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
28
|
Ma S, Zhang Y, Huang J, Huang Y, Lan Q, Rothman N, Zheng T. Integrative analysis of cancer prognosis data with multiple subtypes using regularized gradient descent. Genet Epidemiol 2012; 36:829-38. [PMID: 22851516 PMCID: PMC3729731 DOI: 10.1002/gepi.21669] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 04/22/2012] [Revised: 06/07/2012] [Accepted: 06/26/2012] [Indexed: 11/10/2022]
Abstract
In cancer research, high-throughput profiling studies have been extensively conducted, searching for genes/single nucleotide polymorphisms (SNPs) associated with prognosis. Despite seemingly significant differences, different subtypes of the same cancer (or different types of cancers) may share common susceptibility genes. In this study, we analyze prognosis data on multiple subtypes of the same cancer but note that the proposed approach is directly applicable to the analysis of data on multiple types of cancers. We describe the genetic basis of multiple subtypes using the heterogeneity model that allows overlapping but different sets of susceptibility genes/SNPs for different subtypes. An accelerated failure time (AFT) model is adopted to describe prognosis. We develop a regularized gradient descent approach that conducts gene-level analysis and identifies genes that contain important SNPs associated with prognosis. The proposed approach belongs to the family of gradient descent approaches, is intuitively reasonable, and has affordable computational cost. Simulation study shows that when prognosis-associated SNPs are clustered in a small number of genes, the proposed approach outperforms alternatives with significantly more true positives and fewer false positives. We analyze an NHL (non-Hodgkin lymphoma) prognosis study with SNP measurements and identify genes associated with the three major subtypes of NHL, namely, DLBCL, FL, and CLL/SLL. The proposed approach identifies genes different from using alternative approaches and has the best prediction performance.
Affiliation(s)
- Jian Huang
  - Departments of Statistics & Actuarial Science, and Biostatistics, University of Iowa
- Yuan Huang
  - Department of Statistics, Penn State University
- Qing Lan
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH
- Nathaniel Rothman
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH
|
29
|
Ma S, Dai Y, Huang J, Xie Y. Identification of Breast Cancer Prognosis Markers via Integrative Analysis. Comput Stat Data Anal 2012; 56:2718-2728. [PMID: 22773869 DOI: 10.1016/j.csda.2012.02.017] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/20/2022]
Abstract
In breast cancer research, it is of great interest to identify genomic markers associated with prognosis. Multiple gene profiling studies have been conducted for such a purpose. Genomic markers identified from the analysis of single datasets often do not have satisfactory reproducibility. Among the multiple possible reasons, the most important one is the small sample sizes of individual studies. A cost-effective solution is to pool data from multiple comparable studies and conduct integrative analysis. In this study, we collect four breast cancer prognosis studies with gene expression measurements. We describe the relationship between prognosis and gene expressions using the accelerated failure time (AFT) models. We adopt a 2-norm group bridge penalization approach for marker identification. This integrative analysis approach can effectively identify markers with consistent effects across multiple datasets and naturally accommodate the heterogeneity among studies. Statistical and simulation studies demonstrate satisfactory performance of this approach. Breast cancer prognosis markers identified using this approach have sound biological implications and satisfactory prediction performance.
|
30
|
Goodenough AE, Hart AG, Stafford R. Regression with empirical variable selection: description of a new method and application to ecological datasets. PLoS One 2012; 7:e34338. [PMID: 22479605 PMCID: PMC3316704 DOI: 10.1371/journal.pone.0034338] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 08/05/2011] [Accepted: 03/01/2012] [Indexed: 11/18/2022] Open
Abstract
Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
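The procedure outlined in this abstract can be sketched roughly as follows. This is an illustrative approximation, not the published REVS implementation: it assumes that "empirical support" for a variable is the summed Akaike weight of all subset models containing it, which the abstract does not specify, and it assumes the nested models are then compared by AIC. The names `solve`, `ols_aic` and `revs_like` are hypothetical, and ordinary least squares is fitted via the normal equations in plain Python.

```python
import itertools
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_aic(rows, y, cols):
    """Gaussian AIC (up to a constant) of OLS on `cols` plus an intercept."""
    design = [[1.0] + [row[j] for j in cols] for row in rows]
    k = len(design[0])
    xtx = [[sum(d[a] * d[b] for d in design) for b in range(k)] for a in range(k)]
    xty = [sum(d[a] * yi for d, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(bj * dj for bj, dj in zip(beta, d))) ** 2
              for d, yi in zip(design, y))
    n = len(y)
    return n * math.log(rss / n) + 2 * (k + 1)  # +1 for the error variance


def revs_like(rows, y):
    """Rank variables by all-subsets support, then build nested models."""
    p = len(rows[0])
    subsets = [c for size in range(p + 1)
               for c in itertools.combinations(range(p), size)]
    aics = [ols_aic(rows, y, c) for c in subsets]
    best_aic = min(aics)
    raw = [math.exp(-0.5 * (a - best_aic)) for a in aics]
    total = sum(raw)
    # Assumed support measure: summed Akaike weight of models containing j.
    support = [0.0] * p
    for cols, w in zip(subsets, raw):
        for j in cols:
            support[j] += w / total
    order = sorted(range(p), key=lambda j: -support[j])
    # One model per predictor: top variable, top two, top three, and so on.
    nested = [tuple(order[:m]) for m in range(1, p + 1)]
    best = min(nested, key=lambda c: ols_aic(rows, y, c))
    return support, nested, best
```

Because all-subsets regression fits 2^p models, a sketch like this is only practical for a modest number of predictors; the final comparison, by contrast, involves just p nested models, which is the economy the abstract emphasises.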
Affiliation(s)
- Anne E Goodenough
  - Department of Natural and Social Sciences, University of Gloucestershire, Cheltenham, United Kingdom.
|
31
|
Ma S, Huang J, Xie Y, Yi N. Identification of breast cancer prognosis markers using integrative sparse boosting. Methods Inf Med 2012; 51:152-61. [PMID: 22344268 DOI: 10.3414/me11-02-0019] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 06/07/2011] [Accepted: 11/08/2011] [Indexed: 11/09/2022]
Abstract
OBJECTIVES In breast cancer research, it is important to identify genomic markers associated with prognosis. Multiple microarray gene expression profiling studies have been conducted, searching for prognosis markers. Genomic markers identified from the analysis of single datasets often suffer a lack of reproducibility because of small sample sizes. Integrative analysis of data from multiple independent studies has a larger sample size and may provide a cost-effective solution. METHODS We collect four breast cancer prognosis studies with gene expression measurements. An accelerated failure time (AFT) model with an unknown error distribution is adopted to describe survival. An integrative sparse boosting approach is employed for marker selection. The proposed model and boosting approach can effectively accommodate heterogeneity across multiple studies and identify genes with consistent effects. RESULTS Simulation study shows that the proposed approach outperforms alternatives including meta-analysis and intensity approaches by identifying the majority or all of the true positives, while having a low false positive rate. In the analysis of breast cancer data, 44 genes are identified as associated with prognosis. Many of the identified genes have been previously suggested as associated with tumorigenesis and cancer prognosis. The identified genes and corresponding predicted risk scores differ from those using alternative approaches. Monte Carlo-based prediction evaluation suggests that the proposed approach has the best prediction performance. CONCLUSIONS Integrative analysis may provide an effective way of identifying breast cancer prognosis markers. Markers identified using the integrative sparse boosting analysis have sound biological implications and satisfactory prediction performance.
Affiliation(s)
- S Ma
  - School of Public Health, Yale University, New Haven CT 06520, USA.
|
32
|
Fitting marginal accelerated failure time models to clustered survival data with potentially informative cluster size. Comput Stat Data Anal 2011. [DOI: 10.1016/j.csda.2011.06.015] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
33
Zhao Y, Wang G. Additive risk analysis of microarray gene expression data via correlation principal component regression. J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720010004914] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
To predict future patients' survival time based on their microarray gene expression data, one interesting question is how to relate genes to survival outcomes. In this paper, by applying a semi-parametric additive risk model in survival analysis, we propose a new approach to conduct a careful analysis of gene expression data with a focus on the model's predictive ability. In the proposed method, we apply correlation principal component regression to deal with right-censored survival data under the semi-parametric additive risk model framework with high-dimensional covariates. We also employ the time-dependent area under the receiver operating characteristic curve and the root mean squared error of prediction to assess how well the model can predict survival time. Furthermore, the proposed method is able to identify genes that are significantly related to the disease. Finally, the proposed approach is illustrated using the diffuse large B-cell lymphoma data set and a breast cancer data set. The results show that the model fits the data sets very well.
Affiliation(s)
- Yichuan Zhao
- Department of Mathematics and Statistics, Georgia State University, Atlanta, GA 30303, USA
- Guoshen Wang
- Department of Mathematics and Statistics, Georgia State University, Atlanta, GA 30303, USA
34
Ma S, Huang J, Wei F, Xie Y, Fang K. Integrative analysis of multiple cancer prognosis studies with gene expression measurements. Stat Med 2011; 30:3361-71. [PMID: 22105693 DOI: 10.1002/sim.4337] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 12/01/2010] [Accepted: 06/07/2011] [Indexed: 11/11/2022]
Abstract
Although microarray gene profiling studies in cancer research have been successful in identifying genetic variants predisposing to the development and progression of cancer, the markers identified from the analysis of single datasets often suffer from low reproducibility. Among multiple possible causes, the most important one is the small sample size, and hence the lack of power, of single studies. Integrative analysis jointly considers multiple heterogeneous studies, has a significantly larger sample size, and can improve reproducibility. In this article, we focus on cancer prognosis studies, where the response variables are progression-free, overall, or other types of survival. A group minimax concave penalty (GMCP) penalized integrative analysis approach is proposed for analyzing multiple heterogeneous cancer prognosis studies with microarray gene expression measurements. An efficient group coordinate descent algorithm is developed. The GMCP can automatically accommodate the heterogeneity across multiple datasets, and the identified markers have consistent effects across multiple studies. Simulation studies show that the GMCP provides significantly improved selection results as compared with the existing meta-analysis approaches, intensity approaches, and group Lasso penalized integrative analysis. We apply the GMCP to four microarray studies and identify genes associated with the prognosis of breast cancer.
Affiliation(s)
- Shuangge Ma
- School of Public Health, Yale University, New Haven, CT, USA.
35
Abstract
This article considers the problem of selecting predictors of time to an event from a high-dimensional set of candidate predictors using data from multiple studies. As an alternative to the current multistage testing approaches, we propose to model the study-to-study heterogeneity explicitly using a hierarchical model to borrow strength. Our method incorporates censored data through an accelerated failure time model. Using a carefully formulated prior specification, we develop a fast approach to predictor selection and shrinkage estimation for high-dimensional predictors. For model fitting, we develop a Monte Carlo expectation maximization (MC-EM) algorithm to accommodate censored data. The proposed approach, which is related to the relevance vector machine (RVM), relies on maximum a posteriori estimation to rapidly obtain a sparse estimate. As for the typical RVM, there is an intrinsic thresholding property in which unimportant predictors tend to have their coefficients shrunk to zero. We compare our method with some commonly used procedures through simulation studies. We also illustrate the method using the gene expression barcode data from three breast cancer studies.
Affiliation(s)
- Fei Liu
- IBM T. J. Watson Research Center, Yorktown Heights, New York 10598, USA.
36
Li X, Gill R, Cooper NGF, Yoo JK, Datta S. Modeling microRNA-mRNA interactions using PLS regression in human colon cancer. BMC Med Genomics 2011; 4:44. [PMID: 21595958 PMCID: PMC3123543 DOI: 10.1186/1755-8794-4-44] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Received: 09/27/2010] [Accepted: 05/19/2011] [Indexed: 12/21/2022]
Abstract
Background Changes in microRNA (miRNA) expression patterns have been extensively characterized in several cancers, including human colon cancer. However, how these miRNAs and their putative mRNA targets contribute to the etiology of cancer is poorly understood. In this work, a bioinformatics computational approach with miRNA and mRNA expression data was used to identify the putative targets of miRNAs and to construct association networks between miRNAs and mRNAs to gain some insights into the underlying molecular mechanisms of human colon cancer. Method The miRNA and mRNA microarray expression profiles from the same tissues, including 7 human colon tumor tissues and 4 normal tissues, collected by the Broad Institute, were used to identify significant associations between miRNA and mRNA. We applied the partial least squares (PLS) regression method and bootstrap-based statistical tests to the joint expression profiles of differentially expressed miRNAs and mRNAs. From this analysis, we predicted putative miRNA targets and association networks between miRNAs and mRNAs. Pathway analysis was employed to identify biological processes related to these miRNAs and their associated predicted mRNA targets. Results The up-regulated mRNAs most significantly associated with a down-regulated miRNA, as identified by the proposed methodology, were considered to be the miRNA targets. On average, approximately 16.5% and 11.0% of targets predicted by this approach were also predicted as targets by the common prediction algorithms TargetScan and miRanda, respectively. We demonstrated that our method detects more targets than a simple correlation-based association. Integrative mRNA:miRNA predictive networks from our analysis were constructed with the aid of Cytoscape software. Pathway analysis validated the miRNAs through their predicted targets that may be involved in cancer-associated biological networks.
Conclusion We have identified an alternative bioinformatics approach for predicting miRNA targets in human colon cancer and for reverse engineering the miRNA:mRNA network using inversely related mRNA and miRNA joint expression profiles. We demonstrated the superiority of our predictive method compared to the correlation based target prediction algorithm through a simulation study. We anticipate that the unique miRNA targets predicted by the proposed method will advance the understanding of the molecular mechanism of colon cancer and will suggest novel therapeutic targets after further experimental validations.
Affiliation(s)
- Xiaohong Li
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA
37
Abstract
Dimension reduction and model and variable selection are ubiquitous concepts in modern statistical science, and deriving new methods beyond the scope of current methodology is noteworthy. This article briefly reviews existing regularization methods for penalized least squares and likelihood for survival data and their extension to a certain class of penalized estimating functions. We show that if one's goal is to estimate the entire regularized coefficient path using the observed survival data, then all current strategies fail for the Buckley-James estimating function. We propose a novel two-stage method to estimate and restore the entire Dantzig-regularized coefficient path for censored outcomes in a least-squares framework. We apply our methods to a microarray study of lung adenocarcinoma with sample size n = 200 and p = 1036 gene predictors and find 10 genes that are consistently selected across different criteria and an additional 14 genes that merit further investigation. In simulation studies, we found that the proposed path restoration and variable selection technique has the potential to perform as well as existing methods that begin with a proper convex loss function at the outset.
Affiliation(s)
- Brent A Johnson
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia 30322, USA.
38
Binder H, Porzelius C, Schumacher M. An overview of techniques for linking high-dimensional molecular data to time-to-event endpoints by risk prediction models. Biom J 2011; 53:170-89. [PMID: 21328602 DOI: 10.1002/bimj.201000152] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 07/26/2010] [Revised: 12/22/2010] [Accepted: 12/23/2010] [Indexed: 11/07/2022]
Abstract
Analysis of molecular data promises identification of biomarkers for improving prognostic models, thus potentially enabling better patient management. For identifying such biomarkers, risk prediction models can be employed that link high-dimensional molecular covariate data to a clinical endpoint. In low-dimensional settings, a multitude of statistical techniques already exists for building such models, e.g. allowing for variable selection or for quantifying the added value of a new biomarker. We provide an overview of techniques for regularized estimation that transfer this toward high-dimensional settings, with a focus on models for time-to-event endpoints. Techniques for incorporating specific covariate structure are discussed, as well as techniques for dealing with more complex endpoints. Employing gene expression data from patients with diffuse large B-cell lymphoma, some typical modeling issues from low-dimensional settings are illustrated in a high-dimensional application. First, the performance of classical stepwise regression is compared to stage-wise regression, as implemented by a component-wise likelihood-based boosting approach. A second issue arises when artificially transforming the response into a binary variable. The effects of the resulting loss of efficiency and potential bias in a high-dimensional setting are illustrated, and a link to competing risks models is provided. Finally, we discuss conditions for adequately quantifying the added value of high-dimensional gene expression measurements, both at the stage of model fitting and when performing evaluation.
Affiliation(s)
- Harald Binder
- Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Stefan-Meier-Str. 26, 79104 Freiburg, Germany.
39
Bøvelstad HM, Borgan O. Assessment of evaluation criteria for survival prediction from genomic data. Biom J 2011; 53:202-16. [PMID: 21308723 DOI: 10.1002/bimj.201000048] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 03/03/2010] [Revised: 10/13/2010] [Accepted: 11/11/2010] [Indexed: 11/10/2022]
Abstract
Survival prediction from high-dimensional genomic data is dependent on a proper regularization method. With an increasing number of such methods proposed in the literature, comparative studies are called for and some have been performed. However, there is currently no consensus on which prediction assessment criterion should be used for time-to-event data. Without a firm knowledge about whether the choice of evaluation criterion may affect the conclusions made as to which regularization method performs best, these comparative studies may be of limited value. In this paper, four evaluation criteria are investigated: the log-rank test for two groups, the area under the time-dependent ROC curve (AUC), an R²-measure based on the Cox partial likelihood, and an R²-measure based on the Brier score. The criteria are compared according to how they rank six widely used regularization methods that are based on the Cox regression model, namely univariate selection, principal components regression (PCR), supervised PCR, partial least squares regression, ridge regression, and the lasso. Based on our application to three microarray gene expression data sets, we find that the results obtained from the widely used log-rank test deviate from the other three criteria studied. For future studies, where one also might want to include non-likelihood or non-model-based regularization methods, we argue in favor of AUC and the R²-measure based on the Brier score, as these do not suffer from the arbitrary splitting into two groups nor depend on the Cox partial likelihood.
Affiliation(s)
- Hege M Bøvelstad
- Department of Mathematics, University of Oslo, PO Box 1053, Blindern, Oslo NO-0316, Norway.
40
Bonato V, Baladandayuthapani V, Broom BM, Sulman EP, Aldape KD, Do KA. Bayesian ensemble methods for survival prediction in gene expression data. Bioinformatics 2011; 27:359-67. [PMID: 21148161 PMCID: PMC3031034 DOI: 10.1093/bioinformatics/btq660] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Indexed: 02/03/2023]
Abstract
MOTIVATION We propose a Bayesian ensemble method for survival prediction in high-dimensional gene expression data. We specify a fully Bayesian hierarchical approach based on an ensemble 'sum-of-trees' model and illustrate our method using three popular survival models. Our non-parametric method incorporates both additive and interaction effects between genes, which results in high predictive accuracy compared with other methods. In addition, our method provides model-free variable selection of important prognostic markers based on controlling the false discovery rates, thus providing a unified procedure to select relevant genes and predict survivor functions. RESULTS We assess the performance of our method on several simulated and real microarray datasets. We show that our method selects genes potentially related to the development of the disease and yields predictive performance that is very competitive with many other existing methods. AVAILABILITY http://works.bepress.com/veera/1/.
Affiliation(s)
- Vinicius Bonato
- Pfizer Inc., Groton, CT 06340, Department of Biostatistics, Department of Bioinformatics and Computational Biology, Department of Radiation Oncology and Department of Pathology, The University of Texas, M. D. Anderson Cancer Center, Houston, TX 77030, USA
- Veerabhadran Baladandayuthapani
- Pfizer Inc., Groton, CT 06340, Department of Biostatistics, Department of Bioinformatics and Computational Biology, Department of Radiation Oncology and Department of Pathology, The University of Texas, M. D. Anderson Cancer Center, Houston, TX 77030, USA. *To whom correspondence should be addressed
- Bradley M. Broom
- Pfizer Inc., Groton, CT 06340, Department of Biostatistics, Department of Bioinformatics and Computational Biology, Department of Radiation Oncology and Department of Pathology, The University of Texas, M. D. Anderson Cancer Center, Houston, TX 77030, USA
- Erik P. Sulman
- Pfizer Inc., Groton, CT 06340, Department of Biostatistics, Department of Bioinformatics and Computational Biology, Department of Radiation Oncology and Department of Pathology, The University of Texas, M. D. Anderson Cancer Center, Houston, TX 77030, USA
- Kenneth D. Aldape
- Pfizer Inc., Groton, CT 06340, Department of Biostatistics, Department of Bioinformatics and Computational Biology, Department of Radiation Oncology and Department of Pathology, The University of Texas, M. D. Anderson Cancer Center, Houston, TX 77030, USA
- Kim-Anh Do
- Pfizer Inc., Groton, CT 06340, Department of Biostatistics, Department of Bioinformatics and Computational Biology, Department of Radiation Oncology and Department of Pathology, The University of Texas, M. D. Anderson Cancer Center, Houston, TX 77030, USA
41
Abstract
In cancer research, high-throughput genomic studies have been extensively conducted, searching for markers associated with cancer diagnosis, prognosis and variation in response to treatment. In this article, we analyze cancer prognosis studies and investigate ranking markers based on their marginal prognosis power. To avoid ambiguity, we focus on microarray gene expression studies where genes are the markers, but note that the methodology and results are applicable to other high-throughput studies. The objectives of this study are 2-fold. First, we investigate ranking markers under three commonly adopted semiparametric models, namely the Cox, accelerated failure time and additive risk models. Data analysis shows that the ranking may vary significantly under different models. Second, we describe a nonparametric concordance measure, which has roots in the time-dependent ROC (receiver operating characteristic) framework and relies on much weaker assumptions than the semiparametric models. In simulation, it is shown that ranking using the concordance measure is not sensitive to model specification whereas ranking under the semiparametric models is. In data analysis, the concordance measure generates rankings significantly different from those under the semiparametric models.
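As a rough illustration of ranking a marker by a concordance-type measure, the following Python sketch computes a simple pairwise concordance for right-censored data (a Harrell-type C-index). This is a simplified stand-in and not the time-dependent ROC-based measure the abstract describes; the function name `concordance_index` and the toy data are assumptions introduced here for illustration only.

```python
def concordance_index(times, events, scores):
    """Pairwise concordance between risk scores and right-censored survival times.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    scores : marker values, where a higher value is read as higher risk
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable only when subject i has an observed event
            # strictly before the follow-up time of subject j.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0   # earlier failure has the higher score
                elif scores[i] == scores[j]:
                    concordant += 0.5   # tied scores count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Hypothetical toy data: four subjects, one of them censored.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 0, 1, 1]
marker = [3.1, 2.5, 1.0, 0.2]

c_marker = concordance_index(times, events, marker)
c_reversed = concordance_index(times, events, [-m for m in marker])
c_constant = concordance_index(times, events, [0.0] * len(times))
```

Because the measure uses only the ordering of marker values and event times, it does not depend on a Cox, accelerated failure time, or additive risk specification, which is the sense in which such a ranking can be less sensitive to model choice.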
42
Chen X, Wang L, Ishwaran H. An Integrative Pathway-based Clinical-genomic Model for Cancer Survival Prediction. Stat Probab Lett 2010; 80:1313-1319. [PMID: 21731150 DOI: 10.1016/j.spl.2010.04.011] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 01/19/2023]
Abstract
Prediction models that use gene expression levels are now being proposed for personalized treatment of cancer, but building accurate models that are easy to interpret remains a challenge. In this paper, we describe an integrative clinical-genomic approach that combines both genomic pathway and clinical information. First, we summarize information from genes in each pathway using Supervised Principal Components (SPCA) to obtain pathway-based genomic predictors. Next, we build a prediction model based on clinical variables and pathway-based genomic predictors using Random Survival Forests (RSF). Our rationale for this two-stage procedure is that the underlying disease process may be influenced by environmental exposure (measured by clinical variables) and perturbations in different pathways (measured by pathway-based genomic variables), as well as their interactions. Using two cancer microarray datasets, we show that the pathway-based clinical-genomic model outperforms gene-based clinical-genomic models, with improved prediction accuracy and interpretability.
Affiliation(s)
- Xi Chen
- Division of Cancer Biostatistics, Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA
43
A Bayesian hierarchical model for high-dimensional meta-analysis. Methods Mol Biol 2010. [PMID: 20652520 DOI: 10.1007/978-1-60761-580-4_20] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Many biomedical applications are concerned with the problem of selecting important predictors from a high-dimensional set of candidates, with gene expression data as one example. Because the sample size in any single study is usually small, it is important to combine information from multiple studies. In this chapter, we introduce a Bayesian hierarchical modeling approach that models study-to-study heterogeneity explicitly to borrow strength across studies. Using a carefully formulated prior specification, we develop a fast approach to predictor selection and shrinkage estimation for high-dimensional predictors. The proposed approach, which is related to the relevance vector machine (RVM), relies on maximum a posteriori (MAP) estimation to rapidly obtain a sparse estimate. As for the typical RVM, there is an intrinsic thresholding property in which unimportant predictors tend to have their coefficients shrunk to zero. The method is illustrated with an application to selecting genes as predictors of time to an event.
44
Ma S, Huang J, Shi M, Li Y, Shia BC. Semiparametric prognosis models in genomic studies. Brief Bioinform 2010; 11:385-93. [PMID: 20123942 PMCID: PMC2905523 DOI: 10.1093/bib/bbp070] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 10/26/2009] [Revised: 12/04/2009] [Indexed: 11/12/2022]
Abstract
Development of high-throughput technologies makes it possible to survey the whole genome. Genomic studies have been extensively conducted, searching for markers with predictive power for prognosis of complex diseases such as cancer, diabetes and obesity. Most existing statistical analyses are focused on developing marker selection techniques, while little attention is paid to the underlying prognosis models. In this article, we review three commonly used prognosis models, namely the Cox, additive risk and accelerated failure time models. We conduct simulation and show that gene identification can be unsatisfactory under model misspecification. We analyze three cancer prognosis studies under the three models, and show that the gene identification results, prediction performance of all identified genes combined, and reproducibility of each identified gene are model-dependent. We suggest that in practical data analysis, more attention should be paid to the model assumption, and multiple models may need to be considered.
Affiliation(s)
- Shuangge Ma
- School of Public Health, Yale University, USA.
45
Buckley-James boosting for survival analysis with high-dimensional biomarker data. Stat Appl Genet Mol Biol 2010; 9:Article24. [PMID: 20597850 DOI: 10.2202/1544-6115.1550] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 11/18/2022]
Abstract
There has been increasing interest in predicting patients' survival after therapy by investigating gene expression microarray data. In regression and classification models with high-dimensional genomic data, boosting has been successfully applied to build accurate predictive models and conduct variable selection simultaneously. We propose Buckley-James boosting for semiparametric accelerated failure time models with right-censored survival data, which can be used to predict survival of future patients using high-dimensional genomic data. In the spirit of the adaptive LASSO, twin boosting is also incorporated to fit sparser models. The proposed methods provide a unified approach to fitting linear models and non-linear effects models with possible interactions. The methods can perform variable selection and parameter estimation simultaneously. The proposed methods are evaluated by simulations and applied to a recent microarray gene expression data set for patients with diffuse large B-cell lymphoma under the current gold standard therapy.
46
Sensitivity of methods for estimating breeding values using genetic markers to the number of QTL and distribution of QTL variance. Genet Sel Evol 2010; 42:9. [PMID: 20302681 PMCID: PMC2851578 DOI: 10.1186/1297-9686-42-9] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.5] [Received: 05/12/2009] [Accepted: 03/22/2010] [Indexed: 11/19/2022]
Abstract
The objective of this simulation study was to compare the effect of the number of QTL and distribution of QTL variance on the accuracy of breeding values estimated with genomewide markers (MEBV). Three distinct methods were used to calculate MEBV: a Bayesian Method (BM), Least Angle Regression (LARS) and Partial Least Square Regression (PLSR). The accuracy of MEBV calculated with BM and LARS decreased when the number of simulated QTL increased. The accuracy decreased more when QTL had different variance values than when all QTL had an equal variance. The accuracy of MEBV calculated with PLSR was affected neither by the number of QTL nor by the distribution of QTL variance. Additional simulations and analyses showed that these conclusions were not affected by the number of individuals in the training population, by the number of markers and by the heritability of the trait. Results of this study show that the effect of the number of QTL and distribution of QTL variance on the accuracy of MEBV depends on the method that is used to calculate MEBV.
47
Devarajan K, Zhou Y, Chachra N, Ebrahimi N. A supervised approach for predicting patient survival with gene expression data. PROCEEDINGS. IEEE INTERNATIONAL SYMPOSIUM ON BIOINFORMATICS AND BIOENGINEERING 2010; 2010:26-31. [PMID: 20865131 PMCID: PMC2941901 DOI: 10.1109/bibe.2010.14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2023]
Abstract
Rapid development in genomics in recent years has allowed the simultaneous measurement of the expression levels of thousands of genes using DNA microarrays. This has offered tremendous potential for growth in our understanding of the pathophysiology of many diseases. When microarray studies also contain information about an outcome variable such as time to an event or death, one of the goals of an investigator is to understand how the expression levels of genes (covariates) relate to the time-to-event (referred to as survival time) in the course of a disease. In this article, we consider the case where the number of covariates, p, exceeds the number of observations, N, a setting typical of microarray gene expression data. For a given vector of responses representing survival times of N subjects and the corresponding p × N gene expression matrix, we examine the problem of predicting the survival probability when N ≪ p. This is an ill-conditioned problem further compounded by the presence of possibly censored survival times. We propose a model that combines the partial least squares approach for dimensionality reduction with the accelerated failure time model, a widely used log-linear model for linking censored survival time to covariates. We develop parametric methods to account for censoring as well as for predicting patient survival probabilities. We illustrate the applicability of our methods using cancer microarray data and explore the biological relevance of our results using pathway analysis. Finally, we evaluate the performance of our methods using extensive simulation studies.
Affiliation(s)
- Karthik Devarajan
- Division of Population Science, Fox Chase Cancer Center, Philadelphia, PA 19111
- Yan Zhou
- Division of Population Science, Fox Chase Cancer Center, Philadelphia, PA 19111
- Nader Ebrahimi
- Division of Statistics, Northern Illinois University, DeKalb, IL 60115
48
Nguyen TS, Rojo J. Dimension reduction of microarray gene expression data: the accelerated failure time model. J Bioinform Comput Biol 2009; 7:939-54. [PMID: 20014472 PMCID: PMC2796584 DOI: 10.1142/s0219720009004412] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 04/29/2009] [Revised: 06/26/2009] [Accepted: 08/02/2009] [Indexed: 03/13/2024]
Abstract
The construction of the components of Partial Least Squares (PLS) is based on the maximization of the covariance/correlation between linear combinations of the predictors and the response. However, the usual Pearson correlation is influenced by outliers in the response or in the predictors. To cope with outliers, we replace the Pearson correlation with the Spearman rank correlation in the optimization criteria of PLS. The rank-based method of PLS is insensitive to outlying values in both the predictors and response, and incorporates the censoring information by using an approach of Nguyen and Rocke (2004) and two approaches of reweighting and mean imputation of Datta et al. (2007). The performance of the rank-based approaches of PLS, denoted by Rank-based Modified Partial Least Squares (RMPLS), Rank-based Reweighted Partial Least Squares (RRWPLS), and Rank-based Mean-Imputation Partial Least Squares (RMIPLS), is investigated in a simulation study and on four real datasets, under an Accelerated Failure Time (AFT) model, against their un-ranked counterparts, and several other dimension reduction techniques. The results indicate that RMPLS is a better dimension reduction method than other variants of PLS as well as other considered methods in terms of the minimized cross-validation error of fit and the mean squared error of fit in the presence of outliers in the response, and is comparable to other variants of PLS in the absence of outliers. Supplementary Materials are available at http://www.worldscinet.com/jbcb/
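The motivation for swapping Pearson for Spearman correlation can be seen in a small Python sketch. It only contrasts the two correlation measures on a response with one outlier; it is not an implementation of RMPLS, RRWPLS, or RMIPLS, and it ignores censoring. The helper names `ranks`, `pearson`, and `spearman` and the toy data are assumptions made for this illustration.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical predictor and response; the second response has one outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_clean = [1.1, 2.0, 2.9, 4.2, 5.0]
y_outlier = [1.1, 2.0, 2.9, 4.2, 500.0]

pearson_clean = pearson(x, y_clean)
pearson_outlier = pearson(x, y_outlier)      # pulled down by the outlier
spearman_outlier = spearman(x, y_outlier)    # unchanged: the ordering is intact
```

In this toy case the outlier lowers the Pearson correlation noticeably while the Spearman correlation stays at 1, since the rank ordering of the response is unaffected.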
Affiliation(s)
- Tuan S. Nguyen
- Statistics Department, MS 138, Rice University, 6100 Main Street, Houston, Texas 77005
- Javier Rojo
- Statistics Department, MS 138, Rice University, 6100 Main Street, Houston, Texas 77005
49
Abstract
SUMMARY In the presence of high-dimensional predictors, it is challenging to develop reliable regression models that can be used to accurately predict future outcomes. Further complications arise when the outcome of interest is an event time, which is often not fully observed due to censoring. In this article, we develop robust prediction models for event time outcomes by regularizing Gehan's estimator for the accelerated failure time (AFT) model (Tsiatis, 1996, Annals of Statistics 18, 305-328) with the least absolute shrinkage and selection operator (LASSO) penalty. Unlike existing methods based on inverse probability weighting and the Buckley and James estimator (Buckley and James, 1979, Biometrika 66, 429-436), the proposed approach does not require additional assumptions about the censoring and always yields a solution that is convergent. Furthermore, the proposed estimator leads to a stable regression model for prediction even if the AFT model fails to hold. To facilitate the adaptive selection of the tuning parameter, we detail an efficient numerical algorithm for obtaining the entire regularization path. The proposed procedures are applied to a breast cancer dataset to derive a reliable regression model for predicting patient survival based on a set of clinical prognostic factors and gene signatures. Finite sample performances of the procedures are evaluated through a simulation study.
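To make the idea of an L1-penalized Gehan criterion concrete, here is a minimal Python sketch that evaluates such an objective for a given coefficient vector. It assumes the commonly used form of the Gehan loss, a sum over pairs of max(0, e_j - e_i) for uncensored subjects i, where e_i is the residual of log survival time, with a 1/n² scaling and penalty weight `lam`. The scaling, the function name `gehan_lasso_objective`, and the toy data are assumptions for illustration; this is not the regularization path algorithm the abstract refers to.

```python
def gehan_lasso_objective(log_times, events, X, beta, lam):
    """Evaluate an L1-penalized Gehan-type loss for an AFT model.

    log_times : log of observed follow-up times
    events    : 1 if the event was observed, 0 if censored
    X         : list of covariate rows, one per subject
    beta      : candidate coefficient vector
    lam       : non-negative LASSO penalty weight (assumed tuning parameter)
    """
    n = len(log_times)
    # AFT residuals: e_i = log(T_i) - x_i' beta
    resid = [
        log_times[i] - sum(b * x for b, x in zip(beta, X[i]))
        for i in range(n)
    ]
    loss = 0.0
    for i in range(n):
        if events[i] == 1:  # only uncensored subjects index the outer sum
            for j in range(n):
                loss += max(0.0, resid[j] - resid[i])
    penalty = lam * sum(abs(b) for b in beta)
    return loss / n ** 2 + penalty


# Hypothetical toy data: log time equals the single covariate exactly.
log_times = [0.0, 1.0, 2.0]
events = [1, 1, 0]
X = [[0.0], [1.0], [2.0]]

obj_at_truth = gehan_lasso_objective(log_times, events, X, [1.0], lam=0.5)
obj_at_zero = gehan_lasso_objective(log_times, events, X, [0.0], lam=0.5)
```

The loss is piecewise linear and convex in the coefficients, which is what makes a LASSO-type path computation tractable in principle; a practical fit would minimize this objective over `beta` rather than evaluate it at fixed points as done here.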
Affiliation(s)
- T Cai
- Department of Biostatistics, Harvard University, Boston, Massachusetts 02115, USA.
50
Johnson BA. Rank-based estimation in the ℓ1-regularized partly linear model for censored outcomes with application to integrated analyses of clinical predictors and gene expression data. Biostatistics 2009; 10:659-66. [PMID: 19553356 DOI: 10.1093/biostatistics/kxp020] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]
Abstract
We consider estimation and variable selection in the partial linear model for censored data. The partial linear model for censored data is a direct extension of the accelerated failure time model, which is itself an important alternative to the proportional hazards model. We extend rank-based lasso-type estimators to a model that may contain nonlinear effects. Variable selection in such a partial linear model has direct application to high-dimensional survival analyses that attempt to adjust for clinical predictors. In the microarray setting, previous methods can adjust for other clinical predictors by assuming that clinical and gene expression data enter the model linearly in the same fashion. Here, we select important variables after adjusting for prognostic clinical variables, but the clinical effects are assumed to be nonlinear. Our estimator is based on stratification and can be extended naturally to account for multiple nonlinear effects. We illustrate the utility of our method through simulation studies and application to the Wisconsin prognostic breast cancer data set.
Affiliation(s)
- Brent A Johnson
- Department of Biostatistics, Emory University, Atlanta, GA 30322, USA.