1. Bindele HF, Denhere M, Sun W. Generalized signed-rank estimation and selection for the functional linear model. STATISTICS-ABINGDON 2022. [DOI: 10.1080/02331888.2022.2084094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Huybrechts F. Bindele: Department of Mathematics and Statistics, University of South Alabama, Mobile, AL, USA
- Wei Sun: Auburn University, Auburn, AL, USA
2. Suder PM, Molstad AJ. Scalable algorithms for semiparametric accelerated failure time models in high dimensions. Stat Med 2022; 41:933-949. [PMID: 35014701 DOI: 10.1002/sim.9264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2021] [Revised: 09/21/2021] [Accepted: 10/29/2021] [Indexed: 11/11/2022]
Abstract
Semiparametric accelerated failure time (AFT) models are a useful alternative to Cox proportional hazards models, especially when the assumption of constant hazard ratios is untenable. However, rank-based criteria for fitting AFT models are often nondifferentiable, which poses a computational challenge in high-dimensional settings. In this article, we propose a new alternating direction method of multipliers algorithm for fitting semiparametric AFT models by minimizing a penalized rank-based loss function. Our algorithm scales well in both the number of subjects and the number of predictors, and can easily accommodate a wide range of popular penalties. To improve the selection of tuning parameters, we propose a new criterion which avoids some common problems in cross-validation with censored responses. Through extensive simulation studies, we show that our algorithm and software are much faster than existing methods (which can only be applied to special cases), and we show that estimators which minimize a penalized rank-based criterion often outperform alternative estimators which minimize penalized weighted least squares criteria. Application to nine cancer datasets further demonstrates that rank-based estimators of semiparametric AFT models are competitive with estimators assuming proportional hazards in high-dimensional settings, whereas weighted least squares estimators are often not. A software package implementing the algorithm, along with a set of auxiliary functions, is available for download at github.com/ajmolstad/penAFT.
Affiliation(s)
- Piotr M Suder: Department of Statistics, University of Florida, Gainesville, Florida, USA
- Aaron J Molstad: Department of Statistics, University of Florida, Gainesville, Florida, USA; Genetics Institute, University of Florida, Gainesville, Florida, USA
3. Variable selection in partially linear additive hazards model with grouped covariates and a diverging number of parameters. Comput Stat 2021. [DOI: 10.1007/s00180-020-01062-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
4. Huang L, Kopciuk K, Lu X. A group bridge approach for component selection in nonparametric accelerated failure time additive regression model. COMMUN STAT-THEOR M 2021. [DOI: 10.1080/03610926.2019.1651861] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/26/2022]
Affiliation(s)
- Longlong Huang: Department of Mathematics and Statistics, University of Calgary, Calgary, Alberta, Canada
- Karen Kopciuk: Department of Mathematics and Statistics, University of Calgary, Calgary, Alberta, Canada; Department of Cancer Epidemiology and Prevention Research, Alberta Health Services, Calgary, Alberta, Canada
- Xuewen Lu: Department of Mathematics and Statistics, University of Calgary, Calgary, Alberta, Canada
5. Huang H, Shangguan J, Li X, Liang H. High-dimensional single-index models with censored responses. Stat Med 2020; 39:2743-2754. [PMID: 32379359 DOI: 10.1002/sim.8571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2019] [Revised: 04/04/2020] [Accepted: 04/15/2020] [Indexed: 11/09/2022]
Abstract
In this article, we study the estimation of high-dimensional single-index models when the response variable is censored. We combine the estimation methods for high-dimensional single-index models (without censoring) and for univariate nonparametric models with randomly censored responses to estimate the index parameters and the link function, and we apply the proposed methods to analyze a genomic dataset from a study of diffuse large B-cell lymphoma. We evaluate the finite sample performance of the proposed procedures via simulation studies and establish large sample theory for the proposed estimators of the index parameter and the nonparametric link function under certain regularity conditions.
Affiliation(s)
- Hailin Huang: Department of Statistics, George Washington University, Washington, District of Columbia, USA
- Jizi Shangguan: Department of Statistics, George Washington University, Washington, District of Columbia, USA
- Xinmin Li: School of Mathematics and Statistics, Qingdao University, Shandong, China
- Hua Liang: Department of Statistics, George Washington University, Washington, District of Columbia, USA
6. Bindele HF, Abebe A, Zeng P. Robust estimation and selection for single-index regression model. J STAT COMPUT SIM 2019. [DOI: 10.1080/00949655.2019.1581781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Affiliation(s)
- Huybrechts F. Bindele: Department of Mathematics and Statistics, University of South Alabama, Mobile, AL, USA
- Asheber Abebe: Department of Mathematics and Statistics, Auburn University, Auburn, AL, USA
- Peng Zeng: Department of Mathematics and Statistics, Auburn University, Auburn, AL, USA
7. Chai H, Zhang Q, Huang J, Ma S. Inference for low-dimensional covariates in a high-dimensional accelerated failure time model. Stat Sin 2019; 29:877-894. [PMID: 31073263 PMCID: PMC6502249 DOI: 10.5705/ss.202016.0449] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/06/2022]
Abstract
Data with high-dimensional covariates are now commonly encountered. Compared to other types of responses, research on high-dimensional data with censored survival responses is still relatively limited, and most of the existing studies have been focused on estimation and variable selection. In this study, we consider data with a censored survival response, a set of low-dimensional covariates of main interest, and a set of high-dimensional covariates that may also affect survival. The accelerated failure time model is adopted to describe survival. The goal is to conduct inference for the effects of low-dimensional covariates, while properly accounting for the high-dimensional covariates. A penalization-based procedure is developed, and its validity is established under mild and widely adopted conditions. Simulation suggests satisfactory performance of the proposed procedure, and the analysis of two cancer genetic datasets demonstrates its practical applicability.
8. Johnson BA, Long Q, Huang Y, Chansky K, Redman M. Model selection and inference for censored lifetime medical expenditures. Biometrics 2016; 72:731-41. [PMID: 26689300 PMCID: PMC5741192 DOI: 10.1111/biom.12464] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/01/2015] [Revised: 11/01/2015] [Accepted: 11/01/2015] [Indexed: 11/30/2022]
Abstract
Identifying factors associated with increased medical cost is important for many micro- and macro-institutions, including the national economy and public health, insurers and the insured. However, assembling comprehensive national databases that include both the cost and individual-level predictors can prove challenging. Alternatively, one can use data from smaller studies with the understanding that conclusions drawn from such analyses may be limited to the participant population. At the same time, smaller clinical studies have limited follow-up and lifetime medical cost may not be fully observed for all study participants. In this context, we develop new model selection methods and inference procedures for secondary analyses of clinical trial data when lifetime medical cost is subject to induced censoring. Our model selection methods extend a theory of penalized estimating function to a calibration regression estimator tailored for this data type. Next, we develop a novel inference procedure for the unpenalized regression estimator using perturbation and resampling theory. Then, we extend this resampling plan to accommodate regularized coefficient estimation of censored lifetime medical cost and develop postselection inference procedures for the final model. Our methods are motivated by data from Southwest Oncology Group Protocol 9509, a clinical trial of patients with advanced nonsmall cell lung cancer, and our models of lifetime medical cost are specific to this population. But the methods presented in this article are built on rather general techniques and could be applied to larger databases as those data become available.
Affiliation(s)
- Brent A Johnson: Department of Biostatistics and Computational Biology, University of Rochester, Rochester, New York, U.S.A.
- Qi Long: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, U.S.A.
- Yijian Huang: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, U.S.A.
- Kari Chansky: The Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A.
- Mary Redman: The Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A.
9. Kim S, Halabi S. High Dimensional Variable Selection with Error Control. BIOMED RESEARCH INTERNATIONAL 2016; 2016:8209453. [PMID: 27597974 PMCID: PMC5002494 DOI: 10.1155/2016/8209453] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 04/03/2016] [Accepted: 05/25/2016] [Indexed: 11/17/2022]
Abstract
Background. Iterative sure independence screening (ISIS) is a popular method for selecting important variables while retaining most of the informative variables relevant to the outcome in high-throughput data. However, it is not only computationally intensive but may also lead to a high false discovery rate (FDR). We propose to use the FDR as a screening method to reduce the high dimension to a lower dimension and to control the FDR with three popular variable selection methods: LASSO, SCAD, and MCP. Method. The three methods with the proposed screenings were applied to prostate cancer data with presence of metastasis as the outcome. Results. Simulations showed that the three variable selection methods with the proposed screenings controlled the predefined FDR and produced high area under the receiver operating characteristic curve (AUROC) scores. In applying these methods to the prostate cancer example, LASSO and MCP selected 12 and 8 genes and produced AUROC scores of 0.746 and 0.764, respectively. Conclusions. We demonstrated that the variable selection methods with the sequential use of FDR and ISIS not only controlled the predefined FDR in the final models but also had relatively high AUROC scores.
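The abstract does not say how the FDR screening step is computed. As a minimal sketch, assuming marginal per-gene p-values are already available and assuming the Benjamini-Hochberg step-up rule stands in for the authors' screening step, the reduction from a high to a lower dimension could look like this in Python. The function name `bh_screen` and the example p-values are hypothetical.

```python
import numpy as np

def bh_screen(p_values, q=0.05):
    """Return sorted indices of features kept by the Benjamini-Hochberg rule.

    Hypothetical helper: the step-up rule is an assumed stand-in for the
    FDR screening described in the abstract, not the authors' exact method.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # feature indices, smallest p first
    thresholds = q * np.arange(1, m + 1) / m   # BH critical values q * k / m
    passed = p[order] <= thresholds
    if not passed.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(passed)[0])          # largest rank meeting its threshold
    return np.sort(order[: k + 1])             # keep every feature up to that rank

# Made-up marginal p-values for four candidate features.
kept = bh_screen([0.5, 0.001, 0.9, 0.008], q=0.05)
print(kept)
```

Only the features in `kept` would then be passed to a penalized selector such as LASSO, SCAD, or MCP.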
Affiliation(s)
- Sangjin Kim: Department of Biostatistics and Bioinformatics, Duke University Medical Center, Box 2717, Durham, NC 27710, USA
- Susan Halabi: Department of Biostatistics and Bioinformatics, Duke University Medical Center, Box 2717, Durham, NC 27710, USA
10. Wu C, Ma S. A selective review of robust variable selection with applications in bioinformatics. Brief Bioinform 2015; 16:873-83. [PMID: 25479793 PMCID: PMC4570200 DOI: 10.1093/bib/bbu046] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 09/01/2014] [Revised: 10/20/2014] [Indexed: 11/13/2022]
Abstract
A vast amount of data has been, and continues to be, generated in bioinformatics studies. In the analysis of such data, standard modeling approaches can be challenged by heavy-tailed errors and outliers in response variables, contamination in predictors (which may be caused by, for instance, technical problems in microarray gene expression studies), model misspecification and other problems. Robust methods are needed to tackle these challenges. When there are a large number of predictors, variable selection can be as important as estimation. As a generic variable selection and regularization tool, penalization has been extensively adopted. In this article, we provide a selective review of robust penalized variable selection approaches especially designed for high-dimensional data from bioinformatics and biomedical studies. We discuss the robust loss functions, penalty functions and computational algorithms. The theoretical properties and implementation are also briefly examined. Application examples of the robust penalization approaches in representative bioinformatics and biomedical studies are also illustrated.
11. Adjusted regularized estimation in the accelerated failure time model with high dimensional covariates. J MULTIVARIATE ANAL 2013. [DOI: 10.1016/j.jmva.2013.07.011] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/23/2022]
12. Chung M, Long Q, Johnson BA. A Tutorial on Rank-based Coefficient Estimation for Censored Data in Small- and Large-Scale Problems. STATISTICS AND COMPUTING 2013; 23:601-614. [PMID: 23956500 PMCID: PMC3742389 DOI: 10.1007/s11222-012-9333-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 06/02/2023]
Abstract
The analysis of survival endpoints subject to right-censoring is an important research area in statistics, particularly among econometricians and biostatisticians. The two most popular semiparametric models are the proportional hazards model and the accelerated failure time (AFT) model. Rank-based estimation in the AFT model is computationally challenging due to optimization of a non-smooth loss function. Previous work has shown that rank-based estimators may be written as solutions to linear programming (LP) problems. However, the size of the LP problem is $O(n^2 + p)$ subject to $n^2$ linear constraints, where $n$ denotes sample size and $p$ denotes the dimension of parameters. As $n$ and/or $p$ increases, the feasibility of such a solution in practice becomes questionable. Among data mining and statistical learning enthusiasts, there is interest in extending ordinary regression coefficient estimators for low dimensions into high-dimensional data mining tools through regularization. Applying this recipe to rank-based coefficient estimators leads to formidable optimization problems which may be avoided through smooth approximations to non-smooth functions. We review smooth approximations and quasi-Newton methods for rank-based estimation in AFT models. The computational cost of our method is substantially smaller than that of the corresponding LP problem, and the method can be applied to small- or large-scale problems similarly. The algorithm described here allows one to couple rank-based estimation for censored data with virtually any regularization and is exemplified through four case studies.
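To make the non-smooth loss concrete, here is a minimal Python sketch of the Gehan rank-based loss commonly used for AFT models, together with one smooth surrogate. The softplus smoothing, the bandwidth `s`, the function names, and the simulated data are assumptions for illustration; they are not necessarily the approximation reviewed in the tutorial.

```python
import numpy as np

def gehan_loss(beta, X, log_y, delta):
    """Gehan loss: (1/n^2) * sum_i sum_j delta_i * max(-(e_i - e_j), 0)."""
    e = log_y - X @ beta                    # residuals on the log-time scale
    diff = e[:, None] - e[None, :]          # diff[i, j] = e_i - e_j
    neg_part = np.maximum(-diff, 0.0)       # non-smooth at diff == 0
    return float((delta[:, None] * neg_part).sum() / len(e) ** 2)

def smoothed_gehan_loss(beta, X, log_y, delta, s=0.05):
    """Assumed surrogate: replace max(-d, 0) with s * log(1 + exp(-d / s))."""
    e = log_y - X @ beta
    diff = e[:, None] - e[None, :]
    soft = s * np.logaddexp(0.0, -diff / s)  # numerically stable softplus
    return float((delta[:, None] * soft).sum() / len(e) ** 2)

# Simulated right-censored data (illustrative only).
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.0])
log_t = X @ beta_true + rng.normal(scale=0.5, size=n)  # log failure times
log_c = rng.normal(loc=1.0, size=n)                    # log censoring times
delta = (log_t <= log_c).astype(float)                 # 1 = event observed
log_y = np.minimum(log_t, log_c)                       # observed log time

exact = gehan_loss(beta_true, X, log_y, delta)
smooth = smoothed_gehan_loss(beta_true, X, log_y, delta)
print(exact, smooth)
```

Because the surrogate is differentiable in `beta`, it can be handed to a quasi-Newton routine such as `scipy.optimize.minimize` with `method="BFGS"`, optionally with a penalty term added. The pairwise matrix `diff` has $n^2$ entries, so this direct version is only practical for moderate $n$.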
Affiliation(s)
- Matthias Chung: Department of Mathematics, Texas State University, San Marcos, TX 78666, U.S.A.
- Qi Long: Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, U.S.A.
- Brent A. Johnson: Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, U.S.A.
13. Ma S, Du P. Variable selection in partly linear regression model with diverging dimensions for right censored data. Stat Sin 2012; 22:1003-1020. [PMID: 23956611 PMCID: PMC3744344 DOI: 10.5705/ss.2010.267] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/06/2022]
Abstract
Recent biomedical studies often measure two distinct sets of risk factors: low-dimensional clinical and environmental measurements, and high-dimensional gene expression measurements. For prognosis studies with right censored response variables, we propose a semiparametric regression model whose covariate effects have two parts: a nonparametric part for low-dimensional covariates, and a parametric part for high-dimensional covariates. A penalized variable selection approach is developed. The selection of parametric covariate effects is achieved using an iterated Lasso approach, for which we prove the selection consistency property. The nonparametric component is estimated using a sieve approach. An empirical model selection tool for the nonparametric component is derived based on the Kullback-Leibler geometry. Numerical studies show that the proposed approach has satisfactory performance. Application to a lymphoma study illustrates the proposed method.
Affiliation(s)
- Shuangge Ma: School of Public Health, Yale University, New Haven, CT 06520, U.S.A.
- Pang Du: Department of Statistics, Virginia Tech, Blacksburg, VA 24061, U.S.A.
14. Long Q, Chung M, Moreno CS, Johnson BA. Risk Prediction for Prostate Cancer Recurrence Through Regularized Estimation with Simultaneous Adjustment for Nonlinear Clinical Effects. Ann Appl Stat 2011; 5:2003-2023. [PMID: 22081781 PMCID: PMC3212400 DOI: 10.1214/11-aoas458] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
Abstract
In biomedical studies, it is of substantial interest to develop risk prediction scores using high-dimensional data such as gene expression data for clinical endpoints that are subject to censoring. In the presence of well-established clinical risk factors, investigators often prefer a procedure that also adjusts for these clinical variables. While accelerated failure time (AFT) models are a useful tool for the analysis of censored outcome data, they assume that covariate effects on the logarithm of time-to-event are linear, which is often unrealistic in practice. We propose to build risk prediction scores through regularized rank estimation in partly linear AFT models, where high-dimensional data such as gene expression data are modeled linearly and important clinical variables are modeled nonlinearly using penalized regression splines. We show through simulation studies that our model has better operating characteristics compared to several existing models. In particular, we show that there is a non-negligible effect on prediction as well as feature selection when nonlinear clinical effects are misspecified as linear. This work is motivated by a recent prostate cancer study, where investigators collected gene expression data along with established prognostic clinical variables and the primary endpoint is time to prostate cancer recurrence.
We analyzed the prostate cancer data and evaluated prediction performance of several models based on the extended c statistic for censored data, showing that 1) the relationship between the clinical variable, prostate specific antigen, and the prostate cancer recurrence is likely nonlinear, i.e., the time to recurrence decreases as PSA increases and it starts to level off when PSA becomes greater than 11; 2) correct specification of this nonlinear effect improves performance in prediction and feature selection; and 3) addition of gene expression data does not seem to further improve the performance of the resultant risk prediction scores.
Affiliation(s)
- Qi Long: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
- Matthias Chung: Department of Mathematics, Texas State University, San Marcos, TX 78666, USA
- Carlos S. Moreno: Department of Pathology and Laboratory Medicine, Emory University, Atlanta, GA 30322, USA
- Brent A. Johnson: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
15.
Abstract
Dimension reduction, model selection, and variable selection are ubiquitous concepts in modern statistical science, and deriving new methods beyond the scope of current methodology is noteworthy. This article briefly reviews existing regularization methods for penalized least squares and likelihood for survival data and their extension to a certain class of penalized estimating functions. We show that if one's goal is to estimate the entire regularized coefficient path using the observed survival data, then all current strategies fail for the Buckley-James estimating function. We propose a novel two-stage method to estimate and restore the entire Dantzig-regularized coefficient path for censored outcomes in a least-squares framework. We apply our methods to a microarray study of lung adenocarcinoma with sample size n = 200 and p = 1036 gene predictors and find 10 genes that are consistently selected across different criteria and an additional 14 genes that merit further investigation. In simulation studies, we found that the proposed path restoration and variable selection technique has the potential to perform as well as existing methods that begin with a proper convex loss function at the outset.
Affiliation(s)
- Brent A Johnson: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia 30322, USA
16. Zou Y, Zhang J, Qin G. Semiparametric Accelerated Failure Time Partial Linear Model and Its Application to Breast Cancer. Comput Stat Data Anal 2011; 55:1479-1487. [PMID: 21499529 DOI: 10.1016/j.csda.2010.10.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/18/2022]
Abstract
Breast cancer is the most common non-skin cancer in women and the second most common cause of cancer-related death in U.S. women. It is well known that breast cancer survival varies by age at diagnosis. For most cancers, relative survival decreases with age, but breast cancer may have an unusual age pattern. To reveal the pattern of stage risk and age effects, we propose a semiparametric accelerated failure time partial linear model and develop an estimation method based on P-splines and the rank estimation approach. Simulation studies demonstrate that the proposed method is comparable to the parametric approach when the data are not contaminated, and more stable than parametric methods when the data are contaminated. By applying the proposed model and method to the breast cancer data set of Atlantic County, New Jersey, from the SEER program, we reveal significant effects of stage and show that women diagnosed around age 38 have consistently higher survival rates than either younger or older women.
Affiliation(s)
- Yubo Zou: Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC 29208, USA
17
|
|