1. Qu Y, Cheng Y. Volume under the ROC surface for high-dimensional independent screening with ordinal competing risk outcomes. LIFETIME DATA ANALYSIS 2023; 29:735-751. [PMID: 37160816 DOI: 10.1007/s10985-023-09600-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2022] [Accepted: 04/08/2023] [Indexed: 05/11/2023]
Abstract
We propose a screening method for high-dimensional data with ordinal competing risk outcomes, which is time-dependent and model-free. Existing methods are designed for cause-specific variable screening and fail to evaluate how a biomarker is associated with multiple competing events simultaneously. The proposed method utilizes the Volume under the ROC surface (VUS), which measures the concordance between values of a biomarker and event status at certain time points and provides an overall evaluation of the discrimination capacity of a biomarker. We show that the VUS possesses the sure screening property, i.e., true important covariates can be retained with probability tending to one, and the size of the selected set can be bounded with high probability. The VUS appears to be a viable model-free screening metric as compared to some existing methods in simulation studies, and it is especially robust to data contamination. Through an analysis of breast-cancer gene-expression data, we illustrate the unique insights into the overall discriminatory capability provided by the VUS.
Affiliation(s)
- Yang Qu
- Department of Statistics, University of Pittsburgh, Pittsburgh, PA, 15260, USA
- Yu Cheng
- Department of Statistics, University of Pittsburgh, Pittsburgh, PA, 15260, USA
2. Huang TJ, Luedtke A, McKeague IW. Efficient estimation of the maximal association between multiple predictors and a survival outcome. Ann Stat 2023; 51:1965-1988. [PMID: 38405375 PMCID: PMC10888526 DOI: 10.1214/23-aos2313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
This paper develops a new approach to post-selection inference for screening high-dimensional predictors of survival outcomes. Post-selection inference for right-censored outcome data has been investigated in the literature, but much remains to be done to make the methods both reliable and computationally scalable in high dimensions. Machine learning tools are commonly used to provide predictions of survival outcomes, but the estimated effect of a selected predictor suffers from confirmation bias unless the selection is taken into account. The new approach involves the construction of semiparametrically efficient estimators of the linear association between the predictors and the survival outcome, which are used to build a test statistic for detecting the presence of an association between any of the predictors and the outcome. Further, a stabilization technique reminiscent of bagging allows a normal calibration for the resulting test statistic, which enables the construction of confidence intervals for the maximal association between predictors and the outcome and also greatly reduces computational cost. Theoretical results show that this testing procedure is valid even when the number of predictors grows superpolynomially with sample size, and our simulations support this asymptotic guarantee at moderate sample sizes. The new approach is applied to the problem of identifying patterns in viral gene expression associated with the potency of an antiviral drug.
Affiliation(s)
- Tzu-Jung Huang
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center
- Alex Luedtke
- Department of Statistics, University of Washington
3. Salerno S, Li Y. High-Dimensional Survival Analysis: Methods and Applications. ANNUAL REVIEW OF STATISTICS AND ITS APPLICATION 2023; 10:25-49. [PMID: 36968638 PMCID: PMC10038209 DOI: 10.1146/annurev-statistics-032921-022127] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/18/2023]
Abstract
In the era of precision medicine, time-to-event outcomes such as time to death or progression are routinely collected, along with high-throughput covariates. These high-dimensional data defy classical survival regression models, which are either infeasible to fit or likely to incur low predictability due to over-fitting. To overcome this, recent emphasis has been placed on developing novel approaches for feature selection and survival prognostication. We will review various cutting-edge methods that handle survival outcome data with high-dimensional predictors, highlighting recent innovations in machine learning approaches for survival prediction. We will cover the statistical intuitions and principles behind these methods and conclude with extensions to more complex settings, where competing events are observed. We exemplify these methods with applications to the Boston Lung Cancer Survival Cohort study, one of the largest cancer epidemiology cohorts investigating the complex mechanisms of lung cancer.
Affiliation(s)
- Stephen Salerno
- Department of Biostatistics, University of Michigan, Ann Arbor, United States, 48109
- Yi Li
- Department of Biostatistics, University of Michigan, Ann Arbor, United States, 48109
4. Ke C, Bandyopadhyay D, Acunzo M, Winn R. Gene Screening in High-Throughput Right-Censored Lung Cancer Data. ONCO 2022; 2:305-318. [PMID: 37066112 PMCID: PMC10100230 DOI: 10.3390/onco2040017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2023]
Abstract
Background: Advances in sequencing technologies have allowed collection of massive genome-wide information that substantially advances lung cancer diagnosis and prognosis. Identifying influential markers for clinical endpoints of interest has been an indispensable and critical component of the statistical analysis pipeline. However, classical variable selection methods are not feasible or reliable for high-throughput genetic data. Our objective is to propose a model-free gene screening procedure for high-throughput right-censored data, and to develop a predictive gene signature for lung squamous cell carcinoma (LUSC) with the proposed procedure. Methods: A gene screening procedure was developed based on a recently proposed independence measure. The Cancer Genome Atlas (TCGA) data on LUSC were then studied. The screening procedure was conducted to narrow down the set of influential genes to 378 candidates. A penalized Cox model was then fitted to the reduced set, which further identified a 6-gene signature for LUSC prognosis. The 6-gene signature was validated on datasets from the Gene Expression Omnibus. Results: Both model-fitting and validation results reveal that our method selected influential genes that lead to biologically sensible findings as well as better predictive performance, compared to existing alternatives. According to our multivariable Cox regression analysis, the 6-gene signature was indeed a significant prognostic factor (p-value < 0.001) while controlling for clinical covariates. Conclusions: Gene screening as a fast dimension reduction technique plays an important role in analyzing high-throughput data. The main contribution of this paper is to introduce a fundamental yet pragmatic model-free gene screening approach that aids statistical analysis of right-censored cancer data, and to provide a lateral comparison with other available methods in the context of LUSC.
Affiliation(s)
- Chenlu Ke
- Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, VA 23284, USA
- Dipankar Bandyopadhyay
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA 23284, USA
- Correspondence: Tel. +1-804-827-2058
- Mario Acunzo
- Department of Internal Medicine, Virginia Commonwealth University, Richmond, VA 23284, USA
- Robert Winn
- Massey Cancer Center, Virginia Commonwealth University, Richmond, VA 23284, USA
5. Song F, Lai P, Shen B, Zhu L. Model free feature screening for ultrahigh dimensional covariates with right censored outcomes. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2020.1775848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Affiliation(s)
- Fengli Song
- School of Mathematics and Statistics, Nanjing University of Information Science and Technology, Nanjing, China
- Peng Lai
- School of Mathematics and Statistics, Nanjing University of Information Science and Technology, Nanjing, China
- Baohua Shen
- School of Mathematics and Statistics, Nanjing University of Information Science and Technology, Nanjing, China
- Lianhua Zhu
- School of Mathematics and Statistics, Nanjing University of Information Science and Technology, Nanjing, China
6. Qu L, Wang X, Sun L. Variable screening for varying coefficient models with ultrahigh-dimensional survival data. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
7. Zhong W, Wang J, Chen X. Censored mean variance sure independence screening for ultrahigh dimensional survival data. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2021.107206] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/31/2023]
8. Li J, Yu T, Lv J, Lee MT. Semiparametric model averaging prediction for lifetime data via hazards regression. J R Stat Soc Ser C Appl Stat 2021. [DOI: 10.1111/rssc.12502] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/23/2022]
Affiliation(s)
- Jialiang Li
- Department of Statistics and Applied Probability, National University of Singapore, Singapore
- Tonghui Yu
- Department of Statistics and Applied Probability, National University of Singapore, Singapore
- Jing Lv
- Southwest University, Chongqing, China
9. Yi GY, He W, Carroll RJ. Feature screening with large-scale and high-dimensional survival data. Biometrics 2021; 78:894-907. [PMID: 33881782 DOI: 10.1111/biom.13479] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/27/2020] [Revised: 01/27/2021] [Accepted: 04/07/2020] [Indexed: 11/27/2022]
Abstract
Data with a huge size present great challenges in modeling, inferences, and computation. In handling big data, much attention has been directed to settings with "large p small n", and relatively less work has been done to address problems with p and n being both large, though data with such a feature have now become more accessible than before, where p represents the number of variables and n stands for the sample size. The big volume of data does not automatically ensure good quality of inferences because a large number of unimportant variables may be collected in the process of gathering informative variables. To carry out valid statistical analysis, it is imperative to screen out noisy variables that have no predictive value for explaining the outcome variable. In this paper, we develop a screening method for handling large-sized survival data, where the sample size n is large and the dimension p of covariates is of non-polynomial order of the sample size n, or the so-called NP-dimension. We rigorously establish theoretical results for the proposed method and conduct numerical studies to assess its performance. Our research offers multiple extensions of existing work and enlarges the scope of high-dimensional data analysis. The proposed method capitalizes on the connections among useful regression settings and offers a computationally efficient screening procedure. Our method can be applied to different situations with large-scale data including genomic data.
Affiliation(s)
- Grace Y Yi
- Department of Statistical and Actuarial Sciences, Department of Computer Science, University of Western Ontario, London, Ontario, Canada
- Wenqing He
- Department of Statistical and Actuarial Sciences, University of Western Ontario, London, Ontario, Canada
- Raymond J Carroll
- Department of Statistics, Texas A&M University, College Station, Texas, USA; School of Mathematical and Physical Sciences, University of Technology Sydney, Broadway, Australia
10. An efficient algorithm for joint feature screening in ultrahigh-dimensional Cox’s model. Comput Stat 2020. [DOI: 10.1007/s00180-020-01032-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
11. Lu S, Chen X, Xu S, Liu C. Joint model-free feature screening for ultra-high dimensional semi-competing risks data. Comput Stat Data Anal 2020. [DOI: 10.1016/j.csda.2020.106942] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/24/2022]
12. Huang TJ, McKeague IW, Qian M. Marginal screening for high-dimensional predictors of survival outcomes. Stat Sin 2019; 29:2105-2139. [PMID: 31938013 PMCID: PMC6959482 DOI: 10.5705/ss.202017.0298] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/10/2023]
Abstract
This study develops a marginal screening test to detect the presence of significant predictors for a right-censored time-to-event outcome under a high-dimensional accelerated failure time (AFT) model. Establishing a rigorous screening test in this setting is challenging, because of the right censoring and the post-selection inference. In the latter case, an implicit variable selection step needs to be included to avoid inflating the Type-I error. A prior study solved this problem by constructing an adaptive resampling test under an ordinary linear regression. To accommodate right censoring, we develop a new approach based on a maximally selected Koul-Susarla-Van Ryzin estimator from a marginal AFT working model. A regularized bootstrap method is used to calibrate the test. Our test is more powerful and less conservative than both a Bonferroni correction of the marginal tests and other competing methods. The proposed method is evaluated in simulation studies and applied to two real data sets.
Affiliation(s)
- Min Qian
- Department of Biostatistics, Columbia University
13. Hong HG, Zheng Q, Li Y. Forward regression for Cox models with high-dimensional covariates. J MULTIVARIATE ANAL 2019; 173:268-290. [PMID: 31007300 PMCID: PMC6469712 DOI: 10.1016/j.jmva.2019.02.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 02/08/2023]
Abstract
Forward regression, a classical variable screening method, has been widely used for model building when the number of covariates is relatively low. However, forward regression is seldom used in high-dimensional settings because of the cumbersome computation and unknown theoretical properties. Some recent works have shown that forward regression, coupled with an extended Bayesian information criterion (EBIC)-based stopping rule, can consistently identify all relevant predictors in high-dimensional linear regression settings. However, the results are based on the sum of residual squares from linear models and it is unclear whether forward regression can be applied to more general regression settings, such as Cox proportional hazards models. We introduce a forward variable selection procedure for Cox models. It selects important variables sequentially according to the increment of partial likelihood, with an EBIC stopping rule. To our knowledge, this is the first study that investigates the partial likelihood-based forward regression in high-dimensional survival settings and establishes selection consistency results. We show that, if the dimension of the true model is finite, forward regression can discover all relevant predictors within a finite number of steps and their order of entry is determined by the size of the increment in partial likelihood. As partial likelihood is not a regular density-based likelihood, we develop some new theoretical results on partial likelihood and use these results to establish the desired sure screening properties. The practical utility of the proposed method is examined via extensive simulations and analysis of a subset of the Boston Lung Cancer Survival Cohort study, a hospital-based study for identifying biomarkers related to lung cancer patients' survival.
Affiliation(s)
- Hyokyoung G. Hong
- Department of Statistics and Probability, Michigan State University, 19 Red Cedar Road, East Lansing, MI 48823, USA
- Qi Zheng
- Department of Bioinformatics and Biostatistics, University of Louisville, 485 East Gray Street, Louisville, KY 40202, USA
- Yi Li
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109-2029, USA
14. Edelmann D, Hummel M, Hielscher T, Saadati M, Benner A. Marginal variable screening for survival endpoints. Biom J 2019; 62:610-626. [DOI: 10.1002/bimj.201800269] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/31/2018] [Revised: 05/23/2019] [Accepted: 06/04/2019] [Indexed: 01/31/2023]
Affiliation(s)
- Dominic Edelmann
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Manuela Hummel
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Thomas Hielscher
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Maral Saadati
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Axel Benner
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
15. Hong HG, Li Y. Feature selection of ultrahigh-dimensional covariates with survival outcomes: a selective review. APPLIED MATHEMATICS : A JOURNAL OF CHINESE UNIVERSITIES 2017; 32:379-396. [PMID: 29683128 PMCID: PMC5906071 DOI: 10.1007/s11766-017-3547-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 04/28/2023]
Abstract
Many modern biomedical studies have yielded survival data with high-throughput predictors. The goals of scientific research often lie in identifying predictive biomarkers, understanding biological mechanisms and making accurate and precise predictions. Variable screening is a crucial first step in achieving these goals. This work conducts a selective review of feature screening procedures for survival data with ultrahigh dimensional covariates. We present the main methodologies, along with the key conditions that ensure sure screening properties. The practical utility of these methods is examined via extensive simulations. We conclude the review with some future opportunities in this field.
Affiliation(s)
- Hyokyoung Grace Hong
- Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, U.S.A
- Yi Li
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, U.S.A