1. Ge L, Liang B, Hu T, Sun J, Zhao S, Li Y. Variable selection for mixed panel count data under the proportional mean model. Stat Methods Med Res 2023; 32:1728-1748. [PMID: 37401336 DOI: 10.1177/09622802231184637] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 07/05/2023]
Abstract
Mixed panel count data have attracted increasing attention in medical research based on event history studies. When such data arise, one either observes the number of event occurrences or only knows whether the event has happened or not over an observation period. In this article, we discuss variable selection in event history studies given such complex data, for which there does not seem to exist an established procedure. For the problem, we propose a penalized likelihood variable selection procedure and for the implementation, an expectation-maximization algorithm is developed with the use of the coordinate descent algorithm in the M-step. Furthermore, the oracle property of the proposed method is established, and a simulation study is performed and indicates that the proposed method works well in practical scenarios. Finally, the method is applied to identify the risk factors associated with medical non-adherence arising from the Sequenced Treatment Alternatives to Relieve Depression Study.
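As a rough illustration of the coordinate descent step mentioned in the abstract above, the following minimal sketch applies cyclic coordinate descent to a lasso-penalized least-squares problem. This is a simplified stand-in, not the authors' algorithm: their M-step works on a penalized likelihood for mixed panel count data inside an EM loop, which the abstract does not specify in detail. The function names `soft_threshold` and `lasso_coordinate_descent` and the parameters `lam`, `n_sweeps`, and `tol` are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_sweeps=200, tol=1e-10):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Illustrative only: a least-squares loss replaces the penalized
    likelihood used in the cited paper.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-column curvature x_j'x_j / n
    resid = np.asarray(y, dtype=float).copy()  # residual y - X beta (beta = 0)
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # Correlation of column j with the partial residual (coordinate j removed).
            rho = X[:, j] @ resid / n + col_sq[j] * old
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != old:
                resid -= X[:, j] * (new - old)  # keep the residual in sync
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta
```

Each coordinate update has a closed form, which is why coordinate descent is a common choice for penalized M-steps: coefficients whose partial correlation falls below `lam` are set exactly to zero.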
Affiliation(s)
- Lei Ge: Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN, USA
- Baosheng Liang: Department of Biostatistics, School of Public Health, Peking University, Beijing, China
- Tao Hu: School of Mathematical Sciences, Capital Normal University, Beijing, China
- Jianguo Sun: Department of Statistics, University of Missouri, Columbia, MO, USA
- Shishun Zhao: Applied Statistical Research Center, School of Mathematics, Jilin University, Changchun, China
- Yang Li: Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN, USA
2. Liu R, Du M, Sun J. Variable selection for bivariate interval-censored failure time data under linear transformation models. Int J Biostat 2022:ijb-2021-0031. [PMID: 35654407 DOI: 10.1515/ijb-2021-0031] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/03/2021] [Accepted: 04/20/2022] [Indexed: 11/15/2022]
Abstract
Variable selection is needed and performed in almost every field and a large literature on it has been established, especially in the context of linear models or for complete data. Many authors have also investigated the variable selection problem for incomplete data such as right-censored failure time data. In this paper, we discuss variable selection when one faces bivariate interval-censored failure time data arising from a linear transformation model, for which there does not seem to exist an established procedure. For the problem, a penalized maximum likelihood approach is proposed and in particular, a novel Poisson-based EM algorithm is developed for the implementation. The oracle property of the proposed method is established, and the numerical studies suggest that the method works well for practical situations.
Affiliation(s)
- Rong Liu: Center for Applied Statistical Research, School of Mathematics, Jilin University, Changchun 130012, China
- Mingyue Du: Center for Applied Statistical Research, School of Mathematics, Jilin University, Changchun 130012, China
- Jianguo Sun: Department of Statistics, University of Missouri, Columbia, MO, 65211, USA
3. Wang W, Fang L, Li S, Sun J. Variable selection for misclassified current status data under the proportional hazards model. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2022.2050391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Wenshan Wang: Center for Applied Statistical Research, School of Mathematics, Jilin University, Changchun, China
- Lijun Fang: School of Economics and Statistics, Guangzhou University, Guangzhou, China
- Shuwei Li: School of Economics and Statistics, Guangzhou University, Guangzhou, China
- Jianguo Sun: Department of Statistics, University of Missouri, Columbia, Missouri, USA
4. Variable Selection for Generalized Linear Models with Interval-Censored Failure Time Data. MATHEMATICS 2022. [DOI: 10.3390/math10050763] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
Variable selection is often needed in many fields and has been discussed by many authors in various situations. This is especially the case under linear models and when one observes complete data. Among others, one common situation where variable selection is required is to identify important risk factors from a large number of covariates. In this paper, we consider the problem when one observes interval-censored failure time data arising from generalized linear models, for which there does not seem to exist an established method. To address this, we propose a penalized least squares method with the use of an unbiased transformation and the oracle property of the method is established along with the asymptotic normality of the resulting estimators of regression parameters. Simulation studies were conducted and demonstrated that the proposed method performed well for practical situations. In addition, the method was applied to a motivating example about children’s mortality data of Nigeria.
5. Naoum GE, Ho AY, Shui A, Salama L, Goldberg S, Arafat W, Winograd J, Colwell A, Smith BL, Taghian AG. Risk of Developing Breast Reconstruction Complications: A Machine-Learning Nomogram for Individualized Risk Estimation with and without Postmastectomy Radiation Therapy. Plast Reconstr Surg 2022; 149:1e-12e. [PMID: 34758003 DOI: 10.1097/prs.0000000000008635] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Indexed: 01/29/2023]
Abstract
BACKGROUND: The purpose of this study was to create a nomogram using machine learning models predicting risk of breast reconstruction complications with or without postmastectomy radiation therapy. METHODS: Between 1997 and 2017, 1617 breast cancer patients undergoing mastectomy and breast reconstruction were analyzed. Those with autologous, tissue expander/implant, and single-stage direct-to-implant reconstruction were included. Postmastectomy radiation therapy was delivered either with three-dimensional conformal photon or proton therapy. Complication endpoints were defined based on surgical reintervention operative notes as infection/necrosis requiring débridement. For implant-based patients, complications were defined as capsular contracture requiring capsulotomy and implant failure. For each complication endpoint, least absolute shrinkage and selection operator-penalized regression was used to select the subset of predictors associated with the smallest prediction error from 10-fold cross-validation. Nomograms were built using the least absolute shrinkage and selection operator-selected predictors, and internal validation using cross-validation was performed. RESULTS: Median follow-up was 6.6 years. Among 1617 patients, 23 percent underwent autologous reconstruction, 39 percent underwent direct-to-implant reconstruction, and 37 percent underwent tissue expander/implant reconstruction. Among 759 patients who received postmastectomy radiation therapy, 8.3 percent received proton-therapy to the chest wall and nodes and 43 percent received chest wall boost. Internal validation for each model showed an area under the receiver operating characteristic curve of 73 percent for infection, 75 percent for capsular contracture, 76 percent for absolute implant failure, and 68 percent for overall implant failure. Periareolar incisions and complete implant muscle coverage were found to be important predictors for infection and capsular contracture, respectively. In a multivariable analysis, we found that protons compared to no postmastectomy radiation therapy significantly increased capsular contracture risk (OR, 15.3; p < 0.001). This was higher than the effect of photons with electron boost versus no postmastectomy radiation therapy (OR, 2.5; p = 0.01). CONCLUSION: Using machine learning, these nomograms provided prediction of postmastectomy breast reconstruction complications with and without radiation therapy. CLINICAL QUESTION/LEVEL OF EVIDENCE: Risk, III.
Affiliation(s)
- George E Naoum, Alice Y Ho, Amy Shui, Laura Salama, Saveli Goldberg, Waleed Arafat, Jonathan Winograd, Amy Colwell, Barbara L Smith, Alphonse G Taghian: From the Departments of Radiation Oncology, Plastic Surgery, and Surgery and the Biostatistics Center, Massachusetts General Hospital, Harvard Medical School; and Department of Clinical Oncology, Alexandria University
6. Li N, Peng X, Kawaguchi E, Suchard MA, Li G. A scalable surrogate L0 sparse regression method for generalized linear models with applications to large scale data. J Stat Plan Inference 2021. [DOI: 10.1016/j.jspi.2020.12.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/22/2022]
7. Liu Z, Liang M, Grant CN, Spiegelman VS, Wang HG. Interpretable models for high-risk neuroblastoma stratification with multi-cohort copy number profiles. INFORMATICS IN MEDICINE UNLOCKED 2021. [DOI: 10.1016/j.imu.2021.100701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open
8. Yi F, Tang N, Sun J. Simultaneous variable selection and estimation for joint models of longitudinal and failure time data with interval censoring. Biometrics 2020; 78:151-164. [PMID: 33031576 DOI: 10.1111/biom.13387] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/20/2019] [Revised: 09/27/2020] [Accepted: 09/30/2020] [Indexed: 11/27/2022]
Abstract
This paper discusses variable selection in the context of joint analysis of longitudinal data and failure time data. A large literature has been developed for either variable selection or the joint analysis but there exists only limited literature for variable selection in the context of the joint analysis when failure time data are right censored. Corresponding to this, we will consider the situation where instead of right-censored data, one observes interval-censored failure time data, a more general and commonly occurring form of failure time data. For the problem, a class of penalized likelihood-based procedures will be developed for simultaneous variable selection and estimation of relevant covariate effects for both longitudinal and failure time variables of interest. In particular, a Monte Carlo EM (MCEM) algorithm is presented for the implementation of the proposed approach. The proposed method allows for the number of covariates to be diverging with the sample size and is shown to have the oracle property. An extensive simulation study is conducted to assess the finite sample performance of the proposed approach and indicates that it works well in practical situations. An application is also provided.
Affiliation(s)
- Fengting Yi: School of Statistics, Southwestern University of Finance and Economics, Chengdu, China; Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University, Kunming, China
- Niansheng Tang: Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University, Kunming, China
- Jianguo Sun: Department of Statistics, University of Missouri, Columbia, Missouri
9. Wu Q, Zhao H, Zhu L, Sun J. Variable selection for high-dimensional partly linear additive Cox model with application to Alzheimer's disease. Stat Med 2020; 39:3120-3134. [PMID: 32652699 DOI: 10.1002/sim.8594] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 04/27/2019] [Revised: 03/20/2020] [Accepted: 05/13/2020] [Indexed: 11/10/2022]
Abstract
Variable selection has been discussed in many contexts and, in particular, a large literature has been established for the analysis of right-censored failure time data. In this article, we discuss an interval-censored failure time situation where there exist two sets of covariates with one being low-dimensional and having possible nonlinear effects and the other being high-dimensional. For the problem, we present a penalized estimation procedure for simultaneous variable selection and estimation, and in the method, Bernstein polynomials are used to approximate the involved nonlinear functions. Furthermore, for implementation, a coordinate-wise optimization algorithm, which can accommodate most commonly used penalty functions, is developed. A numerical study is performed for the evaluation of the proposed approach and suggests that it works well in practical situations. Finally, the method is applied to an Alzheimer's disease study that motivated this investigation.
Affiliation(s)
- Qiwei Wu: Eli Lilly and Company, Indianapolis, Indiana, USA
- Hui Zhao: School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China
- Liang Zhu: Division of Clinical and Translational Sciences, Department of Internal Medicine, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Jianguo Sun: Department of Statistics, University of Missouri, Columbia, Missouri, USA
10. Li S, Wu Q, Sun J. Penalized estimation of semiparametric transformation models with interval-censored data and application to Alzheimer's disease. Stat Methods Med Res 2019; 29:2151-2166. [PMID: 31718478 DOI: 10.1177/0962280219884720] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Indexed: 11/16/2022]
Abstract
Variable selection or feature extraction is fundamental to identifying important risk factors from a large number of covariates and has applications in many fields. In particular, its applications in failure time data analysis have been recognized and many methods have been proposed for right-censored data. However, developing relevant methods for variable selection becomes more challenging when one confronts interval censoring, which often occurs in practice. In this article, motivated by an Alzheimer's disease study, we develop a variable selection method for interval-censored data with a general class of semiparametric transformation models. Specifically, a novel penalized expectation-maximization algorithm is developed to maximize the complex penalized likelihood function, which is shown to perform well in the finite-sample situation through a simulation study. The proposed methodology is then applied to the interval-censored data arising from the Alzheimer's disease study mentioned above.
Affiliation(s)
- Shuwei Li: School of Economics and Statistics, Guangzhou University, Guangzhou, China
- Qiwei Wu: Department of Statistics, University of Missouri, Columbia, MO, USA
- Jianguo Sun: Department of Statistics, University of Missouri, Columbia, MO, USA
11. Liu Z, Elashoff D, Piantadosi S. Sparse support vector machines with L0 approximation for ultra-high dimensional omics data. Artif Intell Med 2019; 96:134-141. [PMID: 31164207 DOI: 10.1016/j.artmed.2019.04.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/24/2018] [Revised: 03/31/2019] [Accepted: 04/27/2019] [Indexed: 12/30/2022]
Abstract
Omics data usually have ultra-high dimension (p) and small sample size (n). Standard support vector machines (SVMs), which minimize the L2 norm for the primal variables, only lead to sparse solutions for the dual variables. L1-based SVMs, which directly minimize the L1 norm, have been used for feature selection with omics data. However, most current methods directly solve the primal formulations of the problem, which are not computationally scalable. The computational complexity increases with the number of features. In addition, the L1 norm is known to be asymptotically biased and not consistent for feature selection. In this paper, we develop an efficient method for sparse support vector machines with L0 norm approximation. The proposed method approximates the L0 minimization by solving a series of L2 optimization problems, which can be formulated with dual variables. It finds the optimal solution for p primal variables by estimating n dual variables, which is more efficient as long as the sample size is small. L0 approximation leads to sparsity in both dual and primal variables, and can be used for both feature and sample selection. The proposed method identifies far fewer features and achieves similar performance in simulations. We apply the proposed method to feature selection with metagenomic sequencing and gene expression data. It can identify biologically important genes and taxa efficiently.
Affiliation(s)
- Zhenqiu Liu: Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA 17033, USA
- David Elashoff: Department of Medicine, University of California at Los Angeles, CA 90024, USA
- Steven Piantadosi: Samuel Oschin Cancer Center, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
12. Zhao H, Sun D, Li G, Sun J. Simultaneous estimation and variable selection for incomplete event history studies. J MULTIVARIATE ANAL 2019. [DOI: 10.1016/j.jmva.2019.01.005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/24/2022]
13. Zhao H, Wu Q, Li G, Sun J. Simultaneous Estimation and Variable Selection for Interval-Censored Data with Broken Adaptive Ridge Regression. J Am Stat Assoc 2019; 115:204-216. [PMID: 32742044 DOI: 10.1080/01621459.2018.1537922] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Indexed: 10/27/2022]
Abstract
The simultaneous estimation and variable selection for Cox model has been discussed by several authors (Fan and Li, 2002; Huang and Ma, 2010; Tibshirani, 1997) when one observes right-censored failure time data. However, there does not seem to exist an established procedure for interval-censored data, a more general and complex type of failure time data, except two parametric procedures given in Scolas et al. (2016) and Wu and Cook (2015). To address this, we propose a broken adaptive ridge (BAR) regression procedure that combines the strengths of the quadratic regularization and the adaptive weighted bridge shrinkage. In particular, the method allows for the number of covariates to be diverging with the sample size. Under some weak regularity conditions, unlike most of the existing variable selection methods, we establish both the oracle property and the grouping effect of the proposed BAR procedure. An extensive simulation study is conducted and indicates that the proposed approach works well in practical situations and deals with the collinearity problem better than the other oracle-like methods. An application is also provided.
Affiliation(s)
- Hui Zhao: School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China
- Qiwei Wu: Department of Statistics, University of Missouri, Columbia, MO, U.S.A
- Gang Li: Department of Biostatistics, University of California at Los Angeles, CA, U.S.A
- Jianguo Sun: Department of Statistics, University of Missouri, Columbia, MO, U.S.A
14. Dai L, Chen K, Sun Z, Liu Z, Li G. Broken adaptive ridge regression and its asymptotic properties. J MULTIVARIATE ANAL 2018; 168:334-351. [PMID: 30911202 PMCID: PMC6430210 DOI: 10.1016/j.jmva.2018.08.007] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 12/19/2022]
Abstract
This paper studies the asymptotic properties of a sparse linear regression estimator, referred to as broken adaptive ridge (BAR) estimator, resulting from an L0-based iteratively reweighted L2 penalization algorithm using the ridge estimator as its initial value. We show that the BAR estimator is consistent for variable selection and has an oracle property for parameter estimation. Moreover, we show that the BAR estimator possesses a grouping effect: highly correlated covariates are naturally grouped together, which is a desirable property not known for other oracle variable selection methods. Lastly, we combine BAR with a sparsity-restricted least squares estimator and give conditions under which the resulting two-stage sparse regression method is selection and estimation consistent in addition to having the grouping property in high- or ultrahigh-dimensional settings. Numerical studies are conducted to investigate and illustrate the operating characteristics of the BAR method in comparison with other methods.
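The iteratively reweighted L2 scheme described in the abstract above can be sketched for ordinary linear regression as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the commonly used update in which each step solves a ridge problem with per-coefficient penalty `lam / beta_j**2` taken from the previous iterate, starting from a ridge estimate with tuning parameter `xi`. The function name `bar_regression` and the parameters `lam`, `xi`, `n_iter`, `tol`, and `zero_tol` are hypothetical.

```python
import numpy as np


def bar_regression(X, y, lam=1.0, xi=1.0, n_iter=100, tol=1e-8, zero_tol=1e-6):
    """Broken adaptive ridge sketch for least squares (illustrative only).

    Each step solves  min_b ||y - X b||^2 + lam * sum_j b_j^2 / beta_j^2,
    where beta is the previous iterate.
    """
    n, p = X.shape
    XtX = X.T @ X
    Xty = X.T @ y
    # Initial value: ordinary ridge estimate (assumed tuning parameter xi).
    beta = np.linalg.solve(XtX + xi * np.eye(p), Xty)
    for _ in range(n_iter):
        d = np.abs(beta)
        # Algebraically equal to (XtX + lam * diag(1 / beta^2))^{-1} Xty,
        # rewritten with D = diag(|beta|) so that no division by near-zero
        # coefficients is needed.
        A = (d[:, None] * XtX) * d[None, :] + lam * np.eye(p)
        beta_new = d * np.linalg.solve(A, d * Xty)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    # Coefficients that have collapsed numerically are reported as exact zeros.
    beta[np.abs(beta) < zero_tol] = 0.0
    return beta
```

Small coefficients are penalized ever more heavily from one iteration to the next and shrink toward zero, while large coefficients are barely penalized; this is the sense in which the reweighted ridge steps approximate an L0 penalty.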
Affiliation(s)
- Linlin Dai: Southwestern University of Finance and Economics, Chengdu, China
- Kani Chen: Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong
- Zhihua Sun: Institute of Mathematics, Ocean University of China, Qingdao, China
- Zhenqiu Liu: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Gang Li: Department of Biostatistics, School of Public Health, University of California at Los Angeles, CA 90095-1772, USA
15. Zhao H, Sun D, Li G, Sun J. Variable selection for recurrent event data with broken adaptive ridge regression. CAN J STAT 2018; 46:416-428. [PMID: 32999527 DOI: 10.1002/cjs.11459] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/09/2022]
Abstract
Recurrent event data occur in many areas such as medical studies and social sciences and a great deal of literature has been established for their analysis. On the other hand, only limited research exists on the variable selection for recurrent event data, and the existing methods can be seen as direct generalizations of the available penalized procedures for linear models and may not perform as well as expected. This article discusses simultaneous parameter estimation and variable selection and presents a new method with a new penalty function, which will be referred to as the broken adaptive ridge regression approach. In addition to the establishment of the oracle property, we also show that the proposed method has the clustering or grouping effect when covariates are highly correlated. Furthermore, a numerical study is performed and indicates that the method works well for practical situations and can outperform existing methods. An application is provided.
Affiliation(s)
- Hui Zhao: School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China
- Dayu Sun: Department of Statistics, University of Missouri, Columbia, MO, U.S.A
- Gang Li: Department of Biostatistics, University of California at Los Angeles, CA, U.S.A
- Jianguo Sun: Department of Statistics, University of Missouri, Columbia, MO, U.S.A
16. Liu Z, Sun F, McGovern DP. Sparse generalized linear model with L0 approximation for feature selection and prediction with big omics data. BioData Min 2017; 10:39. [PMID: 29270229 PMCID: PMC5735537 DOI: 10.1186/s13040-017-0159-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 04/13/2017] [Accepted: 12/04/2017] [Indexed: 11/10/2022] Open
Abstract
Background: Feature selection and prediction are the most important tasks for big data mining. The common strategies for feature selection in big data mining are L1, SCAD and MC+. However, none of the existing algorithms optimizes L0, which penalizes the number of nonzero features directly. Results: In this paper, we develop a novel sparse generalized linear model (GLM) with L0 approximation for feature selection and prediction with big omics data. The proposed approach approximates the L0 optimization directly. Even though the original L0 problem is non-convex, the problem is approximated by sequential convex optimizations with the proposed algorithm. The proposed method is easy to implement with only several lines of code. Novel adaptive ridge algorithms (L0ADRIDGE) for L0 penalized GLM with ultra high dimensional big data are developed. The proposed approach outperforms the other cutting edge regularization methods including SCAD and MC+ in simulations. When it is applied to integrated analysis of mRNA, microRNA, and methylation data from TCGA ovarian cancer, multilevel gene signatures associated with suboptimal debulking are identified simultaneously. The biological significance and potential clinical importance of those genes are further explored. Conclusions: The developed software L0ADRIDGE in MATLAB is available at https://github.com/liuzqx/L0adridge.
Affiliation(s)
- Zhenqiu Liu: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, 90048 CA USA
- Fengzhu Sun: Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, 90089 CA USA
- Dermot P McGovern: Foundation Inflammatory Bowel & Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, 90048 CA USA