1
Frommlet F. A neutral comparison of algorithms to minimize L0 penalties for high-dimensional variable selection. Biom J 2024; 66:e2200207. [PMID: 37421205 DOI: 10.1002/bimj.202200207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2022] [Revised: 03/09/2023] [Accepted: 04/29/2023] [Indexed: 07/10/2023]
Abstract
Variable selection methods based on L0 penalties have excellent theoretical properties to select sparse models in a high-dimensional setting. There exist modifications of the Bayesian Information Criterion (BIC) which either control the familywise error rate (mBIC) or the false discovery rate (mBIC2) in terms of which regressors are selected to enter a model. However, the minimization of L0 penalties comprises a mixed-integer problem which is known to be NP-hard and therefore becomes computationally challenging with increasing numbers of regressor variables. This is one reason why alternatives like the LASSO have become so popular, which involve convex optimization problems that are easier to solve. The last few years have seen some real progress in developing new algorithms to minimize L0 penalties. The aim of this article is to compare the performance of these algorithms in terms of minimizing L0-based selection criteria. Simulation studies covering a wide range of scenarios that are inspired by genetic association studies are used to compare the values of selection criteria obtained with different algorithms. In addition, some statistical characteristics of the selected models and the runtime of algorithms are compared. Finally, the performance of the algorithms is illustrated in a real data example concerned with expression quantitative trait loci (eQTL) mapping.
Affiliation(s)
- Florian Frommlet
- Institute of Medical Statistics, Center for Medical Data Science, Medical University of Vienna, Vienna, Austria
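For orientation, the two criteria named in the abstract above can be sketched as follows. This is a minimal illustration, not code from the article: it assumes a Gaussian linear model (so that minus twice the log-likelihood reduces to n·log(RSS/n) up to a constant), assumes the form of mBIC and mBIC2 commonly given in the literature with the customary constant c = 4, and uses hypothetical function names (`rss`, `mbic`, `mbic2`).

```python
import math
import numpy as np

def rss(X, y, subset):
    """Residual sum of squares of a least-squares fit on the chosen columns plus an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def mbic(X, y, subset, c=4.0):
    """Assumed form: n*log(RSS/n) + k*log(n) + 2*k*log(p/c), with k selected regressors out of p."""
    n, p = X.shape
    k = len(subset)
    return n * math.log(rss(X, y, subset) / n) + k * math.log(n) + 2 * k * math.log(p / c)

def mbic2(X, y, subset, c=4.0):
    """Assumed form: mBIC minus 2*log(k!), which relaxes the penalty for larger models."""
    k = len(subset)
    return mbic(X, y, subset, c) - 2 * math.lgamma(k + 1)  # lgamma(k + 1) = log(k!)
```

Minimizing either function over all subsets is the L0-penalized (mixed-integer) problem the abstract refers to; the sketch only evaluates the criterion for one given subset.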
2
Wang K, Li X, Liu Y, Kang L. A communication-efficient method for generalized linear regression with ℓ0 regularization. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2022.2115072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Kunpeng Wang
- School of Mathematics and Statistics, Wuhan University, Wuhan, China
- Xuerui Li
- School of Mathematics and Statistics, Wuhan University, Wuhan, China
- Yanyan Liu
- School of Mathematics and Statistics, Wuhan University, Wuhan, China
- Lican Kang
- Center for Quantitative Medicine Duke-NUS Medical School, Singapore, Singapore
3
Smallman L, Artemiou A. A Literature Review of (Sparse) Exponential Family PCA. JOURNAL OF STATISTICAL THEORY AND PRACTICE 2022. [DOI: 10.1007/s42519-021-00238-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
Abstract
This is a brief overview of the methodology around exponential family PCA. We revisit classic PCA methodology, and we focus on exponential family PCA due to its applicability on a number of distributions and hence a wide variety of problems. We discuss the applicability of these methods to text data analysis due to the high-dimensional and sparse nature of these data.
4
Guo Z, Chen M, Fan Y, Song Y. A general adaptive ridge regression method for generalized linear models: an iterative re-weighting approach. COMMUN STAT-THEOR M 2022. [DOI: 10.1080/03610926.2022.2028841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Zijun Guo
- College of Science, University of Shanghai for Science and Technology, Shanghai, China
- Mengxing Chen
- College of Science, University of Shanghai for Science and Technology, Shanghai, China
- Yali Fan
- College of Science, University of Shanghai for Science and Technology, Shanghai, China
- Yan Song
- Department of Control Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
5
Aydın D, Ahmed SE, Yılmaz E. Right-Censored Time Series Modeling by Modified Semi-Parametric A-Spline Estimator. ENTROPY 2021; 23:e23121586. [PMID: 34945891 PMCID: PMC8699840 DOI: 10.3390/e23121586] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/19/2021] [Revised: 11/20/2021] [Accepted: 11/22/2021] [Indexed: 11/16/2022]
Abstract
This paper focuses on the adaptive spline (A-spline) fitting of the semiparametric regression model to time series data with right-censored observations. Typically, there are two main problems that need to be solved in such a case: dealing with censored data and obtaining a proper A-spline estimator for the components of the semiparametric model. The first problem is traditionally solved by the synthetic data approach based on the Kaplan-Meier estimator. In practice, although the synthetic data technique is one of the most widely used solutions for right-censored observations, the transformed data's structure is distorted, especially for heavily censored datasets, due to the nature of the approach. In this paper, we introduced a modified semiparametric estimator based on the A-spline approach to overcome data irregularity with minimum information loss and to resolve the second problem described above. In addition, the semiparametric B-spline estimator was used as a benchmark method to gauge the success of the A-spline estimator. To this end, a detailed Monte Carlo simulation study and a real data sample were carried out to evaluate the performance of the proposed estimator and to make a practical comparison.
Affiliation(s)
- Dursun Aydın
- Department of Statistics, Faculty of Science, Mugla Sitki Kocman University, Kotekli 48000, Turkey
- Syed Ejaz Ahmed
- Department of Mathematics and Statistics, Faculty of Science, Brock University, 1812 Sir Isaac Brock Way, St. Catharines, ON L2S 3A1, Canada
- Ersin Yılmaz
- Department of Statistics, Faculty of Science, Mugla Sitki Kocman University, Kotekli 48000, Turkey
6
Zhao H, Zheng K, Li Y, Wang J. A novel graph attention model for predicting frequencies of drug-side effects from multi-view data. Brief Bioinform 2021; 22:6312959. [PMID: 34213525 DOI: 10.1093/bib/bbab239] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 04/13/2021] [Revised: 05/30/2021] [Accepted: 06/04/2021] [Indexed: 12/15/2022] Open
Abstract
Identifying the frequencies of the drug-side effects is a very important issue in pharmacological studies and drug risk-benefit. However, designing clinical trials to determine the frequencies is usually time consuming and expensive, and most existing methods can only predict the drug-side effect existence or associations, not their frequencies. Inspired by the recent progress of graph neural networks in the recommended system, we develop a novel prediction model for drug-side effect frequencies, using a graph attention network to integrate three different types of features, including the similarity information, known drug-side effect frequency information and word embeddings. In comparison, the few available studies focusing on frequency prediction use only the known drug-side effect frequency scores. One novel approach used in this work first decomposes the feature types in drug-side effect graph to extract different view representation vectors based on three different type features, and then recombines these latent view vectors automatically to obtain unified embeddings for prediction. The proposed method demonstrates high effectiveness in 10-fold cross-validation. The computational results show that the proposed method achieves the best performance in the benchmark dataset, outperforming the state-of-the-art matrix decomposition model. In addition, some ablation experiments and visual analyses are also supplied to illustrate the usefulness of our method for the prediction of the drug-side effect frequencies. The codes of MGPred are available at https://github.com/zhc940702/MGPred and https://zenodo.org/record/4449613.
Affiliation(s)
- Haochen Zhao
- School of Computer Science and Engineering, Central South University, Changsha 410083, China; Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
- Kai Zheng
- School of Computer Science and Engineering, Central South University, Changsha 410083, China; Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
- Yaohang Li
- Department of Computer Science, Old Dominion University, Norfolk, VA 23529-0001, United States
- Jianxin Wang
- School of Computer Science and Engineering, Central South University, Changsha 410083, China; Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
7
Li N, Peng X, Kawaguchi E, Suchard MA, Li G. A scalable surrogate L0 sparse regression method for generalized linear models with applications to large scale data. J Stat Plan Inference 2021. [DOI: 10.1016/j.jspi.2020.12.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/22/2022]
8
Bouaziz O, Lauridsen E, Nuel G. Regression modelling of interval censored data based on the adaptive ridge procedure. J Appl Stat 2021; 49:3319-3343. [DOI: 10.1080/02664763.2021.1944996] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/13/2023]
Affiliation(s)
- Eva Lauridsen
- Ressource Center for Rare Oral Diseases, Copenhagen University Hospital, Copenhagen, Denmark
9
Saishu H, Kudo K, Takano Y. Sparse Poisson regression via mixed-integer optimization. PLoS One 2021; 16:e0249916. [PMID: 33886612 PMCID: PMC8062005 DOI: 10.1371/journal.pone.0249916] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2020] [Accepted: 03/26/2021] [Indexed: 11/20/2022] Open
Abstract
We present a mixed-integer optimization (MIO) approach to sparse Poisson regression. The MIO approach to sparse linear regression was first proposed in the 1970s, but has recently received renewed attention due to advances in optimization algorithms and computer hardware. In contrast to many sparse estimation algorithms, the MIO approach has the advantage of finding the best subset of explanatory variables with respect to various criterion functions. In this paper, we focus on a sparse Poisson regression that maximizes the weighted sum of the log-likelihood function and the L2-regularization term. For this problem, we derive a mixed-integer quadratic optimization (MIQO) formulation by applying a piecewise-linear approximation to the log-likelihood function. Optimization software can solve this MIQO problem to optimality. Moreover, we propose two methods for selecting a limited number of tangent lines effective for piecewise-linear approximations. We assess the efficacy of our method through computational experiments using synthetic and real-world datasets. Our methods provide better log-likelihood values than do conventional greedy algorithms in selecting tangent lines. In addition, our MIQO formulation delivers better out-of-sample prediction performance than do forward stepwise selection and L1-regularized estimation, especially in low-noise situations.
Affiliation(s)
- Hiroki Saishu
- Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Ibaraki, Japan
- Kota Kudo
- Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Ibaraki, Japan
- Yuichi Takano
- Faculty of Engineering, Information and Systems, University of Tsukuba, Tsukuba, Ibaraki, Japan
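The tangent-line idea mentioned in the abstract of Saishu et al. can be illustrated with a small sketch. It is an assumption-laden illustration, not the paper's formulation: each Poisson log-likelihood term y·η − exp(η) (the constant −log(y!) is dropped) is concave in the linear predictor η, so the minimum over a set of tangent lines is a piecewise-linear upper approximation that is exact at the tangent points. The function names and the choice of tangent points are hypothetical; the paper's own methods for selecting tangent lines are not reproduced here.

```python
import numpy as np

def poisson_loglik_term(eta, y):
    """One Poisson log-likelihood term as a function of the linear predictor eta (constant dropped)."""
    return y * eta - np.exp(eta)

def tangent_upper_bound(eta, y, points):
    """Piecewise-linear approximation: minimum over tangent lines taken at the given points."""
    eta = np.asarray(eta, dtype=float)[..., None]   # shape (..., 1) for broadcasting
    t = np.asarray(points, dtype=float)             # tangent points, shape (m,)
    values = y * t - np.exp(t)                      # function value at each tangent point
    slopes = y - np.exp(t)                          # derivative at each tangent point
    return np.min(values + slopes * (eta - t), axis=-1)
```

In a mixed-integer formulation, each tangent line would typically become one linear constraint on an auxiliary variable that bounds the log-likelihood term, which is what makes the problem accessible to standard optimization software.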
10
Huang J, Jiao Y, Kang L, Liu J, Liu Y, Lu X. GSDAR: a fast Newton algorithm for ℓ0 regularized generalized linear models with statistical guarantee. Comput Stat 2021. [DOI: 10.1007/s00180-021-01098-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/21/2022]
11
Goepp V, Thalabard JC, Nuel G, Bouaziz O. Regularized bidimensional estimation of the hazard rate. Int J Biostat 2021; 18:263-277. [PMID: 33768761 DOI: 10.1515/ijb-2019-0003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2019] [Accepted: 02/26/2021] [Indexed: 11/15/2022]
Abstract
In epidemiological or demographic studies, with variable age at onset, a typical quantity of interest is the incidence of a disease (for example the cancer incidence). In these studies, the individuals are usually highly heterogeneous in terms of dates of birth (the cohort) and with respect to the calendar time (the period) and appropriate estimation methods are needed. In this article a new estimation method is presented which extends classical age-period-cohort analysis by allowing interactions between age, period and cohort effects. We introduce a bidimensional regularized estimate of the hazard rate where a penalty is introduced on the likelihood of the model. This penalty can be designed either to smooth the hazard rate or to enforce consecutive values of the hazard to be equal, leading to a parsimonious representation of the hazard rate. In the latter case, we make use of an iterative penalized likelihood scheme to approximate the L0 norm, which makes the computation tractable. The method is evaluated on simulated data and applied on breast cancer survival data from the SEER program.
Affiliation(s)
- Vivien Goepp
- MAP5, CNRS UMR 8145, 45, rue des Saints-Pères, 75006, Paris, France; MINES ParisTech, CBIO-Centre for Computational Biology, PSL Research University, 75006, Paris, France; Institut Curie, PSL Research University, 75005, Paris, France; Inserm, U900, Paris, France
- Grégory Nuel
- LPSM, CNRS UMR 8001, 4, Place Jussieu, 75005, Paris, France
- Olivier Bouaziz
- MAP5, CNRS UMR 8145, 45, rue des Saints-Pères, 75006, Paris, France
12
Affiliation(s)
- Yichao Wu
- Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL
13
Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data. STATISTICS IN BIOSCIENCES 2020; 12:376-398. [PMID: 33796162 DOI: 10.1007/s12561-020-09284-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Abstract
The threshold regression model is an effective alternative to the Cox proportional hazards regression model when the proportional hazards assumption is not met. This paper considers variable selection for threshold regression. This model has separate regression functions for the initial health status and the speed of degradation in health. This flexibility is an important advantage when considering relevant risk factors for a complex time-to-event model where one needs to decide which variables should be included in the regression function for the initial health status, in the function for the speed of degradation in health, or in both functions. In this paper, we extend the broken adaptive ridge (BAR) method, originally designed for variable selection for one regression function, to simultaneous variable selection for both regression functions needed in the threshold regression model. We establish variable selection consistency of the proposed method and asymptotic normality of the estimator of non-zero regression coefficients. Simulation results show that our method outperformed threshold regression without variable selection and variable selection based on the Akaike information criterion. We apply the proposed method to data from an HIV drug adherence study in which electronic monitoring of drug intake is used to identify risk factors for non-adherence.
14
Abstract
This paper aims to solve the problem of fitting a nonparametric regression function with right-censored data. In general, issues of censorship in the response variable are solved by synthetic data transformation based on the Kaplan–Meier estimator in the literature. In the context of synthetic data, there have been different studies on the estimation of right-censored nonparametric regression models based on smoothing splines, regression splines, kernel smoothing, local polynomials, and so on. It should be emphasized that synthetic data transformation manipulates the observations because it assigns zero values to censored data points and increases the size of the observations. Thus, an irregularly distributed dataset is obtained. We claim that adaptive spline (A-spline) regression has the potential to deal with this irregular dataset more easily than the smoothing techniques mentioned here, due to the freedom to determine the degree of the spline, as well as the number and location of the knots. The theoretical properties of A-splines with synthetic data are detailed in this paper. Additionally, we support our claim with numerical studies, including a simulation study and a real-world data example.
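The synthetic data transformation described in this abstract can be sketched as follows. The abstract does not state which transformation is used; the sketch assumes a Koul-Susarla-Van Ryzin-type transformation, in which each uncensored observation is divided by a Kaplan-Meier estimate of the censoring survival function and each censored observation is set to zero. It also assumes no ties among the observed times, and the function names are hypothetical.

```python
import numpy as np

def censoring_survival_km(z, delta):
    """Kaplan-Meier estimate of the censoring survival function at each observed time.

    z: observed times min(Y, C); delta: 1 if uncensored, 0 if censored.
    Censored points (delta == 0) act as the 'events' of the censoring distribution.
    Assumes no ties among the observed times.
    """
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(z)
    order = np.argsort(z, kind="stable")
    ranks = np.arange(1, n + 1)
    factors = ((n - ranks) / (n - ranks + 1.0)) ** (1 - delta[order])
    surv = np.empty(n)
    surv[order] = np.cumprod(factors)
    return surv

def synthetic_response(z, delta):
    """Censored points become 0; uncensored points are inflated by 1 / censoring survival."""
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    surv = censoring_survival_km(z, delta)
    out = np.zeros(len(z))
    observed = delta == 1
    out[observed] = z[observed] / surv[observed]
    return out
```

For observed times 1, 2, 3, 4 with only the second one censored, the transformed responses are 1, 0, 4.5 and 6, which shows both effects named in the abstract: zeros at censored points and enlarged values elsewhere, producing the irregularly distributed dataset the authors discuss.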
15
Kawaguchi ES, Suchard MA, Liu Z, Li G. A surrogate ℓ0 sparse Cox's regression with applications to sparse high-dimensional massive sample size time-to-event data. Stat Med 2020; 39:675-686. [PMID: 31814146 PMCID: PMC8386178 DOI: 10.1002/sim.8438] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/10/2019] [Revised: 09/30/2019] [Accepted: 11/02/2019] [Indexed: 11/11/2022]
Abstract
Sparse high-dimensional massive sample size (sHDMSS) time-to-event data present multiple challenges to quantitative researchers as most current sparse survival regression methods and software will grind to a halt and become practically inoperable. This paper develops a scalable ℓ0-based sparse Cox regression tool for right-censored time-to-event data that easily takes advantage of existing high performance implementation of ℓ2-penalized regression method for sHDMSS time-to-event data. Specifically, we extend the ℓ0-based broken adaptive ridge (BAR) methodology to the Cox model, which involves repeatedly performing reweighted ℓ2-penalized regression. We rigorously show that the resulting estimator for the Cox model is selection consistent, oracle for parameter estimation, and has a grouping property for highly correlated covariates. Furthermore, we implement our BAR method in an R package for sHDMSS time-to-event data by leveraging existing efficient algorithms for massive ℓ2-penalized Cox regression. We evaluate the BAR Cox regression method by extensive simulations and illustrate its application on an sHDMSS time-to-event data from the National Trauma Data Bank with hundreds of thousands of observations and tens of thousands sparsely represented covariates.
Affiliation(s)
- Eric S. Kawaguchi
- Department of Preventive Medicine, University of Southern California, Los Angeles, California
- Marc A. Suchard
- Department of Preventive Medicine, University of Southern California, Los Angeles, California
- Department of Biomathematics, University of California, Los Angeles, California
- Department of Human Genetics, University of California, Los Angeles, California
- Zhenqiu Liu
- Department of Public Health Sciences, Penn State Cancer Institute, Hershey, Pennsylvania
- Gang Li
- Department of Preventive Medicine, University of Southern California, Los Angeles, California
- Department of Biomathematics, University of California, Los Angeles, California
16
Simple Poisson PCA: an algorithm for (sparse) feature extraction with simultaneous dimension determination. Comput Stat 2019. [DOI: 10.1007/s00180-019-00903-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
17
Wang H, Li G. Extreme learning machine Cox model for high-dimensional survival analysis. Stat Med 2019; 38:2139-2156. [PMID: 30632193 PMCID: PMC6498851 DOI: 10.1002/sim.8090] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 05/04/2018] [Revised: 10/11/2018] [Accepted: 12/12/2018] [Indexed: 11/07/2022]
Abstract
Some interesting recent studies have shown that neural network models are useful alternatives in modeling survival data when the assumptions of a classical parametric or semiparametric survival model such as the Cox (1972) model are seriously violated. However, to the best of our knowledge, the plausibility of adapting the emerging extreme learning machine (ELM) algorithm for single-hidden-layer feedforward neural networks to survival analysis has not been explored. In this paper, we present a kernel ELM Cox model regularized by an L0-based broken adaptive ridge (BAR) penalization method. Then, we demonstrate that the resulting method, referred to as ELMCoxBAR, can outperform some other state-of-the-art survival prediction methods such as L1- or L2-regularized Cox regression, random survival forest with various splitting rules, and boosted Cox model, in terms of its predictive performance using both simulated and real world datasets. In addition to its good predictive performance, we illustrate that the proposed method has a key computational advantage over the above competing methods in terms of computation time efficiency using real-world ultra-high-dimensional survival data.
Affiliation(s)
- Hong Wang
- School of Mathematics and Statistics, Central South University, Changsha, China
- Gang Li
- Department of Biostatistics, UCLA Fielding School of Public Health, University of California, Los Angeles, California
18
Zhao H, Sun D, Li G, Sun J. Simultaneous estimation and variable selection for incomplete event history studies. J MULTIVARIATE ANAL 2019. [DOI: 10.1016/j.jmva.2019.01.005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/24/2022]
19
Dai L, Chen K, Sun Z, Liu Z, Li G. Broken adaptive ridge regression and its asymptotic properties. J MULTIVARIATE ANAL 2018; 168:334-351. [PMID: 30911202 PMCID: PMC6430210 DOI: 10.1016/j.jmva.2018.08.007] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Indexed: 12/19/2022]
Abstract
This paper studies the asymptotic properties of a sparse linear regression estimator, referred to as broken adaptive ridge (BAR) estimator, resulting from an L0-based iteratively reweighted L2 penalization algorithm using the ridge estimator as its initial value. We show that the BAR estimator is consistent for variable selection and has an oracle property for parameter estimation. Moreover, we show that the BAR estimator possesses a grouping effect: highly correlated covariates are naturally grouped together, which is a desirable property not known for other oracle variable selection methods. Lastly, we combine BAR with a sparsity-restricted least squares estimator and give conditions under which the resulting two-stage sparse regression method is selection and estimation consistent in addition to having the grouping property in high- or ultrahigh-dimensional settings. Numerical studies are conducted to investigate and illustrate the operating characteristics of the BAR method in comparison with other methods.
Affiliation(s)
- Linlin Dai
- Southwestern University of Finance and Economics, Chengdu, China
- Kani Chen
- Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong
- Zhihua Sun
- Institute of Mathematics, Ocean University of China, Qingdao, China
- Zhenqiu Liu
- Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Gang Li
- Department of Biostatistics, School of Public Health, University of California at Los Angeles, CA 90095-1772, USA
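The iteratively reweighted L2 scheme described in the abstract of Dai et al. can be sketched for the linear model as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: each step is assumed to solve a ridge problem with penalty λ·Σ b_j² / β_j², where β is the previous iterate, starting from an ordinary ridge estimate. The function name `bar_linear`, the tuning values and the stopping rule are hypothetical.

```python
import numpy as np

def bar_linear(X, y, lam=1.0, ridge_lam=1.0, n_iter=100, tol=1e-8):
    """Sketch of a broken-adaptive-ridge-style iteration for linear regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    gram = X.T @ X
    xty = X.T @ y
    # Initial value: ordinary ridge estimator.
    beta = np.linalg.solve(gram + ridge_lam * np.eye(p), xty)
    for _ in range(n_iter):
        # Reweighted ridge step (X'X + lam * diag(1 / beta_j^2))^{-1} X'y, rewritten as
        # G (G X'X G + lam I)^{-1} G X'y with G = diag(beta) to avoid dividing by
        # coefficients that have shrunk to zero.
        scaled = (beta[:, None] * gram) * beta[None, :] + lam * np.eye(p)
        new_beta = beta * np.linalg.solve(scaled, beta * xty)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta
```

In this sketch, coefficients of irrelevant covariates are driven towards zero over the iterations, while large coefficients are penalized only lightly, which is how the reweighted L2 penalty mimics an L0 penalty.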
20
Zhao H, Sun D, Li G, Sun J. Variable selection for recurrent event data with broken adaptive ridge regression. CAN J STAT 2018; 46:416-428. [PMID: 32999527 DOI: 10.1002/cjs.11459] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
Abstract
Recurrent event data occur in many areas such as medical studies and social sciences and a great deal of literature has been established for their analysis. On the other hand, only limited research exists on the variable selection for recurrent event data, and the existing methods can be seen as direct generalizations of the available penalized procedures for linear models and may not perform as well as expected. This article discusses simultaneous parameter estimation and variable selection and presents a new method with a new penalty function, which will be referred to as the broken adaptive ridge regression approach. In addition to the establishment of the oracle property, we also show that the proposed method has the clustering or grouping effect when covariates are highly correlated. Furthermore, a numerical study is performed and indicates that the method works well for practical situations and can outperform existing methods. An application is provided.
Affiliation(s)
- Hui Zhao
- School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China
- Dayu Sun
- Department of Statistics, University of Missouri, Columbia, MO, U.S.A
- Gang Li
- Department of Biostatistics, University of California at Los Angeles, CA, U.S.A
- Jianguo Sun
- Department of Statistics, University of Missouri, Columbia, MO, U.S.A
21
SAFlex: A structural alphabet extension to integrate protein structural flexibility and missing data information. PLoS One 2018; 13:e0198854. [PMID: 29975698 PMCID: PMC6033379 DOI: 10.1371/journal.pone.0198854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2017] [Accepted: 05/25/2018] [Indexed: 11/19/2022] Open
Abstract
In this paper, we describe SAFlex (Structural Alphabet Flexibility), an extension of an existing structural alphabet (HMM-SA), to better explore increasing protein three dimensional structure information by encoding conformations of proteins in case of missing residues or uncertainties. An SA aims to reduce three dimensional conformations of proteins as well as their analysis and comparison complexity by simplifying any conformation in a series of structural letters. Our methodology presents several novelties. Firstly, it can account for the encoding uncertainty by providing a wide range of encoding options: the maximum a posteriori, the marginal posterior distribution, and the effective number of letters at each given position. Secondly, our new algorithm deals with the missing data in the protein structure files (concerning more than 75% of the proteins from the Protein Data Bank) in a rigorous probabilistic framework. Thirdly, SAFlex is able to encode and to build a consensus encoding from different replicates of a single protein such as several homomer chains. This allows localizing structural differences between different chains and detecting structural variability, which is essential for protein flexibility identification. These improvements are illustrated on different proteins, such as the crystal structure of an eukaryotic small heat shock protein. They are promising to explore increasing protein redundancy data and obtain useful quantification of their flexibility.
22
Wittkowski KM, Dadurian C, Seybold MP, Kim HS, Hoshino A, Lyden D. Complex polymorphisms in endocytosis genes suggest alpha-cyclodextrin as a treatment for breast cancer. PLoS One 2018; 13:e0199012. [PMID: 29965997 PMCID: PMC6028090 DOI: 10.1371/journal.pone.0199012] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 06/20/2017] [Accepted: 05/17/2018] [Indexed: 02/06/2023] Open
Abstract
Most breast cancer deaths are caused by metastasis and treatment options beyond radiation and cytotoxic drugs, which have severe side effects, and hormonal treatments, which are or become ineffective for many patients, are urgently needed. This study reanalyzed existing data from three genome-wide association studies (GWAS) using a novel computational biostatistics approach (muGWAS), which had been validated in studies of 600-2000 subjects in epilepsy and autism. MuGWAS jointly analyzes several neighboring single nucleotide polymorphisms while incorporating knowledge about genetics of heritable diseases into the statistical method and about GWAS into the rules for determining adaptive genome-wide significance. Results from three independent GWAS of 1000-2000 subjects each, which were made available under the National Institute of Health's "Up For A Challenge" (U4C) project, not only confirmed cell-cycle control and receptor/AKT signaling, but, for the first time in breast cancer GWAS, also consistently identified many genes involved in endo-/exocytosis (EEC), most of which had already been observed in functional and expression studies of breast cancer. In particular, the findings include genes that translocate (ATP8A1, ATP8B1, ANO4, ABCA1) and metabolize (AGPAT3, AGPAT4, DGKQ, LPPR1) phospholipids entering the phosphatidylinositol cycle, which controls EEC. These novel findings suggest scavenging phospholipids as a novel intervention to control local spread of cancer, packaging of exosomes (which prepare distant microenvironment for organ-specific metastases), and endocytosis of β1 integrins (which are required for spread of metastatic phenotype and mesenchymal migration of tumor cells). Beta-cyclodextrins (βCD) have already been shown to be effective in in vitro and animal studies of breast cancer, but exhibits cholesterol-related ototoxicity. The smaller alpha-cyclodextrins (αCD) also scavenges phospholipids, but cannot fit cholesterol. 
An in-vitro study presented here confirms hydroxypropyl (HP)-αCD to be twice as effective as HPβCD against migration of human cells of both receptor negative and estrogen-receptor positive breast cancer. If the previous successful animal studies with βCDs are replicated with the safer and more effective αCDs, clinical trials of adjuvant treatment with αCDs are warranted. Ultimately, all breast cancer are expected to benefit from treatment with HPαCD, but women with triple-negative breast cancer (TNBC) will benefit most, because they have fewer treatment options and their cancer advances more aggressively.
Affiliation(s)
- Knut M. Wittkowski
- Center for Clinical and Translational Science, The Rockefeller University, New York, New York, United States of America
- Christina Dadurian
- Center for Clinical and Translational Science, The Rockefeller University, New York, New York, United States of America
- Martin P. Seybold
- Institut für Formale Methoden der Informatik, Universität Stuttgart, Stuttgart, Germany
- Han Sang Kim
- Department of Pediatrics, and Cell and Developmental Biology, Weill Medical College of Cornell University, New York, New York, United States of America
- Ayuko Hoshino
- Department of Pediatrics, and Cell and Developmental Biology, Weill Medical College of Cornell University, New York, New York, United States of America
- David Lyden
- Department of Pediatrics, and Cell and Developmental Biology, Weill Medical College of Cornell University, New York, New York, United States of America
|
23
|
Vradi E, Brannath W, Jaki T, Vonk R. Model selection based on combined penalties for biomarker identification. J Biopharm Stat 2017; 28:735-749. [PMID: 29072549 DOI: 10.1080/10543406.2017.1378662] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/18/2022]
Abstract
The growing role of targeted medicine has led to an increased focus on the development of actionable biomarkers. Current penalized selection methods that are used to identify biomarker panels for classification in high-dimensional data, however, often result in highly complex panels that need careful pruning for practical use. In the framework of regularization methods, a penalty that is a weighted sum of the L1 and L0 norms has been proposed to account for the complexity of the resulting model. In practice, the limitation of this penalty is that the objective function is non-convex and non-smooth, the optimization is computationally intensive, and the application to high-dimensional settings is challenging. In this paper, we propose a stepwise forward variable selection method which combines the L0 with L1 or L2 norms. The penalized likelihood criterion that is used in the stepwise selection procedure results in more parsimonious models, keeping only the most relevant features. Simulation results and a real application show that our approach achieves prediction performance comparable to that of common selection methods while minimizing the number of variables in the selected model, resulting in a more parsimonious model as desired.
Affiliation(s)
- Eleni Vradi
- Department of Research and Clinical Sciences Statistics, Bayer AG, Berlin, Germany
- Werner Brannath
- Institute of Statistics, Competence Center for Clinical Trials Bremen, Faculty 3, University of Bremen, Bremen, Germany
- Thomas Jaki
- Department of Mathematics and Statistics, Medical and Pharmaceutical Statistics Research Unit, Lancaster University, Lancaster, United Kingdom
- Richardus Vonk
- Department of Research and Clinical Sciences Statistics, Bayer AG, Berlin, Germany
|
24
|
Hugelier S, Piqueras S, Bedia C, de Juan A, Ruckebusch C. Application of a sparseness constraint in multivariate curve resolution - Alternating least squares. Anal Chim Acta 2017; 1000:100-108. [PMID: 29289299 DOI: 10.1016/j.aca.2017.08.021] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 03/24/2017] [Revised: 07/25/2017] [Accepted: 08/19/2017] [Indexed: 11/28/2022]
Abstract
The use of sparseness in chemometrics is a concept that has increased in popularity. The advantage is, above all, a better interpretability of the results obtained. In this work, sparseness is implemented as a constraint in multivariate curve resolution - alternating least squares (MCR-ALS), which aims at reproducing raw (mixed) data by a bilinear model of chemically meaningful profiles. In many cases, the mixed raw data analyzed are not sparse by nature, but their decomposition profiles can be, as is the case in some instrumental responses, such as mass spectra, or in concentration profiles linked to scattered distribution maps of powdered samples in hyperspectral images. To induce sparseness in the constrained profiles, one-dimensional and/or two-dimensional numerical arrays can be fitted using a basis of Gaussian functions with a penalty on the coefficients. In this work, a least squares regression framework with L0-norm penalty is applied. This L0-norm penalty constrains the number of non-null coefficients in the fit of the constrained array without requiring a priori knowledge of their number or positions. It has been shown that the sparseness constraint induces the suppression of values linked to uninformative channels and noise in MS spectra and improves the location of scattered compounds in distribution maps, resulting in a better interpretability of the constrained profiles. An additional benefit of the sparseness constraint is a lower ambiguity in the bilinear model, since the predominance of null coefficients in the constrained profiles also helps to limit the solutions for the profiles in the counterpart matrix of the MCR bilinear model.
Affiliation(s)
- Siewert Hugelier
- Université de Lille, Sciences et Technologies, LASIR, CNRS, F-59000 Lille, France
- Sara Piqueras
- Chemometrics Group, Universitat de Barcelona, Diagonal 645, 08028 Barcelona, Spain; Department of Environmental Chemistry, IDAEA-CSIC, Calle Jordi Girona 18-26, 08034 Barcelona, Spain
- Carmen Bedia
- Department of Environmental Chemistry, IDAEA-CSIC, Calle Jordi Girona 18-26, 08034 Barcelona, Spain
- Anna de Juan
- Chemometrics Group, Universitat de Barcelona, Diagonal 645, 08028 Barcelona, Spain
- Cyril Ruckebusch
- Université de Lille, Sciences et Technologies, LASIR, CNRS, F-59000 Lille, France
|