1
|
Song X, Liang K, Li J. WGRLR: A Weighted Group Regularized Logistic Regression for Cancer Diagnosis and Gene Selection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1563-1573. [PMID: 36044492 DOI: 10.1109/tcbb.2022.3203167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Sparse regressions applied to cancer diagnosis suffer from noise reduction, gene grouping, and group significance evaluation. This paper presented the weighted group regularized logistic regression (WGRLR) for dealing with the above problems. Clean data was separated from noisy gene expression profile data, based on which gene grouping and model building were performed. An interpretable gene group significance evaluation criterion was proposed based on symmetrical uncertainty and module eigengene. A group-wise individual gene significance evaluation criterion was also presented. The performances of the proposed method were compared with WGGL, ASGL-CMI, SGL, GL, Elastic Net, and lasso on acute leukemia and brain cancer data. Experimental results demonstrate that the proposed method is superior to the other six methods in cancer diagnosis accuracy and gene selection.
Collapse
|
2
|
Spirko-Burns L, Devarajan K. Supervised Dimension Reduction for Large-Scale "Omics" Data With Censored Survival Outcomes Under Possible Non-Proportional Hazards. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2032-2044. [PMID: 31940547 DOI: 10.1109/tcbb.2020.2965934] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
The past two decades have witnessed significant advances in high-throughput "omics" technologies such as genomics, proteomics, metabolomics, transcriptomics and radiomics. These technologies have enabled simultaneous measurement of the expression levels of tens of thousands of features from individual patient samples and have generated enormous amounts of data that require analysis and interpretation. One specific area of interest has been in studying the relationship between these features and patient outcomes, such as overall and recurrence-free survival, with the goal of developing a predictive "omics" profile. Large-scale studies often suffer from the presence of a large fraction of censored observations and potential time-varying effects of features, and methods for handling them have been lacking. In this paper, we propose supervised methods for feature selection and survival prediction that simultaneously deal with both issues. Our approach utilizes continuum power regression (CPR) - a framework that includes a variety of regression methods - in conjunction with the parametric or semi-parametric accelerated failure time (AFT) model. Both CPR and AFT fall within the linear models framework and, unlike black-box models, the proposed prognostic index has a simple yet useful interpretation. We demonstrate the utility of our methods using simulated and publicly available cancer genomics data.
Collapse
|
3
|
Sinnott JA, Cai T. Pathway aggregation for survival prediction via multiple kernel learning. Stat Med 2018; 37:2501-2515. [PMID: 29664143 PMCID: PMC5994931 DOI: 10.1002/sim.7681] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2016] [Revised: 03/10/2018] [Accepted: 03/20/2018] [Indexed: 01/05/2023]
Abstract
Attempts to predict prognosis in cancer patients using high-dimensional genomic data such as gene expression in tumor tissue can be made difficult by the large number of features and the potential complexity of the relationship between features and the outcome. Integrating prior biological knowledge into risk prediction with such data by grouping genomic features into pathways and networks reduces the dimensionality of the problem and could improve prediction accuracy. Additionally, such knowledge-based models may be more biologically grounded and interpretable. Prediction could potentially be further improved by allowing for complex nonlinear pathway effects. The kernel machine framework has been proposed as an effective approach for modeling the nonlinear and interactive effects of genes in pathways for both censored and noncensored outcomes. When multiple pathways are under consideration, one may efficiently select informative pathways and aggregate their signals via multiple kernel learning (MKL), which has been proposed for prediction of noncensored outcomes. In this paper, we propose MKL methods for censored survival outcomes. We derive our approach for a general survival modeling framework with a convex objective function and illustrate its application under the Cox proportional hazards and semiparametric accelerated failure time models. Numerical studies demonstrate that the proposed MKL-based prediction methods work well in finite sample and can potentially outperform models constructed assuming linear effects or ignoring the group knowledge. The methods are illustrated with an application to 2 cancer data sets.
Collapse
Affiliation(s)
| | - Tianxi Cai
- Department of Biostatistics, Harvard University, Boston, MA, USA
| |
Collapse
|
4
|
Liu Z, Sun F, Braun J, McGovern DPB, Piantadosi S. Multilevel regularized regression for simultaneous taxa selection and network construction with metagenomic count data. ACTA ACUST UNITED AC 2014; 31:1067-74. [PMID: 25416747 DOI: 10.1093/bioinformatics/btu778] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2014] [Accepted: 11/17/2014] [Indexed: 02/06/2023]
Abstract
MOTIVATION Identifying disease associated taxa and constructing networks for bacteria interactions are two important tasks usually studied separately. In reality, differentiation of disease associated taxa and correlation among taxa may affect each other. One genus can be differentiated because it is highly correlated with another highly differentiated one. In addition, network structures may vary under different clinical conditions. Permutation tests are commonly used to detect differences between networks in distinct phenotypes, and they are time-consuming. RESULTS In this manuscript, we propose a multilevel regularized regression method to simultaneously identify taxa and construct networks. We also extend the framework to allow construction of a common network and differentiated network together. An efficient algorithm with dual formulation is developed to deal with the large-scale n ≪ m problem with a large number of taxa (m) and a small number of samples (n) efficiently. The proposed method is regularized with a general Lp (p ∈ [0, 2]) penalty and models the effects of taxa abundance differentiation and correlation jointly. We demonstrate that it can identify both true and biologically significant genera and network structures. AVAILABILITY AND IMPLEMENTATION Software MLRR in MATLAB is available at http://biostatistics.csmc.edu/mlrr/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Zhenqiu Liu
- Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA, Molecular and Computational Biology Program, Department of Biological Sciences, USC, Los Angeles, CA 90089, USA, Department of Pathology and Laboratory Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA and F. Widjaja Foundation - Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
| | - Fengzhu Sun
- Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA, Molecular and Computational Biology Program, Department of Biological Sciences, USC, Los Angeles, CA 90089, USA, Department of Pathology and Laboratory Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA and F. Widjaja Foundation - Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
| | - Jonathan Braun
- Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA, Molecular and Computational Biology Program, Department of Biological Sciences, USC, Los Angeles, CA 90089, USA, Department of Pathology and Laboratory Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA and F. Widjaja Foundation - Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
| | - Dermot P B McGovern
- Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA, Molecular and Computational Biology Program, Department of Biological Sciences, USC, Los Angeles, CA 90089, USA, Department of Pathology and Laboratory Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA and F. Widjaja Foundation - Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
| | - Steven Piantadosi
- Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA, Molecular and Computational Biology Program, Department of Biological Sciences, USC, Los Angeles, CA 90089, USA, Department of Pathology and Laboratory Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA and F. Widjaja Foundation - Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
| |
Collapse
|
5
|
Sinnott JA, Cai T. Omnibus risk assessment via accelerated failure time kernel machine modeling. Biometrics 2013; 69:861-73. [PMID: 24328713 PMCID: PMC3869038 DOI: 10.1111/biom.12098] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2012] [Revised: 07/01/2013] [Accepted: 07/01/2013] [Indexed: 01/05/2023]
Abstract
Integrating genomic information with traditional clinical risk factors to improve the prediction of disease outcomes could profoundly change the practice of medicine. However, the large number of potential markers and possible complexity of the relationship between markers and disease make it difficult to construct accurate risk prediction models. Standard approaches for identifying important markers often rely on marginal associations or linearity assumptions and may not capture non-linear or interactive effects. In recent years, much work has been done to group genes into pathways and networks. Integrating such biological knowledge into statistical learning could potentially improve model interpretability and reliability. One effective approach is to employ a kernel machine (KM) framework, which can capture nonlinear effects if nonlinear kernels are used (Scholkopf and Smola, 2002; Liu et al., 2007, 2008). For survival outcomes, KM regression modeling and testing procedures have been derived under a proportional hazards (PH) assumption (Li and Luan, 2003; Cai, Tonini, and Lin, 2011). In this article, we derive testing and prediction methods for KM regression under the accelerated failure time (AFT) model, a useful alternative to the PH model. We approximate the null distribution of our test statistic using resampling procedures. When multiple kernels are of potential interest, it may be unclear in advance which kernel to use for testing and estimation. We propose a robust Omnibus Test that combines information across kernels, and an approach for selecting the best kernel for estimation. The methods are illustrated with an application in breast cancer.
Collapse
Affiliation(s)
- Jennifer A. Sinnott
- Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115, USA
| | - Tianxi Cai
- Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115, USA
| |
Collapse
|