1
Zhao Z, Zobolas J, Zucknick M, Aittokallio T. Tutorial on survival modeling with applications to omics data. Bioinformatics 2024; 40:btae132. [PMID: 38445722 PMCID: PMC10973942 DOI: 10.1093/bioinformatics/btae132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Revised: 02/22/2024] [Accepted: 03/04/2024] [Indexed: 03/07/2024] Open
Abstract
MOTIVATION Identification of genomic, molecular and clinical markers prognostic of patient survival is important for developing personalized disease prevention, diagnostic and treatment approaches. Modern omics technologies have made it possible to investigate the prognostic impact of markers at multiple molecular levels, including genomics, epigenomics, transcriptomics, proteomics and metabolomics, and how these potential risk factors complement clinical characterization of patient outcomes for survival prognosis. However, the massive sizes of the omics datasets, along with their correlation structures, pose challenges for studying relationships between the molecular information and patients' survival outcomes. RESULTS We present a general workflow for survival analysis that is applicable to high-dimensional omics data as inputs when identifying survival-associated features and validating survival models. In particular, we focus on the commonly used Cox-type penalized regressions and hierarchical Bayesian models for feature selection in survival analysis, which are especially useful for high-dimensional data, but the framework is applicable more generally. AVAILABILITY AND IMPLEMENTATION A step-by-step R tutorial using The Cancer Genome Atlas survival and omics data for the execution and evaluation of survival models has been made available at https://ocbe-uio.github.io/survomics.
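For orientation, the Cox-type penalized regressions mentioned in this abstract can be written in the generic form below. This is a standard textbook formulation, not notation taken from the paper: the symbols (observed times t_i, event indicators δ_i, feature vectors x_i, tuning parameters λ and α) are assumptions introduced here, and scaling conventions differ between software packages.

```latex
\hat{\beta} = \arg\min_{\beta}\Big\{ -\tfrac{1}{n}\,\ell(\beta)
  + \lambda\Big(\alpha\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2\Big)\Big\},
\qquad
\ell(\beta) = \sum_{i:\,\delta_i = 1}\Big[\, x_i^{\top}\beta
  - \log\sum_{j:\,t_j \ge t_i} \exp\big(x_j^{\top}\beta\big)\Big]
```

With α = 1 the penalty is the lasso, with α = 0 it is ridge, and intermediate values give the elastic net.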
Affiliation(s)
- Zhi Zhao
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo 0372, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo 0310, Norway
- John Zobolas
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo 0372, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo 0310, Norway
- Manuela Zucknick
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo 0372, Norway
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Research Support Services, Oslo University Hospital, Oslo 0372, Norway
- Tero Aittokallio
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo 0372, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo 0310, Norway
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki FI-00014, Finland
2
Huang TJ, Luedtke A, McKeague IW. Efficient estimation of the maximal association between multiple predictors and a survival outcome. Ann Stat 2023; 51:1965-1988. [PMID: 38405375 PMCID: PMC10888526 DOI: 10.1214/23-aos2313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
This paper develops a new approach to post-selection inference for screening high-dimensional predictors of survival outcomes. Post-selection inference for right-censored outcome data has been investigated in the literature, but much remains to be done to make the methods both reliable and computationally-scalable in high-dimensions. Machine learning tools are commonly used to provide predictions of survival outcomes, but the estimated effect of a selected predictor suffers from confirmation bias unless the selection is taken into account. The new approach involves the construction of semi-parametrically efficient estimators of the linear association between the predictors and the survival outcome, which are used to build a test statistic for detecting the presence of an association between any of the predictors and the outcome. Further, a stabilization technique reminiscent of bagging allows a normal calibration for the resulting test statistic, which enables the construction of confidence intervals for the maximal association between predictors and the outcome and also greatly reduces computational cost. Theoretical results show that this testing procedure is valid even when the number of predictors grows superpolynomially with sample size, and our simulations support this asymptotic guarantee at moderate sample sizes. The new approach is applied to the problem of identifying patterns in viral gene expression associated with the potency of an antiviral drug.
Affiliation(s)
- Tzu-Jung Huang
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center
- Alex Luedtke
  - Department of Statistics, University of Washington
3
Rahnenführer J, De Bin R, Benner A, Ambrogi F, Lusa L, Boulesteix AL, Migliavacca E, Binder H, Michiels S, Sauerbrei W, McShane L. Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges. BMC Med 2023; 21:182. [PMID: 37189125 DOI: 10.1186/s12916-023-02858-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 12/28/2022] [Accepted: 04/03/2023] [Indexed: 05/17/2023] Open
Abstract
BACKGROUND In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. METHODS Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. RESULTS The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. 
CONCLUSIONS This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
Affiliation(s)
- Axel Benner
  - Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Federico Ambrogi
  - Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy
  - Scientific Directorate, IRCCS Policlinico San Donato, San Donato Milanese, Italy
- Lara Lusa
  - Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technology, University of Primorska, Koper, Slovenia
  - Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
- Anne-Laure Boulesteix
  - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Harald Binder
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Stefan Michiels
  - Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France
  - Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France
- Willi Sauerbrei
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Lisa McShane
  - Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, USA
4
Samuelsen SO, Aalen OO. Special issue dedicated to Ørnulf Borgan. Lifetime Data Analysis 2023; 29:253-255. [PMID: 36807014 PMCID: PMC9937859 DOI: 10.1007/s10985-023-09592-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2023] [Accepted: 02/01/2023] [Indexed: 06/18/2023]
Affiliation(s)
- O. O. Aalen
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
5
De Bin R, Stikbakke VG. A boosting first-hitting-time model for survival analysis in high-dimensional settings. Lifetime Data Analysis 2023; 29:420-440. [PMID: 35476164 PMCID: PMC10006065 DOI: 10.1007/s10985-022-09553-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/03/2021] [Accepted: 03/25/2022] [Indexed: 06/13/2023]
Abstract
In this paper we propose a boosting algorithm to extend the applicability of a first hitting time model to high-dimensional frameworks. Based on an underlying stochastic process, first hitting time models do not require the proportional hazards assumption, hardly verifiable in the high-dimensional context, and represent a valid parametric alternative to the Cox model for modelling time-to-event responses. First hitting time models also offer a natural way to integrate low-dimensional clinical and high-dimensional molecular information in a prediction model, that avoids complicated weighting schemes typical of current methods. The performance of our novel boosting algorithm is illustrated in three real data examples.
Affiliation(s)
- Riccardo De Bin
  - Department of Mathematics, University of Oslo, Moltke Moes vei 35, 0851 Oslo, Norway
6
Ng HM, Jiang B, Wong KY. Penalized estimation of a class of single-index varying-coefficient models for integrative genomic analysis. Biom J 2023; 65:e2100139. [PMID: 35837982 DOI: 10.1002/bimj.202100139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2021] [Revised: 04/15/2022] [Accepted: 05/27/2022] [Indexed: 01/17/2023]
Abstract
Recent technological advances have made it possible to collect high-dimensional genomic data along with clinical data on a large number of subjects. In the studies of chronic diseases such as cancer, it is of great interest to integrate clinical and genomic data to build a comprehensive understanding of the disease mechanisms. Despite extensive studies on integrative analysis, it remains an ongoing challenge to model the interaction effects between clinical and genomic variables, due to high dimensionality of the data and heterogeneity across data types. In this paper, we propose an integrative approach that models interaction effects using a single-index varying-coefficient model, where the effects of genomic features can be modified by clinical variables. We propose a penalized approach for separate selection of main and interaction effects. Notably, the proposed methods can be applied to right-censored survival outcomes based on a Cox proportional hazards model. We demonstrate the advantages of the proposed methods through extensive simulation studies and provide applications to a motivating cancer genomic study.
Affiliation(s)
- Hoi Min Ng
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
- Binyan Jiang
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
- Kin Yau Wong
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
7
Jardillier R, Koca D, Chatelain F, Guyon L. Prognosis of lasso-like penalized Cox models with tumor profiling improves prediction over clinical data alone and benefits from bi-dimensional pre-screening. BMC Cancer 2022; 22:1045. [PMID: 36199072 PMCID: PMC9533541 DOI: 10.1186/s12885-022-10117-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2022] [Accepted: 09/14/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Prediction of patient survival from tumor molecular '-omics' data is a key step toward personalized medicine. Cox models fitted to RNA profiling datasets are popular for clinical outcome predictions. But these models are applied in the context of "high dimension", as the number p of covariates (gene expressions) greatly exceeds the number n of patients and e of events. Thus, pre-screening together with penalization methods is widely used for dimension reduction. METHODS In the present paper, (i) we benchmark the performance of the lasso penalization and three variants (i.e., ridge, elastic net, adaptive elastic net) on 16 cancers from TCGA after pre-screening, (ii) we propose a bi-dimensional pre-screening procedure based on both gene variability and p-values from single variable Cox models to predict survival, and (iii) we compare our results with iterative sure independence screening (ISIS). RESULTS First, we show that integration of mRNA-seq data with clinical data improves predictions over clinical data alone. Second, our bi-dimensional pre-screening procedure can only improve, in moderation, the C-index and/or the integrated Brier score, while excluding irrelevant genes for prediction. We demonstrate that the different penalization methods reached comparable prediction performances, with slight differences among datasets. Finally, we provide advice in the case of multi-omics data integration. CONCLUSIONS Tumor profiles convey more prognostic information than clinical variables such as stage for many cancer subtypes. Lasso and ridge penalizations perform similarly to elastic net penalizations for Cox models in high dimension. Pre-screening of the top 200 genes in terms of single variable Cox model p-values is a practical way to reduce dimension, which may be particularly useful when integrating multi-omics.
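The pre-screening idea in this abstract (filter genes on variability and on single-variable Cox p-values) can be illustrated with a minimal pure-Python sketch. Several things here are assumptions, not the authors' procedure: the two criteria are applied sequentially, the p-value comes from a Cox score test at beta = 0 (a stand-in for whatever univariate Cox test the paper uses), and the function names, the `expr` dictionary layout, and the `n_variable` / `n_keep` parameters are hypothetical.

```python
import math
from statistics import pvariance


def cox_score_test_pvalue(x, times, events):
    """Two-sided p-value of the univariate Cox score test at beta = 0.

    x, times, events are equal-length lists; events[i] is 1 for an observed
    event and 0 for right-censoring. Ties are handled Breslow-style.
    """
    score, info = 0.0, 0.0
    for i in range(len(x)):
        if not events[i]:
            continue  # censored subjects contribute only through risk sets
        risk_set = [x[j] for j in range(len(x)) if times[j] >= times[i]]
        mean = sum(risk_set) / len(risk_set)
        score += x[i] - mean
        info += sum((v - mean) ** 2 for v in risk_set) / len(risk_set)
    if info == 0.0:
        return 1.0  # no variation in any risk set: no evidence of association
    z = score / math.sqrt(info)
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


def prescreen(expr, times, events, n_variable, n_keep):
    """Sequential two-criterion filter (an assumed reading of 'bi-dimensional').

    expr maps a gene name to its list of expression values, one per patient.
    """
    # Step 1: keep the n_variable genes with the largest variance.
    by_var = sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)[:n_variable]
    # Step 2: among those, keep the n_keep genes with the smallest p-values.
    pvals = {g: cox_score_test_pvalue(expr[g], times, events) for g in by_var}
    return sorted(by_var, key=lambda g: pvals[g])[:n_keep]
```

In the abstract's setting, `n_keep` would play the role of the "top 200 genes"; the retained genes would then be passed to a penalized Cox model.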
Affiliation(s)
- Rémy Jardillier
  - IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
  - GIPSA-lab, Institute of Engineering University Grenoble Alpes, Univ. Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France
- Dzenis Koca
  - IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
- Florent Chatelain
  - GIPSA-lab, Institute of Engineering University Grenoble Alpes, Univ. Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France
- Laurent Guyon
  - IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
8
Zheng X, Amos CI, Frost HR. Pan-cancer evaluation of gene expression and somatic alteration data for cancer prognosis prediction. BMC Cancer 2021; 21:1053. [PMID: 34563154 PMCID: PMC8467202 DOI: 10.1186/s12885-021-08796-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/09/2021] [Accepted: 08/16/2021] [Indexed: 02/04/2023] Open
Abstract
BACKGROUND Over the past decades, approaches for diagnosing and treating cancer have seen significant improvement. However, the variability of patient and tumor characteristics has limited progress on methods for prognosis prediction. The development of high-throughput omics technologies now provides multiple approaches for characterizing tumors. Although a large number of published studies have focused on integration of multi-omics data and use of pathway-level models for cancer prognosis prediction, there still exists a gap of knowledge regarding the prognostic landscape across multi-omics data for multiple cancer types using both gene-level and pathway-level predictors. METHODS In this study, we systematically evaluated three often available types of omics data (gene expression, copy number variation and somatic point mutation) covering both DNA-level and RNA-level features. We evaluated the landscape of predictive performance of these three omics modalities for 33 cancer types in the TCGA using a Lasso or Group Lasso-penalized Cox model and either gene or pathway level predictors. RESULTS We constructed the prognostic landscape using three types of omics data for 33 cancer types on both the gene and pathway levels. Based on this landscape, we found that predictive performance is cancer type dependent and we also highlighted the cancer types and omics modalities that support the most accurate prognostic models. In general, models estimated on gene expression data provide the best predictive performance on either gene or pathway level and adding copy number variation or somatic point mutation data to gene expression data does not improve predictive performance, with some exceptional cohorts including low grade glioma and thyroid cancer. In general, pathway-level models have better interpretative performance, higher stability and smaller model size across multiple cancer types and omics data types relative to gene-level models. 
CONCLUSIONS Based on this landscape and comprehensive comparison, models estimated on gene expression data provide the best predictive performance at either the gene or pathway level. Pathway-level models have better interpretative performance, higher stability and smaller model size relative to gene-level models.
Affiliation(s)
- Xingyu Zheng
  - Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
- Christopher I Amos
  - Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
  - Department of Medicine, Institute for Clinical and Translational Research, Baylor College of Medicine, Houston, TX, USA
- H Robert Frost
  - Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
9
Wang G, Qiu C, Zhang C, Hou S, Zhang Q. Construction of a DLBCL Prognostic Signature Based on Tumor Microenvironment. Expert Rev Hematol 2021; 14:679-686. [PMID: 34139942 DOI: 10.1080/17474086.2021.1943349] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/27/2022]
Abstract
BACKGROUND Diffuse large B-cell lymphoma (DLBCL) is a common, curable non-Hodgkin's lymphoma. Patients with this disease can be cured after R-CHOP immunochemotherapy (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone). Nonetheless, most cured patients will relapse and have a dismal prognosis. In this study, we aim to identify a potential biomarker by analyzing gene expression data, and to predict patient survival rates by constructing a risk model. METHODS Firstly, mRNA chip data (GSE87371) and clinical data of DLBCL patients were obtained from Gene Expression Omnibus (GEO). Samples were scored with the estimate package. The obtained stromal score (P < 0.05) and ESTIMATE score (P < 0.05) were significantly correlated with the prognosis. Differentially expressed genes (DEGs) screened through the above two scoring methods were intersected and 279 DEGs were obtained. Next, five feature genes (CD163, CLEC4A, COL15A1, GABRB2, IFIT3) were identified by univariate Cox, LASSO and multivariate Cox regression analyses to establish a risk evaluation model. Thereafter, the 5-gene risk model was validated on a validation set. ROC and survival analyses were performed to assess the performance of the model. RESULTS Further analysis showed that the risk model was capable of independently determining the prognosis of patients, and a nomogram was subsequently established. CONCLUSIONS We screened DEGs related to ESTIMATE and stromal scores from the GEO database, and established a 5-gene prognostic signature through Cox regression analysis and LASSO analysis. The risk model and nomogram will help individuals accurately predict the prognosis of DLBCL patients.
Affiliation(s)
- Ganggang Wang
  - Department of Lymphatic Oncology, Cancer Center of Shanxi Bethune Hospital, Shanxi, China
- Chen Qiu
  - Department of Lymphatic Oncology, Cancer Center of Shanxi Bethune Hospital, Shanxi, China
- Chan Zhang
  - Graduate School of Shanxi Medical University, Shanxi, China
- Shuling Hou
  - Department of Lymphatic Oncology, Cancer Center of Shanxi Bethune Hospital, Shanxi, China
- Qiaohua Zhang
  - Department of Lymphatic Oncology, Cancer Center of Shanxi Bethune Hospital, Shanxi, China
10
Engebretsen S, Glad IK. Partially linear monotone methods with automatic variable selection and monotonicity direction discovery. Stat Med 2020; 39:3549-3568. [PMID: 32851696 DOI: 10.1002/sim.8680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2020] [Revised: 05/07/2020] [Accepted: 06/10/2020] [Indexed: 11/10/2022]
Abstract
In many statistical regression and prediction problems, it is reasonable to assume monotone relationships between certain predictor variables and the outcome. Genomic effects on phenotypes are, for instance, often assumed to be monotone. However, in some settings, it may be reasonable to assume a partially linear model, where some of the covariates can be assumed to have a linear effect. One example is a prediction model using both high-dimensional gene expression data, and low-dimensional clinical data, or when combining continuous and categorical covariates. We study methods for fitting the partially linear monotone model, where some covariates are assumed to have a linear effect on the response, and some are assumed to have a monotone (potentially nonlinear) effect. Most existing methods in the literature for fitting such models are subject to the limitation that they have to be provided the monotonicity directions a priori for the different monotone effects. We here present methods for fitting partially linear monotone models which perform both automatic variable selection, and monotonicity direction discovery. The proposed methods perform comparably to, or better than, existing methods, in terms of estimation, prediction, and variable selection performance, in simulation experiments in both classical and high-dimensional data settings.
Affiliation(s)
- Ingrid K Glad
  - Department of Mathematics, University of Oslo, Oslo, Norway
11
Herrmann M, Probst P, Hornung R, Jurinovic V, Boulesteix AL. Large-scale benchmark study of survival prediction methods using multi-omics data. Brief Bioinform 2020; 22:5895463. [PMID: 32823283 PMCID: PMC8138887 DOI: 10.1093/bib/bbaa167] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 03/07/2020] [Revised: 06/25/2020] [Accepted: 07/03/2020] [Indexed: 12/18/2022] Open
Abstract
Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database 'The Cancer Genome Atlas' (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan-Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno's C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups (especially clinical variables) from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact: moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. All analyses are reproducible using R code freely available on GitHub.
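To make the concordance metric mentioned in this abstract concrete, here is a minimal pure-Python sketch of Harrell's C-index. Note the substitution: the study reports Uno's C-index, which additionally reweights pairs by the inverse probability of censoring; the simpler Harrell version is shown only to illustrate how comparable pairs are counted. The function name and argument layout are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: observed follow-up times; events: 1 if the event was observed,
    0 if censored; risk_scores: higher score means higher predicted risk.
    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's observed time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event had the higher predicted risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Four patients; the third is censored at time 6.
c = harrell_c_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.8, 0.1])
```

In this toy example there are five comparable pairs, four of which are ordered correctly, so `c` is 0.8; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.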
Affiliation(s)
- Moritz Herrmann
  - Department of Statistics, Ludwig Maximilian University, Munich, 80539, Germany
- Philipp Probst
  - Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Roman Hornung
  - Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Vindi Jurinovic
  - Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Anne-Laure Boulesteix
  - Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
12
Theilhaber J, Chiron M, Dreymann J, Bergstrom D, Pollard J. Construction and optimization of gene expression signatures for prediction of survival in two-arm clinical trials. BMC Bioinformatics 2020; 21:333. [PMID: 32711453 PMCID: PMC7382041 DOI: 10.1186/s12859-020-03655-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/04/2019] [Accepted: 07/13/2020] [Indexed: 11/17/2022] Open
Abstract
Background Gene expression signatures for the prediction of differential survival of patients undergoing anti-cancer therapies are of great interest because they can be used to prospectively stratify patients entering new clinical trials, or to determine optimal treatment for patients in more routine clinical settings. Unlike prognostic signatures however, predictive signatures require training set data from clinical studies with at least two treatment arms. As two-arm studies with gene expression profiling have been rarer than similar one-arm studies, the methodology for constructing and optimizing predictive signatures has been less prominently explored than for prognostic signatures. Results Focusing on two “use cases” of two-arm clinical trials, one for metastatic colorectal cancer (CRC) patients treated with the anti-angiogenic molecule aflibercept, and the other for triple negative breast cancer (TNBC) patients treated with the small molecule iniparib, we present derivation steps and quantitative and graphical tools for the construction and optimization of signatures for the prediction of progression-free survival based on cross-validated multivariate Cox models. This general methodology is organized around two more specific approaches which we have called subtype correlation (subC) and mechanism-of-action (MOA) modeling, each of which leverage a priori knowledge of molecular subtypes of tumors or drug MOA for a given indication. The tools and concepts presented here include the so-called differential log-hazard ratio, the survival scatter plot, the hazard ratio receiver operating characteristic, the area between curves and the patient selection matrix. In the CRC use case for instance, the resulting signature stratifies the patient population into “sensitive” and “relatively-resistant” groups achieving a more than two-fold difference in the aflibercept-to-control hazard ratios across signature-defined patient groups. 
Through cross-validation and resampling the probability of generalization of the signature to similar CRC data sets is predicted to be high. Conclusions The tools presented here should be of general use for building and using predictive multivariate signatures in oncology and in other therapeutic areas.
Affiliation(s)
- Marielle Chiron
  - Sanofi Oncology, Centre de Recherche de Vitry-Alfortville, 13 Quai Jules Guesde, 94400, Vitry-sur-Seine, France
- Jennifer Dreymann
  - Sanofi Oncology, Centre de Recherche de Vitry-Alfortville, 13 Quai Jules Guesde, 94400, Vitry-sur-Seine, France
- Jack Pollard
  - Sanofi Oncology, 270 Albany Street, Cambridge, MA, 02139, USA
13
De Bin R, Boulesteix AL, Benner A, Becker N, Sauerbrei W. Combining clinical and molecular data in regression prediction models: insights from a simulation study. Brief Bioinform 2019; 21:1904-1919. [PMID: 31750518 DOI: 10.1093/bib/bbz136] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 07/29/2019] [Revised: 09/20/2019] [Accepted: 10/07/2019] [Indexed: 12/15/2022] Open
Abstract
Data integration, i.e. the use of different sources of information for data analysis, is becoming one of the most important topics in modern statistics. Especially in, but not limited to, biomedical applications, a relevant issue is the combination of low-dimensional (e.g. clinical data) and high-dimensional (e.g. molecular data such as gene expressions) data sources in a prediction model. Not only the different characteristics of the data, but also the complex correlation structure within and between the two data sources, pose challenging issues. In this paper, we investigate these issues via simulations, providing some useful insight into strategies to combine low- and high-dimensional data in a regression prediction model. In particular, we focus on the effect of the correlation structure on the results, while accounting for the influence of our specific choices in the design of the simulation study.
Affiliation(s)
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Germany
- Axel Benner
- Division of Biostatistics, German Cancer Research Centre of Heidelberg, Germany
- Natalia Becker
- Division of Biostatistics, German Cancer Research Centre of Heidelberg, Germany
- Willi Sauerbrei
- Institute of Medical Biometry and Statistics, University of Freiburg, Germany
|
14
|
A Novel Predictor Tool of Biochemical Recurrence after Radical Prostatectomy Based on a Five-MicroRNA Tissue Signature. Cancers (Basel) 2019; 11:cancers11101603. [PMID: 31640261 PMCID: PMC6826532 DOI: 10.3390/cancers11101603] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 10/11/2019] [Accepted: 10/17/2019] [Indexed: 12/24/2022]
Abstract
Within five to ten years after radical prostatectomy (RP), approximately 15–34% of prostate cancer (PCa) patients experience biochemical recurrence (BCR), which is defined as recurrence of serum levels of prostate-specific antigen >0.2 µg/L, indicating probable cancer recurrence. Models using clinicopathological variables for predicting this risk lack accuracy. There is hope that new molecular biomarkers, like microRNAs (miRNAs), could improve risk prediction. Therefore, we evaluated the BCR prognostic capability of 20 miRNAs, which were selected by a systematic literature review. MiRNA expression was measured in formalin-fixed, paraffin-embedded (FFPE) tissue RP samples of 206 PCa patients by RT-qPCR. Univariate and multivariate Cox regression analyses were performed to assess the independent prognostic potential of miRNAs. Internal validation was performed using bootstrapping and the split-sample method. Five miRNAs (miR-30c-5p/31-5p/141-3p/148a-3p/miR-221-3p) were finally validated as independent prognostic biomarkers. Their prognostic ability and accuracy were evaluated using C-statistics of the obtained prognostic indices in the Cox regression, time-dependent receiver-operating characteristics, and decision curve analyses. Models of miRNAs, combined with relevant clinicopathological factors, were built. The five-miRNA panel outperformed clinically established BCR scoring systems, while their combination significantly improved predictive power compared with clinicopathological factors alone. We conclude that this miRNA-based predictor panel is worth including in future studies.
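The abstract above assesses prognostic indices with C-statistics. As a rough illustration of what such a statistic measures, the following minimal sketch computes Harrell's concordance index for right-censored data in plain Python. It is not the authors' code; the function name `concordance_index` and the toy inputs are assumptions made here for illustration only.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs whose risk ordering matches
    the ordering of the observed event times (ties in risk count as 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            # Pair (i, j) is comparable when i had an observed event
            # strictly before j's follow-up time.
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical toy data: follow-up times, event indicators (1 = event,
# 0 = censored) and a prognostic index where higher means higher risk.
toy_times = [1, 2, 3, 4]
toy_events = [1, 1, 0, 1]
toy_risk = [4.0, 3.0, 2.0, 1.0]
c_index = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering; the quadratic loop is meant for readability rather than for large cohorts.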
|
15
|
Huang TJ, McKeague IW, Qian M. Marginal screening for high-dimensional predictors of survival outcomes. Stat Sin 2019; 29:2105-2139. [PMID: 31938013 PMCID: PMC6959482 DOI: 10.5705/ss.202017.0298] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/10/2023]
Abstract
This study develops a marginal screening test to detect the presence of significant predictors for a right-censored time-to-event outcome under a high-dimensional accelerated failure time (AFT) model. Establishing a rigorous screening test in this setting is challenging, because of the right censoring and the post-selection inference. In the latter case, an implicit variable selection step needs to be included to avoid inflating the Type-I error. A prior study solved this problem by constructing an adaptive resampling test under an ordinary linear regression. To accommodate right censoring, we develop a new approach based on a maximally selected Koul-Susarla-Van Ryzin estimator from a marginal AFT working model. A regularized bootstrap method is used to calibrate the test. Our test is more powerful and less conservative than both a Bonferroni correction of the marginal tests and other competing methods. The proposed method is evaluated in simulation studies and applied to two real data sets.
Affiliation(s)
- Min Qian
- Department of Biostatistics, Columbia University
|
16
|
Volkmann A, De Bin R, Sauerbrei W, Boulesteix AL. A plea for taking all available clinical information into account when assessing the predictive value of omics data. BMC Med Res Methodol 2019; 19:162. [PMID: 31340753 PMCID: PMC6657034 DOI: 10.1186/s12874-019-0802-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 02/06/2019] [Accepted: 07/11/2019] [Indexed: 12/22/2022]
Abstract
Background Omics data can be very informative in survival analysis and may improve the prognostic ability of classical models based on clinical risk factors for various diseases, for example breast cancer. Recent research has focused on integrating omics and clinical data, yet has often ignored the need for appropriate model building for clinical variables. Medical literature on classical prognostic scores, as well as biostatistical literature on appropriate model selection strategies for low dimensional (clinical) data, are often ignored in the context of omics research. The goal of this paper is to fill this methodological gap by investigating the added predictive value of gene expression data for models using varying amounts of clinical information. Methods We analyze two data sets from the field of survival prognosis of breast cancer patients. First, we construct several proportional hazards prediction models using varying amounts of clinical information based on established medical knowledge. These models are then used as a starting point (i.e. included as a clinical offset) for identifying informative gene expression variables using resampling procedures and penalized regression approaches (model based boosting and the LASSO). In order to assess the added predictive value of the gene signatures, measures of prediction accuracy and separation are examined on a validation data set for the clinical models and the models that combine the two sources of information. Results For one data set, we do not find any substantial added predictive value of the omics data when compared to clinical models. On the second data set, we identify a noticeable added predictive value, however only for scenarios where little or no clinical information is included in the modeling process. We find that including more clinical information can lead to a smaller number of selected omics predictors. 
Conclusions New research using omics data should include all available established medical knowledge in order to allow an adequate evaluation of the added predictive value of omics data. Including all relevant clinical information in the analysis might also lead to more parsimonious models. The developed procedure to assess the predictive value of the omics data can be readily applied to other scenarios. Electronic supplementary material The online version of this article (10.1186/s12874-019-0802-0) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Alexander Volkmann
- Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany; Chair of Statistics, School of Business and Economics, Humboldt-Universität zu Berlin, Spandauer Straße 1, Berlin, 10178, Germany
- Riccardo De Bin
- Department of Mathematics, University of Oslo, Moltke Moes vei 35, Oslo, 0851, Norway
- Willi Sauerbrei
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Stefan-Meier-Straße 26, Freiburg, 79104, Germany
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany
|
17
|
López de Maturana E, Alonso L, Alarcón P, Martín-Antoniano IA, Pineda S, Piorno L, Calle ML, Malats N. Challenges in the Integration of Omics and Non-Omics Data. Genes (Basel) 2019; 10:genes10030238. [PMID: 30897838 PMCID: PMC6471713 DOI: 10.3390/genes10030238] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Received: 01/02/2019] [Revised: 03/05/2019] [Accepted: 03/14/2019] [Indexed: 11/16/2022]
Abstract
Omics data integration is already a reality. However, few omics-based algorithms show enough predictive ability to be implemented into clinics or public health domains. Clinical/epidemiological data tend to explain most of the variation of health-related traits, and its joint modeling with omics data is crucial to increase the algorithm’s predictive ability. Only a small number of published studies performed a “real” integration of omics and non-omics (OnO) data, mainly to predict cancer outcomes. Challenges in OnO data integration regard the nature and heterogeneity of non-omics data, the possibility of integrating large-scale non-omics data with high-throughput omics data, the relationship between OnO data (i.e., ascertainment bias), the presence of interactions, the fairness of the models, and the presence of subphenotypes. These challenges demand the development and application of new analysis strategies to integrate OnO data. In this contribution we discuss different attempts of OnO data integration in clinical and epidemiological studies. Most of the reviewed papers considered only one type of omics data set, mainly RNA expression data. All selected papers incorporated non-omics data in a low-dimensionality fashion. The integrative strategies used in the identified papers adopted three modeling methods: Independent, conditional, and joint modeling. This review presents, discusses, and proposes integrative analytical strategies towards OnO data integration.
Affiliation(s)
- Evangelina López de Maturana
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
- Lola Alonso
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
- Pablo Alarcón
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
- Isabel Adoración Martín-Antoniano
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
- Silvia Pineda
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
- Lucas Piorno
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
- M Luz Calle
- Biosciences Department, University of Vic-Central University of Catalonia, Carrer de la Laura 13, 08570 Vic, Spain.
- Núria Malats
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
|
18
|
Liang X, Li H, Coussy F, Callens C, Lerebours F. An update on biomarkers of potential benefit with bevacizumab for breast cancer treatment: Do we make progress? Chin J Cancer Res 2019; 31:586-600. [PMID: 31564802 PMCID: PMC6736652 DOI: 10.21147/j.issn.1000-9604.2019.04.03] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 02/06/2023]
Abstract
As the first monoclonal antibody against vascular endothelial growth factor (VEGF), bevacizumab (BEV) is a controversial antiangiogenic therapy in breast cancer. The initial excitement over improvements in progression-free survival (PFS) with BEV was tempered by an absence of overall survival (OS) benefit and by serious adverse effects. The lack of a defined target population prompted us to identify predictive biomarkers for BEV efficacy. In this review we focus on research in breast cancer and summarize recent investigations of clinical, radiological, molecular and gene profiling markers of BEV efficacy, including new results from randomized phase III clinical trials evaluating the efficacy of BEV in combination with comprehensive biomarker analyses. Current evidence indicates some predictive value for genetic variants, molecular imaging, VEGF pathway factors or associated factors in peripheral blood, and gene profiling. The current challenge is to validate those potential biomarkers and implement them in clinical practice.
Affiliation(s)
- Xu Liang
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Breast Oncology, Peking University Cancer Hospital & Institute, Beijing 100142, China; Pharmacogenomic Unit, Department of Genetics, Curie Institute, PSL Research University, Paris 75005, France
- Huiping Li
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Breast Oncology, Peking University Cancer Hospital & Institute, Beijing 100142, China
- Florence Coussy
- Department of Medical Oncology, Institut Curie, PSL Research University, Paris 75005, France
- Celine Callens
- Pharmacogenomic Unit, Department of Genetics, Curie Institute, PSL Research University, Paris 75005, France
- Florence Lerebours
- Department of Medical Oncology, Institut Curie, René Huguenin Hospital, Saint-Cloud 92210, France
|
19
|
Bazzoli C, Lambert-Lacroix S. Classification based on extensions of LS-PLS using logistic regression: application to clinical and multiple genomic data. BMC Bioinformatics 2018; 19:314. [PMID: 30189832 PMCID: PMC6127926 DOI: 10.1186/s12859-018-2311-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 01/09/2018] [Accepted: 08/13/2018] [Indexed: 01/02/2023]
Abstract
BACKGROUND To address high-dimensional genomic data, most of the proposed prediction methods make use of genomic data alone without considering clinical data, which are often available and known to have predictive value. Recent studies suggest that combining clinical and genomic information may improve predictions. We consider here methods for classification purposes that simultaneously use both types of variables but apply dimensionality reduction only to the high-dimensional genomic ones. RESULTS Using partial least squares (PLS), we propose some one-step approaches based on three extensions of the least squares (LS)-PLS method for logistic regression. A comparison of their prediction performances via a simulation and on real data sets from cancer studies is conducted. CONCLUSION In general, those methods using only clinical data or only genomic data perform poorly. The advantage of using LS-PLS methods for classification and their performances are shown and then used to analyze clinical and genomic data. The corresponding prediction results are encouraging and stable regardless of the data set and/or number of selected features. These extensions have been implemented in the R package lsplsGlm to enhance their use.
Affiliation(s)
- Caroline Bazzoli
- Laboratoire Jean Kuntzman, Univ. Grenoble-Alpes, 700 avenue centrale, Saint Martin d’Hères, 38401 France
|
20
|
Tang Z, Shen Y, Zhang X, Yi N. The spike-and-slab lasso Cox model for survival prediction and associated genes detection. Bioinformatics 2018; 33:2799-2807. [PMID: 28472220 DOI: 10.1093/bioinformatics/btx300] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Received: 02/02/2017] [Accepted: 05/05/2017] [Indexed: 12/20/2022]
Abstract
Motivation Large-scale molecular profiling data have offered extraordinary opportunities to improve survival prediction of cancers and other diseases and to detect disease associated genes. However, there are considerable challenges in analyzing large-scale molecular data. Results We propose new Bayesian hierarchical Cox proportional hazards models, called the spike-and-slab lasso Cox, for predicting survival outcomes and detecting associated genes. We also develop an efficient algorithm to fit the proposed models by incorporating Expectation-Maximization steps into the extremely fast cyclic coordinate descent algorithm. The performance of the proposed method is assessed via extensive simulations and compared with the lasso Cox regression. We demonstrate the proposed procedure on two cancer datasets with censored survival outcomes and thousands of molecular features. Our analyses suggest that the proposed procedure can generate powerful prognostic models for predicting cancer survival and can detect associated genes. Availability and implementation The methods have been implemented in a freely available R package BhGLM ( http://www.ssg.uab.edu/bhglm/ ). Contact nyi@uab.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zaixiang Tang
- Department of Biostatistics, School of Public Health; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, and Center for Genetic Epidemiology and Genomics, Medical College of Soochow University, Suzhou 215123, China; Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Yueping Shen
- Department of Biostatistics, School of Public Health; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, and Center for Genetic Epidemiology and Genomics, Medical College of Soochow University, Suzhou 215123, China
- Xinyan Zhang
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Nengjun Yi
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
|
21
|
Mendiola M, Martínez-Marin V, Herranz J, Heredia V, Yébenes L, Zamora P, Castelo B, Pinto Á, Miguel M, Díaz E, Gámez A, Fresno JÁ, Ramírez de Molina A, Hardisson D, Espinosa E, Redondo A. Predictive value of angiogenesis-related gene profiling in patients with HER2-negative metastatic breast cancer treated with bevacizumab and weekly paclitaxel. Oncotarget 2018; 7:24217-27. [PMID: 26992213 PMCID: PMC5029696 DOI: 10.18632/oncotarget.8128] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 11/29/2015] [Accepted: 02/25/2016] [Indexed: 01/15/2023]
Abstract
Bevacizumab plus weekly paclitaxel improves progression-free survival (PFS) in HER2-negative metastatic breast cancer (mBC), but its use has been questioned due to the absence of a predictive biomarker, lack of benefit in overall survival (OS) and increased toxicity. We examined the baseline tumor angiogenic-related gene expression of 60 patients with mBC with the aim of finding a signature that predicts benefit from this drug. Multivariate analysis by Lasso-penalized Cox regression generated two predictive models: one, named G-model, including 11 genes, and the other one, named GC-model, including 13 genes plus 5 clinical covariates. Both models identified patients with improved PFS (HR (Hazard Ratio) 2.57 and 4.04, respectively) and OS (HR 3.29 and 3.43, respectively). The G-model distinguished low and high risk patients in the first 6 months, whereas the GC-model maintained significance over time.
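Lasso-penalized Cox regression, as used in the abstract above, minimizes the negative Cox partial log-likelihood plus an L1 penalty on the coefficients. The sketch below only evaluates that objective for a given coefficient vector, assuming no tied event times; it does not fit a model and is not the authors' implementation. The function name `lasso_cox_objective` and the toy data are hypothetical.

```python
import math


def lasso_cox_objective(times, events, X, beta, lam):
    """Negative Cox partial log-likelihood plus lam * ||beta||_1.

    times  : follow-up times
    events : 1 if the event was observed, 0 if censored
    X      : list of covariate rows, one per subject
    beta   : coefficient vector
    lam    : non-negative lasso penalty weight
    Assumes no tied event times (no tie correction is applied).
    """
    # Linear predictor eta_i = x_i . beta for every subject.
    eta = [sum(b * v for b, v in zip(beta, row)) for row in X]
    log_lik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_sum = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        log_lik += eta[i] - math.log(risk_sum)
    penalty = lam * sum(abs(b) for b in beta)
    return -log_lik + penalty


# Hypothetical toy data with a single covariate.
toy_times = [1.0, 2.0, 3.0]
toy_events = [1, 1, 1]
toy_X = [[0.5], [1.0], [-1.0]]
value_at_zero = lasso_cox_objective(toy_times, toy_events, toy_X, [0.0], 0.3)
```

Larger values of `lam` shrink more coefficients exactly to zero when the objective is minimized, which is how a gene signature of limited size can emerge from many candidate genes.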
Affiliation(s)
- Marta Mendiola
- Molecular Pathology and Therapeutic Targets Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- Virginia Martínez-Marin
- Department of Medical Oncology, La Paz University Hospital, Madrid, Spain; Translational Oncology Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- Jesús Herranz
- IMDEA, Campus de Excelencia Internacional CEI (UAM-CSIC), Madrid, Spain
- Victoria Heredia
- Molecular Pathology and Therapeutic Targets Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- Laura Yébenes
- Department of Pathology, La Paz University Hospital, Madrid, Spain
- Pilar Zamora
- Department of Medical Oncology, La Paz University Hospital, Madrid, Spain; Translational Oncology Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- Beatriz Castelo
- Department of Medical Oncology, La Paz University Hospital, Madrid, Spain; Translational Oncology Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- Álvaro Pinto
- Department of Medical Oncology, La Paz University Hospital, Madrid, Spain; Translational Oncology Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- María Miguel
- Molecular Pathology and Therapeutic Targets Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- Esther Díaz
- Translational Oncology Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- Angelo Gámez
- Translational Oncology Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- Juan Ángel Fresno
- Translational Oncology Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- David Hardisson
- Molecular Pathology and Therapeutic Targets Group, La Paz University Hospital - IdiPAZ, Madrid, Spain; Department of Pathology, La Paz University Hospital, Madrid, Spain
- Enrique Espinosa
- Department of Medical Oncology, La Paz University Hospital, Madrid, Spain; Translational Oncology Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
- Andrés Redondo
- Department of Medical Oncology, La Paz University Hospital, Madrid, Spain; Translational Oncology Group, La Paz University Hospital - IdiPAZ, Madrid, Spain
|
22
|
On the choice and influence of the number of boosting steps for high-dimensional linear Cox-models. Comput Stat 2017. [DOI: 10.1007/s00180-017-0773-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 10/18/2022]
|
23
|
Jylhävä J, Kananen L, Raitanen J, Marttila S, Nevalainen T, Hervonen A, Jylhä M, Hurme M. Methylomic predictors demonstrate the role of NF-κB in old-age mortality and are unrelated to the aging-associated epigenetic drift. Oncotarget 2017; 7:19228-41. [PMID: 27015559 PMCID: PMC4991378 DOI: 10.18632/oncotarget.8278] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 11/06/2015] [Accepted: 03/10/2016] [Indexed: 01/24/2023]
Abstract
Changes in the DNA methylation (DNAm) landscape have been implicated in aging and cellular senescence. To unravel the role of specific DNAm patterns in late-life survival, we performed genome-wide methylation profiling in nonagenarians (n=111) and determined the performance of the methylomic predictors and conventional risk markers in a longitudinal setting. The survival model containing only the methylomic markers was superior in terms of predictive accuracy compared with the model containing only the conventional predictors or the model containing conventional predictors combined with the methylomic markers. At the 2.55-year follow-up, we identified 19 mortality-associated (false-discovery rate <0.5) CpG sites that mapped to genes functionally clustering around the nuclear factor kappa B (NF-κB) complex. Interestingly, none of the mortality-associated CpG sites overlapped with the established aging-associated DNAm sites. Our results are in line with previous findings on the role of NF-κB in controlling animal life spans and demonstrate the role of this complex in human longevity.
Affiliation(s)
- Juulia Jylhävä
- Department of Microbiology and Immunology, School of Medicine, University of Tampere, Tampere, Finland; Gerontology Research Center, University of Tampere, Tampere, Finland
- Laura Kananen
- Department of Microbiology and Immunology, School of Medicine, University of Tampere, Tampere, Finland; Gerontology Research Center, University of Tampere, Tampere, Finland
- Jani Raitanen
- School of Health Sciences, University of Tampere, Tampere, Finland; UKK Institute for Health Promotion Research, Tampere, Finland
- Saara Marttila
- Department of Microbiology and Immunology, School of Medicine, University of Tampere, Tampere, Finland; Gerontology Research Center, University of Tampere, Tampere, Finland
- Tapio Nevalainen
- Department of Microbiology and Immunology, School of Medicine, University of Tampere, Tampere, Finland; Gerontology Research Center, University of Tampere, Tampere, Finland
- Antti Hervonen
- Gerontology Research Center, University of Tampere, Tampere, Finland; School of Health Sciences, University of Tampere, Tampere, Finland
- Marja Jylhä
- Gerontology Research Center, University of Tampere, Tampere, Finland; School of Health Sciences, University of Tampere, Tampere, Finland
- Mikko Hurme
- Department of Microbiology and Immunology, School of Medicine, University of Tampere, Tampere, Finland; Gerontology Research Center, University of Tampere, Tampere, Finland; Fimlab Laboratories, Tampere, Finland
|
24
|
Tissue-Based MicroRNAs as Predictors of Biochemical Recurrence after Radical Prostatectomy: What Can We Learn from Past Studies? Int J Mol Sci 2017; 18:ijms18102023. [PMID: 28934131 PMCID: PMC5666705 DOI: 10.3390/ijms18102023] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 08/18/2017] [Revised: 09/16/2017] [Accepted: 09/19/2017] [Indexed: 12/17/2022]
Abstract
With the increasing understanding of the molecular mechanism of the microRNAs (miRNAs) in prostate cancer (PCa), the predictive potential of miRNAs has received more attention by clinicians and laboratory scientists. Compared with the traditional prognostic tools based on clinicopathological variables, including the prostate-specific antigen, miRNAs may be helpful novel molecular biomarkers of biochemical recurrence for a more accurate risk stratification of PCa patients after radical prostatectomy and may contribute to personalized treatment. Tissue samples from prostatectomy specimens are easily available for miRNA isolation. Numerous studies from different countries have investigated the role of tissue-miRNAs as independent predictors of disease recurrence, either alone or in combination with other clinicopathological factors. For this purpose, a PubMed search was performed for articles published between 2008 and 2017. We compiled a profile of dysregulated miRNAs as potential predictors of biochemical recurrence and discussed their current clinical relevance. Because of differences in analytics, insufficient power and the heterogeneity of studies, and different statistical evaluation methods, limited consistency in results was obvious. Prospective multi-institutional studies with larger sample sizes, harmonized analytics, well-structured external validations, and reasonable study designs are necessary to assess the real prognostic information of miRNAs, in combination with conventional clinicopathological factors, as predictors of biochemical recurrence.
|
25
|
Ray B, Liu W, Fenyö D. Adaptive Multiview Nonnegative Matrix Factorization Algorithm for Integration of Multimodal Biomedical Data. Cancer Inform 2017; 16:1176935117725727. [PMID: 28835735 PMCID: PMC5564898 DOI: 10.1177/1176935117725727] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/23/2017] [Accepted: 07/08/2017] [Indexed: 11/16/2022]
Abstract
The amounts and types of available multimodal tumor data are rapidly increasing, and their integration is critical for fully understanding the underlying cancer biology and personalizing treatment. However, the development of methods for effectively integrating multimodal data in a principled manner is lagging behind our ability to generate the data. In this article, we introduce an extension to a multiview nonnegative matrix factorization algorithm (NNMF) for dimensionality reduction and integration of heterogeneous data types and compare the predictive modeling performance of the method on unimodal and multimodal data. We also present a comparative evaluation of our novel multiview approach and current data integration methods. Our work provides an efficient method to extend an existing dimensionality reduction method. We report rigorous evaluation of the method on large-scale quantitative protein and phosphoprotein tumor data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) acquired using state-of-the-art liquid chromatography mass spectrometry. Exome sequencing and RNA-Seq data were also available from The Cancer Genome Atlas for the same tumors. For unimodal data, in case of breast cancer, transcript levels were most predictive of estrogen and progesterone receptor status and copy number variation of human epidermal growth factor receptor 2 status. For ovarian and colon cancers, phosphoprotein and protein levels were most predictive of tumor grade and stage and residual tumor, respectively. When multiview NNMF was applied to multimodal data to predict outcomes, the improvement in performance is not overall statistically significant beyond unimodal data, suggesting that proteomics data may contain more predictive information regarding tumor phenotypes than transcript levels, probably due to the fact that proteins are the functional gene products and therefore a more direct measurement of the functional state of the tumor. 
Here, we have applied our proposed approach to multimodal molecular data for tumors, but it is generally applicable to dimensionality reduction and joint analysis of any type of multimodal data.
Affiliation(s)
- Bisakha Ray
- Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
- Wenke Liu
- Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
- David Fenyö
- Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
|
26
|
Ternès N, Rotolo F, Michiels S. Robust estimation of the expected survival probabilities from high-dimensional Cox models with biomarker-by-treatment interactions in randomized clinical trials. BMC Med Res Methodol 2017; 17:83. [PMID: 28532387 PMCID: PMC5441049 DOI: 10.1186/s12874-017-0354-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 12/13/2016] [Accepted: 04/27/2017] [Indexed: 11/10/2022]
Abstract
Background Thanks to the advances in genomics and targeted treatments, more and more prediction models based on biomarkers are being developed to predict potential benefit from treatments in a randomized clinical trial. Although the methodological framework for the development and validation of prediction models in a high-dimensional setting is becoming increasingly established, no clear guidance exists yet on how to estimate expected survival probabilities in a penalized model with biomarker-by-treatment interactions. Methods Based on a parsimonious biomarker selection in a penalized high-dimensional Cox model (lasso or adaptive lasso), we propose a unified framework to: estimate internally the predictive accuracy metrics of the developed model (using double cross-validation); estimate the individual survival probabilities at a given timepoint; construct confidence intervals thereof (analytical or bootstrap); and visualize them graphically (pointwise or smoothed with spline). We compared these strategies through a simulation study covering scenarios with or without biomarker effects. We applied the strategies to a large randomized phase III clinical trial that evaluated the effect of adding trastuzumab to chemotherapy in 1574 early breast cancer patients, for which the expression of 462 genes was measured. Results In our simulations, penalized regression models using the adaptive lasso estimated the survival probability of new patients with low bias and standard error; bootstrapped confidence intervals had empirical coverage probability close to the nominal level across very different scenarios. The double cross-validation performed on the training data set closely mimicked the predictive accuracy of the selected models in external validation data. We also propose a useful visual representation of the expected survival probabilities using splines.
In the breast cancer trial, the adaptive lasso penalty selected a prediction model with 4 clinical covariates, the main effects of 98 biomarkers and 24 biomarker-by-treatment interactions, but there was high variability of the expected survival probabilities, with very large confidence intervals. Conclusion Based on our simulations, we propose a unified framework for: developing a prediction model with biomarker-by-treatment interactions in a high-dimensional setting and validating it in absence of external data; accurately estimating the expected survival probability of future patients with associated confidence intervals; and graphically visualizing the developed prediction model. All the methods are implemented in the R package biospear, publicly available on the CRAN. Electronic supplementary material The online version of this article (doi:10.1186/s12874-017-0354-0) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Nils Ternès
- Service de Biostatistique et d'Epidémiologie, Gustave Roussy, B2M, RdC.114 rue Edouard-Vaillant, 94805, Villejuif, France; CESP, Fac. de médecine - Univ. Paris-Sud, Fac. de médecine - UVSQ, INSERM, Université Paris-Saclay, Villejuif, 94805, France
- Federico Rotolo
- Service de Biostatistique et d'Epidémiologie, Gustave Roussy, B2M, RdC.114 rue Edouard-Vaillant, 94805, Villejuif, France; CESP, Fac. de médecine - Univ. Paris-Sud, Fac. de médecine - UVSQ, INSERM, Université Paris-Saclay, Villejuif, 94805, France
- Stefan Michiels
- Service de Biostatistique et d'Epidémiologie, Gustave Roussy, B2M, RdC.114 rue Edouard-Vaillant, 94805, Villejuif, France; CESP, Fac. de médecine - Univ. Paris-Sud, Fac. de médecine - UVSQ, INSERM, Université Paris-Saclay, Villejuif, 94805, France
27
Emura T, Nakatochi M, Matsui S, Michimae H, Rondeau V. Personalized dynamic prediction of death according to tumour progression and high-dimensional genetic factors: Meta-analysis with a joint model. Stat Methods Med Res 2017; 27:2842-2858. [PMID: 28090814 DOI: 10.1177/0962280216688032] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Indexed: 11/16/2022]
Abstract
Developing a personalized risk prediction model of death is fundamental for improving patient care and touches on the realm of personalized medicine. The increasing availability of genomic information and large-scale meta-analytic data sets for clinicians has motivated the extension of traditional survival prediction based on the Cox proportional hazards model. The aim of our paper is to develop a personalized risk prediction formula for death according to genetic factors and dynamic tumour progression status based on meta-analytic data. To this end, we extend the existing joint frailty-copula model to a model allowing for high-dimensional genetic factors. In addition, we propose a dynamic prediction formula to predict death given tumour progression events possibly occurring after treatment or surgery. For clinical use, we implement the computation software of the prediction formula in the joint.Cox R package. We also develop a tool to validate the performance of the prediction formula by assessing the prediction error. We illustrate the method with the meta-analysis of individual patient data on ovarian cancer patients.
Affiliation(s)
- Takeshi Emura
- Graduate Institute of Statistics, National Central University, Taoyuan City, Taiwan
- Masahiro Nakatochi
- Statistical Analysis Section, Center for Advanced Medicine and Clinical Research, Nagoya University Hospital, Nagoya, Japan
- Shigeyuki Matsui
- Department of Biostatistics, Nagoya University Graduate School of Medicine, Nagoya, Japan
- Hirofumi Michimae
- Department of Clinical Medicine (Biostatistics), School of Pharmacy, Kitasato University, Tokyo, Japan
- Virginie Rondeau
- INSERM CR1219 (Biostatistic), Université de Bordeaux, Bordeaux Cedex, France
28
De Bin R. Overview of Topics Related to Model Selection for Regression. TRENDS IN MATHEMATICS 2017:77-82. [DOI: 10.1007/978-3-319-55639-0_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
29
Tang Z, Shen Y, Zhang X, Yi N. The Spike-and-Slab Lasso Generalized Linear Models for Prediction and Associated Genes Detection. Genetics 2017; 205:77-88. [PMID: 27799277 PMCID: PMC5223525 DOI: 10.1534/genetics.116.192195] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 06/01/2016] [Accepted: 10/27/2016] [Indexed: 11/18/2022] Open
Abstract
Large-scale "omics" data have been increasingly used as an important resource for prognostic prediction of diseases and detection of associated genes. However, there are considerable challenges in analyzing high-dimensional molecular data, including the large number of potential molecular predictors, limited number of samples, and small effect of each predictor. We propose new Bayesian hierarchical generalized linear models, called spike-and-slab lasso GLMs, for prognostic prediction and detection of associated genes using large-scale molecular data. The proposed model employs a spike-and-slab mixture double-exponential prior for coefficients that can induce weak shrinkage on large coefficients, and strong shrinkage on irrelevant coefficients. We have developed a fast and stable algorithm to fit large-scale hierarchical GLMs by incorporating expectation-maximization (EM) steps into the fast cyclic coordinate descent algorithm. The proposed approach integrates nice features of two popular methods, i.e., penalized lasso and Bayesian spike-and-slab variable selection. The performance of the proposed method is assessed via extensive simulation studies. The results show that the proposed approach can provide not only more accurate estimates of the parameters, but also better prediction. We demonstrate the proposed procedure on two cancer data sets: a well-known breast cancer data set consisting of 295 tumors, and expression data of 4919 genes; and the ovarian cancer data set from TCGA with 362 tumors, and expression data of 5336 genes. Our analyses show that the proposed procedure can generate powerful models for predicting outcomes and detecting associated genes. The methods have been implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
Affiliation(s)
- Zaixiang Tang
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou 215123, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases and Center for Genetic Epidemiology and Genomics, Medical College of Soochow University, Suzhou 215123, China
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Alabama 35294
- Yueping Shen
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou 215123, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases and Center for Genetic Epidemiology and Genomics, Medical College of Soochow University, Suzhou 215123, China
- Xinyan Zhang
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Alabama 35294
- Nengjun Yi
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Alabama 35294
30
Zhang X, Li Y, Akinyemiju T, Ojesina AI, Buckhaults P, Liu N, Xu B, Yi N. Pathway-Structured Predictive Model for Cancer Survival Prediction: A Two-Stage Approach. Genetics 2017; 205:89-100. [PMID: 28049703 PMCID: PMC5223526 DOI: 10.1534/genetics.116.189191] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 03/14/2016] [Accepted: 10/31/2016] [Indexed: 12/11/2022] Open
Abstract
Heterogeneity in terms of tumor characteristics, prognosis, and survival among cancer patients has been a persistent problem for many decades. Currently, prognosis and outcome predictions are made based on clinical factors and/or by incorporating molecular profiling data. However, inaccurate prognosis and prediction may result by using only clinical or molecular information directly. One of the main shortcomings of past studies is the failure to incorporate prior biological information into the predictive model, given strong evidence of the pathway-based genetic nature of cancer, i.e., the potential for oncogenes to be grouped into pathways based on biological functions such as cell survival, proliferation, and metastatic dissemination. To address this problem, we propose a two-stage approach to incorporate pathway information into the prognostic modeling using large-scale gene expression data. In the first stage, we fit all predictors within each pathway using the penalized Cox model and Bayesian hierarchical Cox model. In the second stage, we combine the cross-validated prognostic scores of all pathways obtained in the first stage as new predictors to build an integrated prognostic model for prediction. We apply the proposed method to analyze two independent breast and ovarian cancer datasets from The Cancer Genome Atlas (TCGA), predicting overall survival using large-scale gene expression profiling data. The results from both datasets show that the proposed approach not only improves survival prediction compared with the alternative analyses that ignore the pathway information, but also identifies significant biological pathways.
Affiliation(s)
- Xinyan Zhang
- Department of Biostatistics, University of Alabama at Birmingham, Alabama 35294
- Yan Li
- Department of Biostatistics, University of Alabama at Birmingham, Alabama 35294
- Tomi Akinyemiju
- Department of Epidemiology, University of Alabama at Birmingham, Alabama 35294
- Akinyemi I Ojesina
- Department of Epidemiology, University of Alabama at Birmingham, Alabama 35294
- Phillip Buckhaults
- Department of Drug Discovery and Biomedical Sciences, The South Carolina College of Pharmacy, The University of South Carolina, Columbia, South Carolina 29208
- Nianjun Liu
- Department of Epidemiology and Biostatistics, School of Public Health, Indiana University, Bloomington, Indiana 47405
- Bo Xu
- Department of Oncology, Southern Research Institute, Birmingham, Alabama 35205
- Nengjun Yi
- Department of Biostatistics, University of Alabama at Birmingham, Alabama 35294
31
A two-stage approach for combining gene expression and mutation with clinical data improves survival prediction in myelodysplastic syndromes and ovarian cancer. JOURNAL OF BIOINFORMATICS AND GENOMICS 2016; 1. [PMID: 34377946 DOI: 10.18454/jbg.2016.1.1.2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 02/06/2023]
Abstract
Motivation Many traditional clinical prognostic factors have been known for cancer for years, but usually provide poor survival prediction. Genomic information is more easily available now, which offers opportunities to build more accurate prognostic models. The challenge is how to integrate them to improve survival prediction. The common approach of jointly analyzing all types of covariates directly in one single model may not improve the prediction due to increased model complexity and cannot be easily applied to different datasets. Results We proposed a two-stage procedure to better combine different sources of information for survival prediction, and applied the two-stage procedure in two cancer datasets: myelodysplastic syndromes (MDS) and ovarian cancer. Our analysis suggests that the prediction performance of different data types is very different, and combining clinical, gene expression and mutation data using the two-stage procedure improves survival prediction in terms of improved concordance index and reduced prediction error. Availability and implementation The two-stage procedure can be implemented in BhGLM package which is freely available at http://www.ssg.uab.edu/bhglm/. Contact nyi@uab.edu.
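The concordance index mentioned in this abstract can be illustrated with a short sketch. The code below implements Harrell's C, a common definition of the concordance index; the abstract does not say which variant the authors used, so this choice is an assumption. The function name `concordance_index` and the toy data are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs ranked correctly by risk.

    times       -- observed follow-up times
    events      -- 1 if the event (e.g. death) was observed, 0 if censored
    risk_scores -- higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event;
        # pairs with tied times are skipped in this simplified sketch.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (invented): four patients, the second one is censored.
toy_times = [5, 8, 12, 20]
toy_events = [1, 0, 1, 1]
toy_risk = [0.9, 0.4, 0.1, 0.6]
c_index = concordance_index(toy_times, toy_events, toy_risk)
print(c_index)  # 3 of 4 comparable pairs are concordant -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so an "improved concordance index" means the combined model orders patients by risk more accurately.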
32
Diagnostic and prognostic potential of circulating cell-free genomic and mitochondrial DNA fragments in clear cell renal cell carcinoma patients. Clin Chim Acta 2016; 452:109-19. [DOI: 10.1016/j.cca.2015.11.009] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Received: 10/22/2015] [Revised: 11/09/2015] [Accepted: 11/09/2015] [Indexed: 01/05/2023]
33
Zucknick M, Saadati M, Benner A. Nonidentical twins: Comparison of frequentist and Bayesian lasso for Cox models. Biom J 2015; 57:959-81. [PMID: 26417963 DOI: 10.1002/bimj.201400160] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/15/2014] [Revised: 05/12/2015] [Accepted: 05/12/2015] [Indexed: 11/07/2022]
Abstract
One important task in translational cancer research is the search for new prognostic biomarkers to improve survival prognosis for patients. The use of high-throughput technologies allows simultaneous measurement of genome-wide gene expression or other genomic data for all patients in a clinical trial. Penalized likelihood methods such as lasso regression can be applied to such high-dimensional data, where the number of (genomic) covariables is usually much larger than the sample size. There is a connection between the lasso and the Bayesian regression model with independent Laplace priors on the regression parameters, and understanding this connection has been useful for understanding the properties of lasso estimates in linear models (e.g. Park and Casella, 2008). In this paper, we study the lasso in the frequentist and Bayesian frameworks in the context of Cox models. For the Bayesian lasso we extend the approach by Lee et al. (2011). In particular, we impose the lasso penalty only on the genome features, but not on relevant clinical covariates, to allow the mandatory inclusion of important established factors. We investigate the models in high- and low-dimensional simulation settings and in an application to chronic lymphocytic leukemia.
Affiliation(s)
- Manuela Zucknick
- Division of Biostatistics, German Cancer Research Center, Heidelberg 69120, Germany; Oslo Center for Biostatistics and Epidemiology, Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, PO Box 1122 Blindern, 0317 Oslo, Norway
- Maral Saadati
- Division of Biostatistics, German Cancer Research Center, Heidelberg 69120, Germany
- Axel Benner
- Division of Biostatistics, German Cancer Research Center, Heidelberg 69120, Germany
34
Supervised wavelet method to predict patient survival from gene expression data. ScientificWorldJournal 2014; 2014:618412. [PMID: 25538955 PMCID: PMC4235600 DOI: 10.1155/2014/618412] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/28/2014] [Accepted: 10/03/2014] [Indexed: 11/18/2022] Open
Abstract
In microarray studies, the number of samples is relatively small compared to the number of genes per sample. An important aspect of microarray studies is the prediction of patient survival based on their gene expression profile. This naturally calls for the use of a dimension reduction procedure together with the survival prediction model. In this study, a new method based on combining wavelet approximation coefficients and Cox regression was presented. The proposed method was compared with supervised principal component and supervised partial least squares methods. The different fitted Cox models based on supervised wavelet approximation coefficients, the top number of supervised principal components, and partial least squares components were applied to the data. The results showed that the prediction performance of the Cox model based on supervised wavelet feature extraction was superior to the supervised principal components and partial least squares components. The results suggested the possibility of developing new tools based on wavelets for the dimensionality reduction of microarray data sets in the context of survival analysis.
35
De Bin R, Herold T, Boulesteix AL. Added predictive value of omics data: specific issues related to validation illustrated by two case studies. BMC Med Res Methodol 2014; 14:117. [PMID: 25352096 PMCID: PMC4271356 DOI: 10.1186/1471-2288-14-117] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 06/06/2014] [Accepted: 09/18/2014] [Indexed: 01/06/2023] Open
Abstract
Background In the last years, the importance of independent validation of the prediction ability of a new gene signature has been largely recognized. Recently, with the development of gene signatures which integrate rather than replace the clinical predictors in the prediction rule, the focus has been moved to the validation of the added predictive value of a gene signature, i.e. to the verification that the inclusion of the new gene signature in a prediction model is able to improve its prediction ability. Methods The high-dimensional nature of the data from which a new signature is derived raises challenging issues and necessitates the modification of classical methods to adapt them to this framework. Here we show how to validate the added predictive value of a signature derived from high-dimensional data and critically discuss the impact of the choice of methods on the results. Results The analysis of the added predictive value of two gene signatures developed in two recent studies on the survival of leukemia patients allows us to illustrate and empirically compare different validation techniques in the high-dimensional framework. Conclusions The issues related to the high-dimensional nature of the omics predictors space affect the validation process. An analysis procedure based on repeated cross-validation is suggested.
Affiliation(s)
- Riccardo De Bin
- Department of Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-Universität, Marchioninistr, 15, 81377 München, Germany.
36
Identification of a prognostic signature for old-age mortality by integrating genome-wide transcriptomic data with the conventional predictors: the Vitality 90+ Study. BMC Med Genomics 2014; 7:54. [PMID: 25213707 PMCID: PMC4167306 DOI: 10.1186/1755-8794-7-54] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 06/12/2014] [Accepted: 09/08/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Prediction models for old-age mortality have generally relied upon conventional markers such as plasma-based factors and biophysiological characteristics. However, it is unknown whether the existing markers are able to provide the most relevant information in terms of old-age survival or whether predictions could be improved through the integration of whole-genome expression profiles. METHODS We assessed the predictive abilities of survival models containing only conventional markers, only gene expression data or both types of data together in a Vitality 90+ study cohort consisting of n = 151 nonagenarians. The all-cause death rate was 32.5% (49 of 151 individuals), and the median follow-up time was 2.55 years. RESULTS Three different feature selection models, the penalized Lasso and Ridge regressions and the C-index boosting algorithm, were used to test the genomic data. The Ridge regression model incorporating both the conventional markers and transcripts outperformed the other models. The multivariate Cox regression model was used to adjust for the conventional mortality prediction markers, i.e., the body mass index, frailty index and cell-free DNA level, revealing that 331 transcripts were independently associated with survival. The final mortality-predicting transcriptomic signature derived from the Ridge regression model was mapped to a network that identified nuclear factor kappa beta (NF-κB) as a central node. CONCLUSIONS Together with the loss of physiological reserves, the transcriptomic predictors centered around NF-κB underscored the role of immunoinflammatory signaling, the control of the DNA damage response and cell cycle, and mitochondrial functions as the key determinants of old-age mortality.
37
De Bin R, Sauerbrei W, Boulesteix AL. Investigating the prediction ability of survival models based on both clinical and omics data: two case studies. Stat Med 2014; 33:5310-29. [PMID: 25042390 DOI: 10.1002/sim.6246] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 10/31/2013] [Revised: 04/22/2014] [Accepted: 05/31/2014] [Indexed: 12/25/2022]
Abstract
In biomedical literature, numerous prediction models for clinical outcomes have been developed based either on clinical data or, more recently, on high-throughput molecular data (omics data). Prediction models based on both types of data, however, are less common, although some recent studies suggest that a suitable combination of clinical and molecular information may lead to models with better predictive abilities. This is probably due to the fact that it is not straightforward to combine data with different characteristics and dimensions (poorly characterized high-dimensional omics data, well-investigated low-dimensional clinical data). In this paper, we analyze two publicly available datasets related to breast cancer and neuroblastoma, respectively, in order to show some possible ways to combine clinical and omics data into a prediction model of time-to-event outcome. Different strategies and statistical methods are exploited. The results are compared and discussed according to different criteria, including the discriminative ability of the models, computed on a validation dataset.
Affiliation(s)
- Riccardo De Bin
- Department of Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-Universität of Munich, Germany
38
Sariyar M, Hoffmann I, Binder H. Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data. BMC Bioinformatics 2014; 15:58. [PMID: 24571520 PMCID: PMC3945780 DOI: 10.1186/1471-2105-15-58] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 10/22/2013] [Accepted: 01/28/2014] [Indexed: 11/23/2022] Open
Abstract
Background Molecular data, e.g. arising from microarray technology, is often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions. Results We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is interactions composed of variables that do not represent main effects, but our findings are also promising in this regard. Results on real world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. Conclusion Screening interactions through random forests is feasible and useful, when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.
Affiliation(s)
- Murat Sariyar
- Institute of Medical Biostatistics, Epidemiology and Informatics, Medical Center of the Johannes Gutenberg University, Mainz 55131, Germany.
39
Chen HC, Chen JJ. Assessment of reproducibility of cancer survival risk predictions across medical centers. BMC Med Res Methodol 2013; 13:25. [PMID: 23425000 PMCID: PMC3598915 DOI: 10.1186/1471-2288-13-25] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/08/2013] [Accepted: 02/13/2013] [Indexed: 04/09/2023] Open
Abstract
BACKGROUND The two most important considerations in the evaluation of survival prediction models are 1) predictability - the ability to predict survival risks accurately and 2) reproducibility - the ability to generalize to predict samples generated from different studies. We present approaches for assessment of reproducibility of survival risk score predictions across medical centers. METHODS Reproducibility was evaluated in terms of consistency and transferability. Consistency is the agreement of risk scores predicted between two centers. Transferability from one center to another center is the agreement of the risk scores of the second center predicted by each of the two centers. The transferability can be: 1) model transferability - whether a predictive model developed from one center can be applied to predict the samples generated from other centers and 2) signature transferability - whether signature markers of a predictive model developed from one center can be applied to predict the samples from other centers. We considered eight prediction models, including two clinical models, two gene expression models, and their combinations. Predictive performance of the eight models was evaluated by several common measures. Correlation coefficients between predicted risk scores of different centers were computed to assess reproducibility - consistency and transferability. RESULTS Two public datasets, the lung cancer data generated from four medical centers and colon cancer data generated from two medical centers, were analyzed. The risk score estimates for lung cancer patients predicted by three of four centers agree reasonably well. In general, a good prediction model showed better cross-center consistency and transferability. The risk scores for the colon cancer patients from one (Moffitt) medical center that were predicted by the clinical models developed from the other (Vanderbilt) medical center were shown to have excellent model transferability and signature transferability.
CONCLUSIONS This study illustrates an analytical approach to assessing reproducibility of predictive models and signatures. Based on the analyses of the two cancer datasets, we conclude that the models with clinical variables appear to perform reasonably well with a high degree of consistency and transferability. More investigation is needed on the reproducibility of prediction models that include gene expression data across studies.
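The transferability measure described in this abstract (agreement between the risk scores that two centers' models assign to the same patients) can be sketched as follows. This is only a minimal illustration under stated assumptions: the risk score is taken to be a linear predictor, agreement is measured with the Pearson correlation coefficient, and all coefficients and patient values are invented. The abstract does not specify which correlation coefficient or model form the authors used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def risk_scores(coefs, samples):
    """Linear risk score (sum of coefficient * covariate) for each sample."""
    return [sum(c * v for c, v in zip(coefs, row)) for row in samples]


# Hypothetical coefficients of models fitted separately at centers A and B.
coefs_center_a = [0.8, -0.5, 0.3]
coefs_center_b = [0.6, -0.7, 0.2]

# Hypothetical covariates of four patients from center B.
center_b_patients = [
    [1.0, 0.2, 0.5],
    [0.3, 1.1, 0.0],
    [0.9, 0.4, 1.2],
    [0.1, 0.8, 0.7],
]

# Score the same center-B patients with both models and compare the scores.
scores_from_a = risk_scores(coefs_center_a, center_b_patients)
scores_from_b = risk_scores(coefs_center_b, center_b_patients)
transferability = pearson(scores_from_a, scores_from_b)
print(round(transferability, 3))
```

A correlation close to 1 would indicate that the model developed at center A ranks center B's patients almost the same way as center B's own model does.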
Affiliation(s)
- Hung-Chia Chen
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
40
Wang YK, Print CG, Crampin EJ. Biclustering reveals breast cancer tumour subgroups with common clinical features and improves prediction of disease recurrence. BMC Genomics 2013; 14:102. [PMID: 23405961 PMCID: PMC3598775 DOI: 10.1186/1471-2164-14-102] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 06/06/2012] [Accepted: 02/05/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Many studies have revealed correlations between breast tumour phenotypes, variations in gene expression, and patient survival outcomes. The molecular heterogeneity between breast tumours revealed by these studies has allowed prediction of prognosis and has underpinned stratified therapy, where groups of patients with particular tumour types receive specific treatments. The molecular tests used to predict prognosis and stratify treatment usually utilise fixed sets of genomic biomarkers, with the same biomarker sets being used to test all patients. In this paper we suggest that instead of fixed sets of genomic biomarkers, it may be more effective to use a stratified biomarker approach, where optimal biomarker sets are automatically chosen for particular patient groups, analogous to the choice of optimal treatments for groups of similar patients in stratified therapy. We illustrate the effectiveness of a biclustering approach to select optimal gene sets for determining the prognosis of specific strata of patients, based on potentially overlapping, non-discrete molecular characteristics of tumours. RESULTS Biclustering identified tightly co-expressed gene sets in the tumours of restricted subgroups of breast cancer patients. The co-expressed genes in these biclusters were significantly enriched for particular biological annotations and gene regulatory modules associated with breast cancer biology. Tumours identified within the same bicluster were more likely to present with similar clinical features. Bicluster membership combined with clinical information could predict patient prognosis in conditional inference tree and ridge regression class prediction models. CONCLUSIONS The increasing clinical use of genomic profiling demands identification of more effective methods to segregate patients into prognostic and treatment groups. We have shown that biclustering can be used to select optimal gene sets for determining the prognosis of specific strata of patients.
Affiliation(s)
- Yi Kan Wang
- Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand
- Cristin G Print
- Department of Molecular Medicine and Pathology, University of Auckland, Auckland, New Zealand
- New Zealand Bioinformatics Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
- Edmund J Crampin
- Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
- Department of Engineering Science, University of Auckland, Auckland, New Zealand
- Melbourne School of Engineering, University of Melbourne, Victoria, Australia
|
41
|
Diagnostic and prognostic potential of differentially expressed miRNAs between metastatic and non-metastatic renal cell carcinoma at the time of nephrectomy. Clin Chim Acta 2012. [PMID: 23178446 DOI: 10.1016/j.cca.2012.11.010] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Indexed: 11/21/2022]
Abstract
BACKGROUND MicroRNAs are promising diagnostic and prognostic biomarkers in oncology. We aimed to evaluate the prognostic potential of selected microRNAs in primary clear cell renal cell carcinomas (ccRCC) as predictors of tumor recurrence after radical nephrectomy. METHODS miR-122, miR-141, miR-155, miR-184, miR-200c, miR-210, miR-224, and miR-514, validated as differentially expressed in a previous study, were measured by RT-PCR in matched malignant and non-malignant tumor samples after nephrectomy from 111 patients (89 without, 22 with metastases) and clinicopathological and outcome data were collected. Non-parametric statistical tests, receiver-operating characteristics, Kaplan-Meier-, and univariate as well as multivariate Cox regression analyses were performed. RESULTS Downregulation of miR-141/-184/-200c/-514 and upregulation of miR-122/-155/-210/-224 were not different between samples of non-metastatic and metastatic tumors except for miR-122 and miR-514. miR-514 was further downregulated in metastatic compared with non-metastatic tumors while the upregulation of miR-122 was significantly reduced in metastatic carcinomas. All miRNAs were suitable to discriminate malignant from non-malignant tissue. miR-122 and miR-514 were significantly related to the recurrence risk but only miR-514 provided independent prognostic information in the final model including relevant clinicopathological variables. CONCLUSIONS MiR-122 and miR-514 play a role in tumor recurrence after nephrectomy. Expression of miR-514 was particularly downregulated in primary metastatic tumor and those that recur and might be a suitable adjunct marker for predicting tumor recurrence.
|
42
|
Pang H, George SL, Hui K, Tong T. Gene selection using iterative feature elimination random forests for survival outcomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1422-31. [PMID: 22547432 PMCID: PMC3495190 DOI: 10.1109/tcbb.2012.63] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Indexed: 05/20/2023]
Abstract
Although many feature selection methods for classification have been developed, there is a need to identify genes in high-dimensional data with censored survival outcomes. Traditional methods for gene selection in classification problems have several drawbacks. First, the majority of the gene selection approaches for classification are single-gene based. Second, many of the gene selection procedures are not embedded within the algorithm itself. The technique of random forests has been found to perform well in high-dimensional data settings with survival outcomes. It also has an embedded feature to identify variables of importance. Therefore, it is an ideal candidate for gene selection in high-dimensional data with survival outcomes. In this paper, we develop a novel method based on random forests to identify a set of prognostic genes. We compare our method with several machine learning methods and various node split criteria using several real data sets. Our method performed well in both simulations and real data analysis. Additionally, we have shown the advantages of our approach over single-gene-based approaches. Our method incorporates multivariate correlations in microarray data for survival outcomes. The described method allows us to better utilize the information available from microarray data with survival outcomes.
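The abstract does not spell out the elimination procedure, so the following is only a generic sketch of iterative feature elimination, not the authors' algorithm. It assumes two caller-supplied functions, both hypothetical: `importance(subset)` returns a per-feature importance (in a random forest setting this could be the forest's variable importance), and `error(subset)` returns a prediction error for a model refitted on that subset (for example an out-of-bag error). The `drop_fraction` parameter is likewise an illustrative choice.

```python
def iterative_feature_elimination(features, importance, error,
                                  drop_fraction=0.2, min_features=1):
    """Repeatedly drop the least important features and keep the best subset.

    importance(subset) -> dict mapping feature -> importance (hypothetical hook)
    error(subset)      -> prediction error of a model refitted on subset (hypothetical hook)
    """
    current = list(features)
    best_subset, best_error = list(current), error(current)
    while len(current) > min_features:
        scores = importance(current)
        # Rank from most to least important, then cut the tail.
        ranked = sorted(current, key=lambda f: scores[f], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        current = ranked[:max(min_features, len(ranked) - n_drop)]
        err = error(current)
        if err <= best_error:  # prefer the smaller subset on ties
            best_subset, best_error = list(current), err
    return best_subset, best_error


# Toy usage with made-up importances and an error that is lowest for two genes.
toy_importance = {"g1": 5, "g2": 4, "g3": 3, "g4": 2, "g5": 1}
subset, err = iterative_feature_elimination(
    list(toy_importance),
    importance=lambda s: {f: toy_importance[f] for f in s},
    error=lambda s: abs(len(s) - 2),
)
print(subset, err)
```

Because the importance is recomputed on every pass, features are ranked in the context of the genes that remain, which is one way a multivariate method can differ from single-gene ranking.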
Collapse
Affiliation(s)
- Herbert Pang
- Biostatistics and Bioinformatics Department, Duke University School of Medicine, Durham, NC 27705.
- Stephen L. George
- Biostatistics and Bioinformatics Department, Duke University School of Medicine, Durham, NC 27705.
- Ken Hui
- School of Medicine, Yale University, New Haven, CT 06510.
- Tiejun Tong
- Mathematics Department, Hong Kong Baptist University, Hong Kong SAR, China.
|
43
|
Choi I, Kattan MW, Wells BJ, Yu C. A hybrid approach to survival model building using integration of clinical and molecular information in censored data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1091-1105. [PMID: 22350208 DOI: 10.1109/tcbb.2012.31] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/31/2023]
Abstract
In the medical community, prognostic models that use clinicopathologic features to predict prognosis after a certain treatment have been externally validated and used in practice. In recent years, most research has focused on high-dimensional genomic data with small sample sizes. Since clinically similar but molecularly heterogeneous tumors may produce different clinical outcomes, the combination of clinical and genomic information, which may be complementary, is crucial to improve the quality of prognostic predictions. However, there is a lack of an integrating scheme for clinico-genomic models due to the P ≥ N problem, in particular for a parsimonious model. We propose a methodology to build a reduced yet accurate integrative model using a hybrid approach based on the Cox regression model, which uses several dimension reduction techniques, L₂ penalized maximum likelihood estimation (PMLE), and resampling methods to tackle the problem. The predictive accuracy of the modeling approach is assessed by several metrics via an independent and thorough scheme to compare competing methods. In breast cancer data studies on a metastasis and death event, we show that the proposed methodology can improve prediction accuracy and build a final model with a hybrid signature that is parsimonious when integrating both types of variables.
Collapse
Affiliation(s)
- Ickwon Choi
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, 16000 Terrace Rd #503, Cleveland, OH 44112, USA.
|
44
|
Khoshhali M, Mahjub H, Saidijam M, Poorolajal J, Soltanian AR. Predicting the survival time for diffuse large B-cell lymphoma using microarray data. J Mol Genet Med 2012; 6:287-92. [PMID: 23173013 PMCID: PMC3410377 DOI: 10.4172/1747-0862.1000051] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 09/08/2011] [Revised: 04/26/2012] [Accepted: 04/30/2012] [Indexed: 11/25/2022] Open
Abstract
The present study was conducted to predict survival time in patients with diffuse large B-cell lymphoma (DLBCL) based on microarray data, using the Cox regression model combined with seven dimension reduction methods. This historical cohort included 2042 gene expression measurements from 40 patients with DLBCL. To predict survival, the Cox regression model was combined with seven methods for dimension reduction or shrinkage: univariate selection, forward stepwise selection, principal component regression, supervised principal component regression, partial least squares regression, ridge regression and the lasso. The capacity of predictions was examined by three different criteria: log rank test, prognostic index and deviance. MATLAB r2008a and RKWard software were used for data analysis. Based on our findings, the performance of ridge regression was better than that of the other methods. Based on ridge regression coefficients and a given cut point value, 16 genes were selected. Using the forward stepwise selection method in the Cox regression model, it was indicated that the expression of genes GENE3555X and GENE3807X decreased the survival time (P=0.008 and P=0.003, respectively), whereas the genes GENE3228X and GENE1551X increased survival time (P=0.002 and P<0.001, respectively). This study indicated that the ridge regression method had higher capacity than other dimension reduction methods for the prediction of survival time in patients with DLBCL. Furthermore, a combination of statistical methods and microarray data could help to detect influential genes in survival.
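As a rough illustration of what ridge regression means in the Cox setting, the sketch below evaluates a ridge-penalized Cox partial log-likelihood in plain Python. It is not the study's MATLAB or RKWard code. The function name, the `lam` parameter and the factor 1/2 in the penalty are assumptions made for this example, and tied event times are handled in the simple Breslow manner.

```python
import math


def ridge_cox_loglik(times, events, X, beta, lam):
    """Cox partial log-likelihood minus an L2 (ridge) penalty.

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    X      : one covariate row (list) per patient
    beta   : coefficient vector
    lam    : penalty weight (hypothetical name; the 1/2 factor is one common convention)
    """
    # Linear predictor (log relative hazard) for every patient.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        denom = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        loglik += eta[i] - math.log(denom)
    penalty = 0.5 * lam * sum(b * b for b in beta)
    return loglik - penalty


# Toy data: four patients, one gene.
times = [5, 3, 8, 2]
events = [1, 0, 1, 1]
X = [[0.5], [1.0], [-0.3], [0.2]]
print(ridge_cox_loglik(times, events, X, beta=[0.0], lam=1.0))
```

Maximizing this function over `beta` shrinks all coefficients toward zero without setting any of them exactly to zero, which is why a cut point on the coefficients is needed afterwards to select a gene list.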
Collapse
Affiliation(s)
- Mehri Khoshhali
- Department of Biostatistics & Epidemiology, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
|
45
|
Folkersen L, Persson J, Ekstrand J, Agardh HE, Hansson GK, Gabrielsen A, Hedin U, Paulsson-Berne G. Prediction of ischemic events on the basis of transcriptomic and genomic profiling in patients undergoing carotid endarterectomy. Mol Med 2012; 18:669-75. [PMID: 22371308 DOI: 10.2119/molmed.2011.00479] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.8] [Received: 12/08/2011] [Accepted: 02/23/2012] [Indexed: 12/16/2022] Open
Abstract
Classic risk factors, including age, smoking, serum cholesterol, diabetes and blood pressure, constitute the basis of present risk prediction models but fail to identify all individuals at risk. The objective of this study was to investigate if genomic and transcriptional patterns improve prediction of ischemic events in patients with established carotid artery disease. Genotype and gene expression profiles were obtained from carotid plaque tissue (n = 126) and peripheral blood mononuclear cells (n = 97) of patients undergoing carotid endarterectomy. Patients were followed for an average of 44 months, and 25 ischemic events occurred (18 ischemic strokes and 7 myocardial infarctions). Blinded leave-one-out cross-validation on Cox regression coefficients was used to assign gene expression-based risk scores to each patient. When compared with classic risk factors, addition of carotid plaque gene expression-based risk score improved the prediction of future ischemic events from an area under the curve (AUC) of 0.66 to an AUC of 0.79. The inclusion of gene expression risk score from peripheral blood mononuclear cells or from 25 established myocardial infarction risk single nucleotide polymorphisms only exhibited marginal effects on the prediction of ischemic events. Prediction of ischemic events is improved by inclusion of gene expression profiling from carotid endarterectomy tissue compared with prediction on the basis of classic risk markers alone in patients with atherosclerosis. The method may be developed to identify subjects at very high risk of ischemic events.
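The leave-one-out idea described above can be sketched as follows: each patient's risk score comes from a model fitted without that patient, and the scores are then summarized by an AUC. This is a simplified sketch, not the study's pipeline. The outcome is treated as binary (event or no event, ignoring follow-up time), and `fit_mean_difference` is a hypothetical stand-in for the Cox regression fit; all function names and the toy data are invented for illustration.

```python
def leave_one_out_scores(rows, outcomes, fit, score):
    """Score each patient with a model fitted on all the other patients."""
    scores = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_out = outcomes[:i] + outcomes[i + 1:]
        model = fit(train_rows, train_out)  # never sees patient i
        scores.append(score(model, rows[i]))
    return scores


def auc(scores, outcomes):
    """Probability that a patient with an event outscores one without (ties count 1/2)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference(rows, outcomes):
    """Hypothetical stand-in for a Cox fit: weight = mean(event) - mean(no event)."""
    def mean(col, label):
        vals = [r[col] for r, y in zip(rows, outcomes) if y == label]
        return sum(vals) / len(vals)
    return [mean(c, 1) - mean(c, 0) for c in range(len(rows[0]))]


def linear_score(weights, row):
    return sum(w * x for w, x in zip(weights, row))


# Toy expression data: six patients, two genes, three events.
rows = [[2.0, 0.1], [1.8, 0.3], [0.2, 0.2], [0.1, 0.4], [1.9, 0.2], [0.3, 0.1]]
outcomes = [1, 1, 0, 0, 1, 0]
loo = leave_one_out_scores(rows, outcomes, fit_mean_difference, linear_score)
print(auc(loo, outcomes))
```

Keeping the held-out patient out of the fit is what makes the resulting AUC an honest estimate; scoring patients with a model that has already seen them would inflate it.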
Collapse
Affiliation(s)
- Lasse Folkersen
- Center for Molecular Medicine, Department of Medicine, Karolinska University Hospital, Karolinska Institutet, Stockholm, Sweden.
|
46
|
Kammers K, Lang M, Hengstler JG, Schmidt M, Rahnenführer J. Survival models with preclustered gene groups as covariates. BMC Bioinformatics 2011; 12:478. [PMID: 22177110 PMCID: PMC3377939 DOI: 10.1186/1471-2105-12-478] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 06/20/2011] [Accepted: 12/16/2011] [Indexed: 11/22/2022] Open
Abstract
Background An important application of high dimensional gene expression measurements is the risk prediction and the interpretation of the variables in the resulting survival models. A major problem in this context is the typically large number of genes compared to the number of observations (individuals). Feature selection procedures can generate predictive models with high prediction accuracy and at the same time low model complexity. However, interpretability of the resulting models is still limited due to little knowledge on many of the remaining selected genes. Thus, we summarize genes as gene groups defined by the hierarchically structured Gene Ontology (GO) and include these gene groups as covariates in the hazard regression models. Since expression profiles within GO groups are often heterogeneous, we present a new method to obtain subgroups with coherent patterns. We apply preclustering to genes within GO groups according to the correlation of their gene expression measurements. Results We compare Cox models for modeling disease free survival times of breast cancer patients. Besides classical clinical covariates we consider genes, GO groups and preclustered GO groups as additional genomic covariates. Survival models with preclustered gene groups as covariates have similar prediction accuracy as models built only with single genes or GO groups. Conclusions The preclustering information enables a more detailed analysis of the biological meaning of covariates selected in the final models. Compared to models built only with single genes there is additional functional information contained in the GO annotation, and compared to models using GO groups as covariates the preclustering yields coherent representative gene expression profiles.
Collapse
Affiliation(s)
- Kai Kammers
- Department of Statistics, TU Dortmund University, Dortmund, Germany.
|
47
|
Obulkasim A, Meijer GA, van de Wiel MA. Stepwise classification of cancer samples using clinical and molecular data. BMC Bioinformatics 2011; 12:422. [PMID: 22034839 PMCID: PMC3221726 DOI: 10.1186/1471-2105-12-422] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 05/27/2011] [Accepted: 10/28/2011] [Indexed: 11/10/2022] Open
Abstract
Background Combining clinical and molecular data types may potentially improve prediction accuracy of a classifier. However, currently there is a shortage of effective and efficient statistical and bioinformatic tools for true integrative data analysis. Existing integrative classifiers have two main disadvantages: First, coarse combination may cause subtle contributions of one data type to be overshadowed by more obvious contributions of the other. Second, the need to measure both data types for all patients may be both impractical and (cost) inefficient. Results We introduce a novel classification method, a stepwise classifier, which takes advantage of the distinct classification power of clinical data and high-dimensional molecular data. We apply classification algorithms to two data types independently, starting with the traditional clinical risk factors. We only turn to relatively expensive molecular data when the uncertainty of the prediction result from clinical data exceeds a predefined limit. Experimental results show that our approach is adaptive: the proportion of samples that needs to be re-classified using molecular data depends on how much we expect the predictive accuracy to increase when re-classifying those samples. Conclusions Our method renders a more cost-efficient classifier that is at least as good as, and sometimes better than, one based on clinical or molecular data alone. Hence our approach is not just a classifier that minimizes a particular loss function. Instead, it aims to be cost-efficient by avoiding molecular tests for a potentially large subgroup of individuals; moreover, for these individuals a test result would be quickly available, which may lead to reduced waiting times (for diagnosis) and hence lower the patients' distress. Stepwise classification is implemented in R-package stepwiseCM and available at the Bioconductor website.
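The decision rule described in the Results can be sketched in a few lines. This is a minimal illustration, not the stepwiseCM implementation: the uncertainty measure (distance of the clinical class probability from 0.5), the `limit` parameter and the function name are all assumptions introduced here. The molecular classifier is passed as a callable so that the expensive test is only "run" when the clinical prediction is too uncertain.

```python
def stepwise_predict(clinical_prob, molecular_predict, limit=0.3):
    """Return (label, source) for one patient.

    clinical_prob     : P(class 1) from the clinical classifier
    molecular_predict : zero-argument callable returning the molecular
                        classifier's label; only called when needed
    limit             : maximum tolerated uncertainty (hypothetical parameter)
    """
    # Assumed uncertainty measure: 0 = fully certain, 1 = coin flip.
    uncertainty = 1.0 - abs(2.0 * clinical_prob - 1.0)
    if uncertainty <= limit:
        return (1 if clinical_prob >= 0.5 else 0), "clinical"
    return molecular_predict(), "molecular"


# A confident clinical prediction never triggers the molecular test.
print(stepwise_predict(0.95, lambda: 0))
# An ambiguous one falls through to the molecular classifier.
print(stepwise_predict(0.55, lambda: 0))
```

Raising `limit` sends fewer patients to the molecular step, which is the cost and accuracy trade-off the abstract refers to.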
Collapse
Affiliation(s)
- Askar Obulkasim
- Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands.
|
48
|
Binder H, Porzelius C, Schumacher M. An overview of techniques for linking high-dimensional molecular data to time-to-event endpoints by risk prediction models. Biom J 2011; 53:170-89. [PMID: 21328602 DOI: 10.1002/bimj.201000152] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 07/26/2010] [Revised: 12/22/2010] [Accepted: 12/23/2010] [Indexed: 11/07/2022]
Abstract
Analysis of molecular data promises identification of biomarkers for improving prognostic models, thus potentially enabling better patient management. For identifying such biomarkers, risk prediction models can be employed that link high-dimensional molecular covariate data to a clinical endpoint. In low-dimensional settings, a multitude of statistical techniques already exists for building such models, e.g. allowing for variable selection or for quantifying the added value of a new biomarker. We provide an overview of techniques for regularized estimation that transfer this toward high-dimensional settings, with a focus on models for time-to-event endpoints. Techniques for incorporating specific covariate structure are discussed, as well as techniques for dealing with more complex endpoints. Employing gene expression data from patients with diffuse large B-cell lymphoma, some typical modeling issues from low-dimensional settings are illustrated in a high-dimensional application. First, the performance of classical stepwise regression is compared to stage-wise regression, as implemented by a component-wise likelihood-based boosting approach. A second issue arises when artificially transforming the response into a binary variable. The effects of the resulting loss of efficiency and potential bias in a high-dimensional setting are illustrated, and a link to competing risks models is provided. Finally, we discuss conditions for adequately quantifying the added value of high-dimensional gene expression measurements, both at the stage of model fitting and when performing evaluation.
Collapse
Affiliation(s)
- Harald Binder
- Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Stefan-Meier-Str. 26, 79104 Freiburg, Germany.
|
49
|
Simon RM, Subramanian J, Li MC, Menezes S. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Brief Bioinform 2011; 12:203-14. [PMID: 21324971 DOI: 10.1093/bib/bbr001] [Citation(s) in RCA: 146] [Impact Index Per Article: 10.4] [Indexed: 11/12/2022] Open
Abstract
Developments in whole genome biotechnology have stimulated statistical focus on prediction methods. We review here methodology for classifying patients into survival risk groups and for using cross-validation to evaluate such classifications. Measures of discrimination for survival risk models include separation of survival curves, time-dependent ROC curves and Harrell's concordance index. For high-dimensional data applications, however, computing these measures as re-substitution statistics on the same data used for model development results in highly biased estimates. Most developments in methodology for survival risk modeling with high-dimensional data have utilized separate test data sets for model evaluation. Cross-validation has sometimes been used for optimization of tuning parameters. In many applications, however, the data available are too limited for effective division into training and test sets and consequently authors have often either reported re-substitution statistics or analyzed their data using binary classification methods in order to utilize familiar cross-validation. In this article we have tried to indicate how to utilize cross-validation for the evaluation of survival risk models; specifically how to compute cross-validated estimates of survival distributions for predicted risk groups and how to compute cross-validated time-dependent ROC curves. We have also discussed evaluation of the statistical significance of a survival risk model and evaluation of whether high-dimensional genomic data adds predictive accuracy to a model based on standard covariates alone.
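Of the discrimination measures listed above, Harrell's concordance index is the simplest to write down. The sketch below is a basic version in plain Python, under the assumption that pairs with tied times are simply skipped; the function name and toy data are illustrative only. In line with the article's warning about re-substitution bias, the `risk` values passed in should be cross-validated predictions, not predictions from a model fitted on the same patients.

```python
def harrell_c(times, events, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time. The pair is concordant
    when the earlier failure also has the higher predicted risk.
    Tied risk scores count 1/2; tied times are skipped (simplification).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(harrell_c(times, events, [0.9, 0.7, 0.5, 0.1]))  # risk ordered with failure time
print(harrell_c(times, events, [0.5, 0.5, 0.5, 0.5]))  # uninformative risk score
```

A value near 0.5 means the risk score orders patients no better than chance, and a value near 1 means earlier failures almost always carry higher predicted risk.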
Collapse
Affiliation(s)
- Richard M Simon
- Biometric Research Branch, US National Cancer Institute, Bethesda, MD 20892-7434, USA.
|
50
|
Bøvelstad HM, Borgan O. Assessment of evaluation criteria for survival prediction from genomic data. Biom J 2011; 53:202-16. [PMID: 21308723 DOI: 10.1002/bimj.201000048] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 03/03/2010] [Revised: 10/13/2010] [Accepted: 11/11/2010] [Indexed: 11/10/2022]
Abstract
Survival prediction from high-dimensional genomic data is dependent on a proper regularization method. With an increasing number of such methods proposed in the literature, comparative studies are called for and some have been performed. However, there is currently no consensus on which prediction assessment criterion should be used for time-to-event data. Without a firm knowledge about whether the choice of evaluation criterion may affect the conclusions made as to which regularization method performs best, these comparative studies may be of limited value. In this paper, four evaluation criteria are investigated: the log-rank test for two groups, the area under the time-dependent ROC curve (AUC), an R²-measure based on the Cox partial likelihood, and an R²-measure based on the Brier score. The criteria are compared according to how they rank six widely used regularization methods that are based on the Cox regression model, namely univariate selection, principal components regression (PCR), supervised PCR, partial least squares regression, ridge regression, and the lasso. Based on our application to three microarray gene expression data sets, we find that the results obtained from the widely used log-rank test deviate from the other three criteria studied. For future studies, where one also might want to include non-likelihood or non-model-based regularization methods, we argue in favor of AUC and the R²-measure based on the Brier score, as these do not suffer from the arbitrary splitting into two groups nor depend on the Cox partial likelihood.
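To make the first of the four criteria concrete, here is a small sketch of the two-group log-rank test in plain Python. The function name and toy data are assumptions for illustration. Note that the grouping itself (for example splitting patients at the median of a prognostic index) has to be chosen before the test can be applied, which is the arbitrary step the abstract criticizes.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test; groups holds 0/1 labels.

    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    event_times = sorted({t for t, d in zip(times, events) if d})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u in times if u >= t)  # at risk, both groups
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if u == t and e and g == 1)
        # Observed minus expected events in group 1 at this time.
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of d1 given the margins.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy data: the high-risk group (label 1) fails first.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
groups = [1, 1, 1, 0, 0, 0]
chi2, p = logrank_test(times, events, groups)
print(round(chi2, 3), round(p, 3))
```

The statistic only reflects how well the two chosen groups separate, so two prediction methods with identical risk rankings can receive different log-rank results if the cut point moves.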
Collapse
Affiliation(s)
- Hege M Bøvelstad
- Department of Mathematics, University of Oslo, PO Box 1053, Blindern, Oslo NO-0316, Norway.
|