1
O'Donnell A, Cronin M, Moghaddam S, Wolsztynski E. Pre-operative prediction of BCR-free survival with mRNA variables in prostate cancer. PLoS One 2024; 19:e0311162. [PMID: 39352906 PMCID: PMC11444391 DOI: 10.1371/journal.pone.0311162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Accepted: 09/13/2024] [Indexed: 10/04/2024]
Abstract
Technological innovation has yielded opportunities to obtain mRNA expression data for prostate cancer (PCa) patients even prior to biopsy, which can be used in a precision medicine approach to treatment decision-making. This can apply in particular to predicting the risk of, and time to, biochemical recurrence (BCR). Most mRNA-based models currently proposed to this end are designed for risk classification and post-operative prediction. Effective pre-operative prediction would facilitate early treatment decision-making, in particular by indicating more appropriate therapeutic pathways for patient profiles who would likely not benefit from a systematic prostatectomy regime. The aim of this study is to investigate the possibility of leveraging mRNA information pre-operatively for BCR-free survival prediction. To do this, we considered time-to-event machine learning (ML) methodologies, rather than classification models at a specific survival horizon. We retrospectively analysed a cohort of 135 patients with clinical follow-up data and mRNA information comprising over 26,000 features (data accessible at NCBI GEO database, accession GSE21032). The performance of ML models including random survival forest, boosted and regularised Cox models was assessed in terms of model discrimination, calibration, and predictive accuracy for overall, 3-year and 5-year survival, aligning with common clinical endpoints. Results showed that the inclusion of mRNA information could yield a gain in performance for pre-operative BCR prediction. ML-based time-to-event models significantly outperformed reference nomograms that used only routine clinical information with respect to all metrics considered. We believe this is the first study proposing pre-operative transcriptomics models for BCR prediction in PCa.
External validation of these findings, including confirmation of the mRNA variables identified as potential key predictors in this study, could pave the way for pre-operative precision nomograms to facilitate timely personalised clinical decision-making.
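The abstract assesses time-to-event models by discrimination without naming a metric. As an illustration only, the sketch below computes Harrell's concordance index, one common discrimination measure for right-censored data; the choice of metric, the function name and the toy data are assumptions of this sketch, not details taken from the paper, and tied event times are simply skipped.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C for right-censored data (simplified, illustrative).

    A pair (i, j) is comparable when subject i has an observed event and
    time[i] < time[j]. The pair is concordant when the subject who failed
    earlier was given the higher risk score; tied scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: follow-up time in months, event flag (1 = BCR observed).
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.4, 0.1]  # higher score = higher predicted risk
print(concordance_index(toy_times, toy_events, toy_risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the measure is convenient for comparing a model against a clinical-only reference.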
Affiliation(s)
- Autumn O'Donnell
- School of Mathematical Sciences, Western Gateway Building, University College Cork, Cork, Ireland
- Michael Cronin
- School of Mathematical Sciences, Western Gateway Building, University College Cork, Cork, Ireland
- Shirin Moghaddam
- Department of Mathematics and Statistics (MACSI), University of Limerick, Limerick, Ireland
- Insight SFI Centre for Data Analytics, Dublin, Ireland
- Limerick Digital Cancer Research Centre (LDCRC), University of Limerick, Limerick, Ireland
- Eric Wolsztynski
- School of Mathematical Sciences, Western Gateway Building, University College Cork, Cork, Ireland
- Insight SFI Centre for Data Analytics, Dublin, Ireland
2
Zhu G, Wang X, Wang Y, Huang T, Zhang X, He J, Shi N, Chen J, Zhang J, Zhang M, Li J. Comparative transcriptomic study on the ovarian cancer between chicken and human. Poult Sci 2024; 103:104021. [PMID: 39002367 PMCID: PMC11298922 DOI: 10.1016/j.psj.2024.104021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2024] [Revised: 06/05/2024] [Accepted: 06/19/2024] [Indexed: 07/15/2024]
Abstract
The laying hen is a spontaneous model of ovarian tumors. A comprehensive comparison based on RNA-seq from hens and women may shed light on the molecular mechanisms of ovarian cancer. We performed next-generation sequencing of microRNA and mRNA expression profiles in 9 chicken ovarian cancers and 4 normal ovaries, which have been deposited in GSE246604. Together with 6 public datasets (GSE21706, GSE40376, GSE18520, GSE27651, GSE66957, TCGA-OV), we conducted a comparative transcriptomics study between chicken and human. In the present study, miR-451, miR-2188-5p, and miR-10b-5p were differentially expressed in normal ovaries, early- and late-stage ovarian cancers. We also disclosed 499 up-regulated genes and 1,061 down-regulated genes in chicken ovarian cancer. The molecular signals from 9 cancer hallmarks, 25 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and 369 Gene Ontology (GO) pathways exhibited abnormalities in ovarian cancer compared to normal ovaries via Gene Set Enrichment Analysis (GSEA). In the comparative analysis across species, we uncovered the conservation of 5 KEGG and 76 GO pathways between chicken and human, including the mismatch repair and ECM receptor interaction pathways. Moreover, a total of 174 genes that contributed to the core enrichment for these KEGG and GO pathways were identified. Among these, 22 genes were found to be associated with overall survival in patients with ovarian cancer. In general, we revealed the microRNA profiles of ovarian cancers in hens and updated the mRNA profiles previously derived from microarrays. We also disclosed the molecular pathways and core genes of ovarian cancer shared between hens and women, which informs model animal studies and gene-targeted drug development.
Affiliation(s)
- Guoqiang Zhu
- Key laboratory of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, College of Life Sciences, Sichuan University, Chengdu, China
- Xinglong Wang
- Key laboratory of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, College of Life Sciences, Sichuan University, Chengdu, China
- Yajun Wang
- Key laboratory of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, College of Life Sciences, Sichuan University, Chengdu, China
- Tianjiao Huang
- Key laboratory of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, College of Life Sciences, Sichuan University, Chengdu, China
- Xiao Zhang
- Key laboratory of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, College of Life Sciences, Sichuan University, Chengdu, China
- Jiliang He
- Key laboratory of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, College of Life Sciences, Sichuan University, Chengdu, China
- Ningkun Shi
- Key laboratory of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, College of Life Sciences, Sichuan University, Chengdu, China
- Juntao Chen
- Key laboratory of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, College of Life Sciences, Sichuan University, Chengdu, China
- Jiannan Zhang
- Key laboratory of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, College of Life Sciences, Sichuan University, Chengdu, China
- Mao Zhang
- Division of Vascular Surgery, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, 610072, China
- Juan Li
- Key laboratory of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China; Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, College of Life Sciences, Sichuan University, Chengdu, China.
3
Kim SY. GNN-surv: Discrete-Time Survival Prediction Using Graph Neural Networks. Bioengineering (Basel) 2023; 10:1046. [PMID: 37760148 PMCID: PMC10525217 DOI: 10.3390/bioengineering10091046] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/02/2023] [Revised: 08/31/2023] [Accepted: 09/04/2023] [Indexed: 09/29/2023]
Abstract
Survival prediction models play a key role in patient prognosis and personalized treatment. However, their accuracy can be improved by incorporating patient similarity networks, which uncover complex data patterns. Our study uses Graph Neural Networks (GNNs) to enhance discrete-time survival predictions (GNN-surv) by leveraging relationships in these networks. We build these networks using cancer patients' genomic and clinical data and train various GNN models on them, integrating Logistic Hazard and PMF survival models. GNN-surv models exhibit superior performance in survival prediction across two urologic cancer datasets, outperforming traditional MLP models. They maintain robustness and effectiveness under varying graph construction hyperparameter μ values, with performance boosts of up to 14.6% and 7.9% in the time-dependent concordance index and reductions in the integrated brier score of 26.7% and 24.1% in the BLCA and KIRC datasets, respectively. Notably, these models also maintain their effectiveness across three different types of GNN models, suggesting potential adaptability to other cancer datasets. The superior performance of our GNN-surv models underscores their wide applicability in the fields of oncology and personalized medicine, providing clinicians with a more accurate tool for patient prognosis and personalized treatment planning. Future studies can further optimize these models by incorporating other survival models or additional data modalities.
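The abstract refers to Logistic Hazard and PMF formulations of discrete-time survival. As a rough sketch of how such outputs relate (not the GNN-surv implementation, and with no graph component), the code below turns per-interval logits into hazards, a survival curve and a probability mass function, and evaluates the usual logistic-hazard negative log-likelihood for one subject. The function names and the idea that a network emits one logit per time interval are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def survival_from_logits(logits):
    """Map per-interval logits to hazards h_j, survival S_j and PMF f_j."""
    hazards = [sigmoid(z) for z in logits]
    survival, pmf = [], []
    alive = 1.0  # probability of being event-free before the current interval
    for h in hazards:
        pmf.append(alive * h)   # f_j = S_{j-1} * h_j
        alive *= (1.0 - h)      # S_j = S_{j-1} * (1 - h_j)
        survival.append(alive)
    return hazards, survival, pmf


def logistic_hazard_nll(logits, interval, observed):
    """Negative log-likelihood of one subject under the logistic-hazard model.

    `interval` is the index of the interval containing the event or
    censoring time; `observed` is True when the event was seen.
    """
    hazards = [sigmoid(z) for z in logits]
    nll = 0.0
    for j in range(interval):
        nll -= math.log(1.0 - hazards[j])  # survived every earlier interval
    h = hazards[interval]
    nll -= math.log(h) if observed else math.log(1.0 - h)
    return nll
```

With all logits equal to zero, every hazard is 0.5, so the survival curve halves at each interval; this makes a convenient sanity check when wiring the loss into any model.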
Affiliation(s)
- So Yeon Kim
- Department of Artificial Intelligence, Ajou University, Suwon 16499, Republic of Korea;
- Department of Software and Computer Engineering, Ajou University, Suwon 16499, Republic of Korea
4
Huang RH, Hong YK, Du H, Ke WQ, Lin BB, Li YL. A machine learning framework develops a DNA replication stress model for predicting clinical outcomes and therapeutic vulnerability in primary prostate cancer. J Transl Med 2023; 21:20. [PMID: 36635710 PMCID: PMC9835390 DOI: 10.1186/s12967-023-03872-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/08/2022] [Accepted: 01/02/2023] [Indexed: 01/13/2023]
Abstract
Recent studies have identified DNA replication stress as an important feature of advanced prostate cancer (PCa). The identification of biomarkers for DNA replication stress could therefore facilitate risk stratification and help inform treatment options for PCa. Here, we designed a robust machine learning-based framework to comprehensively explore the impact of DNA replication stress on prognosis and treatment in 5 PCa bulk transcriptomic cohorts with a total of 905 patients. Bootstrap resampling-based univariate Cox regression and Boruta algorithm were applied to select a subset of DNA replication stress genes that were more clinically relevant. Next, we benchmarked 7 survival-related machine-learning algorithms for PCa recurrence using nested cross-validation. Multi-omic and drug sensitivity data were also utilized to characterize PCa with various DNA replication stress. We found that the hyperparameter-tuned eXtreme Gradient Boosting model outperformed other tuned models and was therefore used to establish a robust replication stress signature (RSS). RSS demonstrated superior performance over most clinical features and other PCa signatures in predicting PCa recurrence across cohorts. Lower RSS was characterized by enriched metabolism pathways, high androgen activity, and a favorable prognosis. In contrast, higher RSS was significantly associated with TP53, RB1, and PTEN deletion, exhibited increased proliferation and DNA replication stress, and was more immune-suppressive with a higher chance of immunotherapy response. In silico screening identified 13 potential targets (e.g. TOP2A, CDK9, and RRM2) from 2249 druggable targets, and 2 therapeutic agents (irinotecan and topotecan) for RSS-high patients. Additionally, RSS-high patients were more responsive to taxane-based chemotherapy and Poly (ADP-ribose) polymerase inhibitors, whereas RSS-low patients were more sensitive to androgen deprivation therapy. 
In conclusion, a robust machine-learning framework was used to reveal the great potential of RSS for personalized risk stratification and therapeutic implications in PCa.
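The abstract states that the algorithms were benchmarked with nested cross-validation. The sketch below shows only the index bookkeeping of that scheme: outer folds are held out for evaluation while inner folds, drawn from the remaining samples, are used for hyperparameter tuning. Fold counts, function names and the random seed are hypothetical and are not taken from the paper.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits is a list of (fit_idx, val_idx) pairs built only from
    outer_train, so the outer test fold never influences tuning.
    """
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
    for o, test_idx in enumerate(outer_folds):
        train_idx = [i for f, fold in enumerate(outer_folds) if f != o for i in fold]
        inner_folds = k_fold_indices(train_idx, inner_k, rng)
        inner_splits = []
        for v, val_idx in enumerate(inner_folds):
            fit_idx = [i for f, fold in enumerate(inner_folds) if f != v for i in fold]
            inner_splits.append((fit_idx, val_idx))
        yield train_idx, test_idx, inner_splits
```

In use, each candidate hyperparameter setting would be scored on the inner splits, the best setting refitted on the whole outer training set, and performance reported on the untouched outer test fold.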
Affiliation(s)
- Rong-Hua Huang
- Department of Anesthesiology, The First Affiliated Hospital of Jinan University, Guangzhou, 510630, Guangdong, China
- Ying-Kai Hong
- Department of Urology, The First Affiliated Hospital of Shantou University Medical College, Shantou, 515000, Guangdong, China
- Heng Du
- Department of Secretion, Baoji Central Hospital, Baoji, 721008, Shaanxi, China
- Wei-Qi Ke
- Department of Anesthesiology, The First Affiliated Hospital of Shantou University Medical College, Shantou, 515000, Guangdong, China
- Bing-Biao Lin
- Department of Urology, Kidney and Urology Center, Pelvic Floor Disorders Center, The Seventh Affiliated Hospital, Sun Yat-Sen University, Shenzhen, 518000, Guangdong, China.
- Ya-Lan Li
- Department of Anesthesiology, The First Affiliated Hospital of Jinan University, Guangzhou, 510630, Guangdong, China.
5
Wu X, Shi Y, Wang M, Li A. CAMR: cross-aligned multimodal representation learning for cancer survival prediction. Bioinformatics 2023; 39:btad025. [PMID: 36637188 PMCID: PMC9857974 DOI: 10.1093/bioinformatics/btad025] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/24/2022] [Revised: 12/10/2022] [Accepted: 01/12/2023] [Indexed: 01/14/2023]
Abstract
MOTIVATION Accurately predicting cancer survival is crucial for helping clinicians to plan appropriate treatments, which largely improves the life quality of cancer patients and spares the related medical costs. Recent advances in survival prediction methods suggest that integrating complementary information from different modalities, e.g. histopathological images and genomic data, plays a key role in enhancing predictive performance. Despite promising results obtained by existing multimodal methods, the disparate and heterogeneous characteristics of multimodal data cause the so-called modality gap problem, which brings in dramatically diverse modality representations in feature space. Consequently, detrimental modality gaps make it difficult for comprehensive integration of multimodal information via representation learning and therefore pose a great challenge to further improvements of cancer survival prediction. RESULTS To solve the above problems, we propose a novel method called cross-aligned multimodal representation learning (CAMR), which generates both modality-invariant and -specific representations for more accurate cancer survival prediction. Specifically, a cross-modality representation alignment learning network is introduced to reduce modality gaps by effectively learning modality-invariant representations in a common subspace, which is achieved by aligning the distributions of different modality representations through adversarial training. Besides, we adopt a cross-modality fusion module to fuse modality-invariant representations into a unified cross-modality representation for each patient. Meanwhile, CAMR learns modality-specific representations which complement modality-invariant representations and therefore provides a holistic view of the multimodal data for cancer survival prediction. 
Comprehensive experiment results demonstrate that CAMR can successfully narrow modality gaps and consistently yields better performance than other survival prediction methods using multimodal data. AVAILABILITY AND IMPLEMENTATION CAMR is freely available at https://github.com/wxq-ustc/CAMR. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xingqi Wu
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
- Yi Shi
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
- Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
- Ao Li
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
6
Bayesian ridge regression for survival data based on a vine copula-based prior. AStA Advances in Statistical Analysis 2022. [DOI: 10.1007/s10182-022-00466-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/31/2022]
7
Salehitabar E, Mahdevar M, Valipour Motlagh A, Forootan FS, Feizbakhshan S, Zohrabi D, Peymani M. Identification of genes with high heterogeneity of expression as a predictor of different prognosis and therapeutic responses in colorectal cancer: a challenge and a strategy. Cancer Cell Int 2022; 22:276. [PMID: 36064367 PMCID: PMC9446546 DOI: 10.1186/s12935-022-02694-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/14/2021] [Accepted: 08/25/2022] [Indexed: 11/29/2022]
Abstract
Background: Molecular heterogeneity is one of the most important concerns in colorectal cancer (CRC), which results in a wide range of therapy responses and patient prognosis. We aimed to identify the genes with high heterogeneity of expression (HHE) and their relation with prognosis and drug resistance. Methods: Two cohort studies, The Cancer Genome Atlas (TCGA) and GSE39582, were used to discover oncogenes with HHE. The relationship between the identified genes and clinical and genomic characteristics was evaluated based on TCGA data. The GDSC and CCLE data were also used for drug resistance and sensitivity. Sixty CRC samples were used to validate the obtained data by RT-qPCR. Results: 132 genes with HHE were found to be up-regulated in both cohorts and were enriched in pathways such as hypoxia, angiogenesis, and metastasis. Forty-nine of the selected genes, related to clinical and genomic variables including stage, common mutations, tumor site, and microsatellite state, were ignored. The expression levels of CXCL1, SFTA2, SELE, and SACS, as genes with HHE, predicted patient survival, and RT-qPCR results demonstrated that levels of SELE and SACS had HHE in CRC samples. The expression of many identified genes, such as BGN, MMP7, COL11A1, FAP, KLK10, and TNFRSE11B, was associated with resistance to chemotherapy drugs. Conclusions: Some genes with HHE, including SELE, SACS, BGN, KLK10, COL11A1, and TNFRSE11B, have an oncogenic function, and their expression can be used as an indicator of differing treatment responses and survival rates in CRC. Supplementary Information: The online version contains supplementary material available at 10.1186/s12935-022-02694-9.
Affiliation(s)
- Ebrahim Salehitabar
- Department of Biology, Faculty of Science, NourDanesh Institute of Higher Education, Isfahan, Iran
- Mohammad Mahdevar
- Cellular, Molecular and Genetics Research Center, Isfahan University of Medical Sciences, Isfahan, Iran; Medical Genetics Research Center of Genome, Isfahan University of Medical Sciences, Isfahan, Iran
- Ali Valipour Motlagh
- Department of Biology, Faculty of Basic Sciences, Shahrekord Branch, Islamic Azad University, Shahrekord, Iran
- Farzad Seyed Forootan
- Medical Genetics Research Center of Genome, Isfahan University of Medical Sciences, Isfahan, Iran; Legal Medicine Research Center, Legal Medicine Organization, Tehran, Iran
- Sara Feizbakhshan
- Cellular, Molecular and Genetics Research Center, Isfahan University of Medical Sciences, Isfahan, Iran; Medical Genetics Research Center of Genome, Isfahan University of Medical Sciences, Isfahan, Iran
- Dina Zohrabi
- Department of Biology, Faculty of Science, NourDanesh Institute of Higher Education, Isfahan, Iran.
- Maryam Peymani
- Department of Biology, Faculty of Basic Sciences, Shahrekord Branch, Islamic Azad University, Shahrekord, Iran.
8
Hu R, Zhou XJ, Li W. Computational Analysis of High-Dimensional DNA Methylation Data for Cancer Prognosis. J Comput Biol 2022; 29:769-781. [PMID: 35671506 PMCID: PMC9419965 DOI: 10.1089/cmb.2022.0002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Developing cancer prognostic models using multiomics data is a major goal of precision oncology. DNA methylation provides promising prognostic biomarkers, which have been used to predict survival and treatment response in solid tumor or plasma samples. This review article presents an overview of recently published computational analyses on DNA methylation for cancer prognosis. To address the challenges of survival analysis with high-dimensional methylation data, various feature selection methods have been applied to screen a subset of informative markers. Using candidate markers associated with survival, prognostic models either predict risk scores or stratify patients into subtypes. The model's discriminatory power can be assessed by multiple evaluation metrics. Finally, we discuss the limitations of existing studies and present the prospects of applying machine learning algorithms to fully exploit the prognostic value of DNA methylation.
Affiliation(s)
- Ran Hu
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California, USA
- Bioinformatics Interdepartmental Graduate Program, University of California at Los Angeles, Los Angeles, California, USA
- Institute for Quantitative & Computational Biosciences, University of California at Los Angeles, Los Angeles, California, USA
- Xianghong Jasmine Zhou
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California, USA
- Institute for Quantitative & Computational Biosciences, University of California at Los Angeles, Los Angeles, California, USA
- Wenyuan Li
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California, USA
- Institute for Quantitative & Computational Biosciences, University of California at Los Angeles, Los Angeles, California, USA
9
Survival Risk Prediction of Esophageal Cancer Based on the Kohonen Network Clustering Algorithm and Kernel Extreme Learning Machine. Mathematics 2022. [DOI: 10.3390/math10091367] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/25/2022]
Abstract
Accurate prediction of the survival risk level of patients with esophageal cancer is significant for the selection of appropriate treatment methods. It contributes to improving the quality of life and survival chance of patients. However, considering that the characteristics of blood indices vary between individuals according to age, personal habits, living environment, etc., a unified artificial intelligence prediction model is not sufficiently precise. In order to enhance the precision of the model for the prediction of esophageal cancer survival risk, this study proposes a different model based on the Kohonen network clustering algorithm and the kernel extreme learning machine (KELM), aiming to classify the tested population into five categories and provide better efficiency with the use of machine learning. Firstly, the Kohonen network clustering method was used to cluster the patient samples, and five types of samples were obtained. Secondly, patients were divided into two risk levels based on 5-year net survival. Then, the Taylor formula was used to expand the theory to analyze the influence of different activation functions on the KELM modeling effect, and experimental verification was conducted. RBF was selected as the activation function of the KELM. Finally, the adaptive mutation sparrow search algorithm (AMSSA) was used to optimize the model parameters. The experimental results were compared with the methods of the artificial bee colony optimized support vector machine (ABC-SVM), the three layers of random forest (TLRF), the gray relational analysis–particle swarm optimization support vector machine (GP-SVM) and the mixed-effects Cox model (Cox-LMM). The results showed that the prediction model proposed in this study had certain advantages in terms of prediction accuracy and running time, and could provide support for medical personnel to choose the treatment mode of esophageal cancer patients.
10
Cottin A, Pecuchet N, Zulian M, Guilloux A, Katsahian S. IDNetwork: A deep illness‐death network based on multi‐state event history process for disease prognostication. Stat Med 2022; 41:1573-1598. [DOI: 10.1002/sim.9310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2021] [Revised: 10/28/2021] [Accepted: 12/17/2021] [Indexed: 11/12/2022]
Affiliation(s)
- Aziliz Cottin
- Healthcare and Life Sciences Research Dassault Systemes Velizy‐Villacoublay France
- Nicolas Pecuchet
- Healthcare and Life Sciences Research Dassault Systemes Velizy‐Villacoublay France
- Marine Zulian
- Healthcare and Life Sciences Research Dassault Systemes Velizy‐Villacoublay France
- Agathe Guilloux
- CNRS Université Paris‐Saclay Paris France
- Laboratoire de Mathématiques et Modélisation d'Evry Université d'Evry Evry‐Courcouronnes France
- Sandrine Katsahian
- AP‐HP Hôpital Européen Georges Pompidou, Unité de Recherche Clinique, APHP Centre Paris France
- Inserm Centre d'Investigation Clinique 1418 (CIC1418) Epidémiologie Clinique Paris France
- Inserm Centre de recherche des Cordeliers, Sorbonne Université, Université de Paris Paris France
- HeKA, INRIA PARIS Paris France
11
Development and validation of a RNAseq signature for prognostic stratification in endometrial cancer. Gynecol Oncol 2022; 164:596-606. [PMID: 35033379 DOI: 10.1016/j.ygyno.2022.01.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/27/2021] [Revised: 12/06/2021] [Accepted: 01/03/2022] [Indexed: 12/13/2022]
Abstract
BACKGROUND Despite recent advances in endometrial carcinoma (EC) molecular characterization, its prognostication remains challenging. We aimed to assess whether RNAseq could stratify EC patient prognosis beyond current classification systems. METHODS A prognostic signature was identified using a LASSO-penalized Cox model trained on TCGA (N = 543 patients). A clinically applicable polyA-RNAseq-based workflow was developed for validation of the signature in a cohort of stage I-IV patients treated in two hospitals (2010-2017). Model performance was evaluated using time-dependent ROC curves (prediction of disease-specific survival (DSS)). The additional value of the RNAseq signature was evaluated by a multivariable Cox model adjusted for the high-risk prognostic group (2021 ESGO-ESTRO-ESP guidelines: non-endometrioid histology or stage III-IVA or TP53-mutated molecular subgroup). RESULTS Among 209 patients included in the external validation cohort, 61 (30%), 10 (5%), 52 (25%), and 82 (40%) had mismatch repair-deficient, POLE-mutated, TP53-mutated tumors, and tumors with no specific molecular profile, respectively. The 38-gene signature accurately predicted DSS (AUC = 0.80). Most disease-related deaths occurred in high-risk patients (5-year DSS = 78% (95% CI = [68%-89%]) versus 99% [97%-100%] in patients not at high risk). A composite classifier accounting for the TP53-mutated subgroup and the RNAseq signature identified three classes independently associated with DSS: RNAseq-good prognosis (reference, 5-year DSS = 99%), non-TP53 tumors but with RNAseq-poor prognosis (adjusted hazard ratio (aHR) = 5.75, 95% CI [1.14-29.0]), and TP53-mutated subgroup (aHR = 5.64 [1.12-28.3]). The model accounting for the high-risk group and the composite classifier predicted DSS with AUC = 0.84, versus AUC = 0.76 without (p = 0.01).
CONCLUSION RNA-seq profiling can provide additional prognostic information to established classification systems, and warrants validation for potential RNAseq-based therapeutic strategies in EC.
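The signature above was identified with a LASSO-penalized Cox model. As a hedged illustration of what such a fit minimises, the sketch below evaluates the Cox partial log-likelihood and adds an L1 penalty. The 1/n scaling, the absence of tie handling and all names are assumptions of this sketch; it only evaluates the objective and does not perform the optimisation or reproduce the authors' pipeline.

```python
import math


def cox_partial_loglik(beta, X, times, events):
    """Cox partial log-likelihood with no special handling of tied times."""
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    loglik = 0.0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at subject i's event time.
        denom = sum(math.exp(eta[j]) for j in range(len(times)) if times[j] >= times[i])
        loglik += eta[i] - math.log(denom)
    return loglik


def lasso_cox_objective(beta, X, times, events, lam):
    """Penalised objective: -loglik / n + lam * ||beta||_1 (to be minimised)."""
    n = len(times)
    penalty = lam * sum(abs(b) for b in beta)
    return -cox_partial_loglik(beta, X, times, events) / n + penalty
```

Larger values of `lam` push more coefficients to exactly zero at the minimiser, which is how a penalised fit on thousands of transcripts can end up with a signature of a few dozen genes.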
12
Signorelli M, Spitali P, Szigyarto CAK, Tsonaka R. Penalized regression calibration: A method for the prediction of survival outcomes using complex longitudinal and high-dimensional data. Stat Med 2021; 40:6178-6196. [PMID: 34464990 PMCID: PMC9293191 DOI: 10.1002/sim.9178] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/25/2021] [Revised: 08/10/2021] [Accepted: 08/10/2021] [Indexed: 11/18/2022]
Abstract
Longitudinal and high‐dimensional measurements have become increasingly common in biomedical research. However, methods to predict survival outcomes using covariates that are both longitudinal and high‐dimensional are currently missing. In this article, we propose penalized regression calibration (PRC), a method that can be employed to predict survival in such situations. PRC comprises three modeling steps: First, the trajectories described by the longitudinal predictors are flexibly modeled through the specification of multivariate mixed effects models. Second, subject‐specific summaries of the longitudinal trajectories are derived from the fitted mixed models. Third, the time to event outcome is predicted using the subject‐specific summaries as covariates in a penalized Cox model. To ensure a proper internal validation of the fitted PRC models, we furthermore develop a cluster bootstrap optimism correction procedure (CBOCP) that corrects for the optimistic bias of apparent measures of predictiveness. PRC and the CBOCP are implemented in the R package pencal, available from CRAN. After studying the behavior of PRC via simulations, we conclude by illustrating an application of PRC to data from an observational study that involved patients affected by Duchenne muscular dystrophy, where the goal is to predict time to loss of ambulation using longitudinal blood biomarkers.
Affiliation(s)
- Mirko Signorelli
  - Mathematical Institute, Leiden University, Leiden, The Netherlands
- Pietro Spitali
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Roula Tsonaka
  - Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
|
13
|
Bertrand F, Maumy-Bertrand M. Fitting and Cross-Validating Cox Models to Censored Big Data With Missing Values Using Extensions of Partial Least Squares Regression Models. Front Big Data 2021; 4:684794. [PMID: 34790895 PMCID: PMC8591675 DOI: 10.3389/fdata.2021.684794] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2021] [Accepted: 10/07/2021] [Indexed: 11/22/2022] Open
Abstract
Fitting Cox models in a big data context (on a massive scale in terms of volume, intensity, and complexity, exceeding the capacity of usual analytic tools) is often challenging. If some data are missing, it is even more difficult. We proposed algorithms that were able to fit Cox models in high-dimensional settings using extensions of partial least squares regression to the Cox models. Some of them were able to cope with missing data. We were recently able to extend our most recent algorithms to big data, thus allowing Cox models to be fitted to big data with missing values. When cross-validating standard or extended Cox models, the commonly used criterion is the cross-validated partial log-likelihood using a naive or a van Houwelingen scheme, the latter making efficient use of the death times of the left-out data in relation to the death times of all the data. Quite astonishingly, we will show, using a strong simulation study involving three different data simulation algorithms, that these two cross-validation methods fail with the extensions, either straightforward or more involved ones, of partial least squares regression to the Cox model. This is quite an interesting result for at least two reasons. Firstly, several nice features of PLS-based models, including regularization, interpretability of the components, missing data support, data visualization thanks to biplots of individuals and variables (and even parsimony or group parsimony for sparse partial least squares or sparse group SPLS based models), account for a common use of these extensions by statisticians who usually select their hyperparameters using cross-validation. Secondly, they are almost always featured in benchmarking studies to assess the performance of a new estimation technique used in a high-dimensional or big data context and often show poor statistical properties. We carried out a vast simulation study to evaluate more than a dozen potential cross-validation criteria, either AUC or prediction error based. 
Several of them lead to the selection of a reasonable number of components. Using these newly found cross-validation criteria to fit extensions of partial least squares regression to the Cox model, we performed a benchmark reanalysis that showed enhanced performances of these techniques. In addition, we proposed sparse group extensions of our algorithms and defined a new robust measure based on the Schmid score and the R coefficient of determination for least absolute deviation: the integrated R Schmid Score weighted. The R-package used in this article is available on the CRAN, http://cran.r-project.org/web/packages/plsRcox/index.html. The R package bigPLS will soon be available on the CRAN and, until then, is available on Github https://github.com/fbertran/bigPLS.
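To make the cross-validated partial log-likelihood criterion mentioned in this abstract concrete, here is a minimal Python sketch. It is not taken from plsRcox or bigPLS. It assumes no tied event times, takes the linear predictor as given (fitted with the fold left out), and uses the usual form of the van Houwelingen contribution: the partial log-likelihood on all data minus the partial log-likelihood on the data without the held-out fold. The function names `cox_partial_loglik` and `van_houwelingen_fold_score` are hypothetical.

```python
import math


def cox_partial_loglik(times, events, lin_pred):
    """Cox partial log-likelihood for a given linear predictor (no tie handling)."""
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_sum = sum(math.exp(lin_pred[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += lin_pred[i] - math.log(risk_sum)
    return total


def van_houwelingen_fold_score(times, events, lin_pred, fold_idx):
    """Contribution of one fold: l(all data) - l(data without the fold).

    lin_pred is assumed to come from coefficients fitted WITHOUT the fold.
    """
    held_out = set(fold_idx)
    keep = [i for i in range(len(times)) if i not in held_out]
    full = cox_partial_loglik(times, events, lin_pred)
    reduced = cox_partial_loglik(
        [times[i] for i in keep],
        [events[i] for i in keep],
        [lin_pred[i] for i in keep],
    )
    return full - reduced


# Toy data: three events, a null model (all linear predictors equal to zero).
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
lin_pred = [0.0, 0.0, 0.0]
print(cox_partial_loglik(times, events, lin_pred))               # -log(6)
print(van_houwelingen_fold_score(times, events, lin_pred, [0]))  # -log(3)
```

Summing the fold scores over all folds gives the cross-validated criterion. The naive scheme would instead evaluate the partial log-likelihood on the held-out fold alone, where risk sets can be very small.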
Affiliation(s)
- Frédéric Bertrand
  - LIST3N, Université de Technologie de Troyes, Troyes, France
  - IRMA, CNRS UMR 7501, Labex IRMIA, Université de Strasbourg, Strasbourg, France
- Myriam Maumy-Bertrand
  - LIST3N, Université de Technologie de Troyes, Troyes, France
  - IRMA, CNRS UMR 7501, Labex IRMIA, Université de Strasbourg, Strasbourg, France
|
14
|
Pungpapong V. Incorporating biological networks into high-dimensional Bayesian survival analysis using an ICM/M algorithm. J Bioinform Comput Biol 2021; 19:2150027. [PMID: 34693885 DOI: 10.1142/s021972002150027x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The Cox proportional hazards model has been widely used in cancer genomic research that aims to identify genes from high-dimensional gene expression space associated with the survival time of patients. With the increase in expertly curated biological pathways, it is challenging to incorporate such complex networks in fitting a high-dimensional Cox model. This paper considers a Bayesian framework that employs the Ising prior to capture relations among genes represented by graphs. A spike-and-slab prior is also assigned to each of the coefficients for the purpose of variable selection. The iterated conditional modes/medians (ICM/M) algorithm is proposed for implementation in Cox models. The ICM/M estimates hyperparameters using conditional modes and obtains coefficients through conditional medians. This procedure produces some coefficients that are exactly zero, making the model more interpretable. Comparisons of the ICM/M and other regularized Cox models were carried out with both simulated and real data. Compared to lasso, adaptive lasso, elastic net, and DegreeCox, the ICM/M yielded more parsimonious models with consistent variable selection. The ICM/M model also provided a smaller number of false positives than the other methods and showed promising results in terms of predictive accuracy. In terms of computing times among the network-aware methods, the ICM/M algorithm is substantially faster than DegreeCox even when incorporating a large complex network. The implementation of the ICM/M algorithm for the Cox regression model is provided in the R package icmm, available on the Comprehensive R Archive Network (CRAN).
Affiliation(s)
- Vitara Pungpapong
  - Department of Statistics, Greater Data Science Lab, Chulalongkorn Business School, Chulalongkorn University, Phyathai Road, Pathumwan, Bangkok, Thailand
|
15
|
Emura T, Hsu WC, Chou WC. A survival tree based on stabilized score tests for high-dimensional covariates. J Appl Stat 2021; 50:264-290. [PMID: 36698545 PMCID: PMC9870022 DOI: 10.1080/02664763.2021.1990224] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 01/28/2023]
Abstract
A survival tree can classify subjects into different survival prognostic groups. However, when data contain high-dimensional covariates, the two popular classification trees exhibit fatal drawbacks. The logrank tree is unstable and tends to have false nodes; for the conditional inference tree, the adjusted P-value is difficult to interpret in high-dimensional tests. Motivated by these problems, we propose a new survival tree based on the stabilized score tests. We propose a novel matrix-based algorithm to test a number of nodes simultaneously via stabilized score tests. We propose a recursive partitioning algorithm to construct a survival tree and develop our original R package uni.survival.tree (https://cran.r-project.org/package=uni.survival.tree) for implementation. Simulations are performed to demonstrate the superiority of the proposed method over the existing methods. The lung cancer data analysis demonstrates the usefulness of the proposed method.
Affiliation(s)
- Takeshi Emura
  - Biostatistics Center, Kurume University, 67 Asahi-machi, Kurume, Japan
- Wei-Chern Hsu
  - Graduate Institute of Statistics, National Central University, Taoyuan, Taiwan
- Wen-Chi Chou
  - Department of Hematology and Oncology, Chang Gung Memorial Hospital and College of Medicine, Chang Gung University, Taoyuan, Taiwan
|
16
|
A comparative study of forest methods for time-to-event data: variable selection and predictive performance. BMC Med Res Methodol 2021; 21:193. [PMID: 34563138 PMCID: PMC8465777 DOI: 10.1186/s12874-021-01386-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/19/2021] [Accepted: 09/02/2021] [Indexed: 11/17/2022] Open
Abstract
Background As a popular method in the machine learning field, the forest approach is an attractive alternative to the Cox model. Random survival forests (RSF) is the most popular survival forest method, but it has drawbacks, such as a selection bias towards covariates with many possible split points. Conditional inference forests (CIF) are known to reduce this selection bias via a two-step split procedure that implements hypothesis tests and separates variable selection from splitting, but the method is computationally expensive. The recently proposed random forests with maximally selected rank statistics (MSR-RF) appear to be a great improvement on RSF and CIF. Methods In this paper we used a simulation study and a real data application to compare the prediction and variable selection performance of three survival forest methods: RSF, CIF and MSR-RF. To evaluate variable selection performance, we combined all simulations to calculate how frequently the correct variables ranked at the top of the variable importance measures, where a higher frequency means better selection ability. We used the Integrated Brier Score (IBS) and the c-index to measure the prediction accuracy of all three methods. The smaller the IBS value, the better the prediction. Results Simulations show that the three forest methods differ slightly in prediction performance. MSR-RF and RSF might perform better than CIF when there are only continuous or binary variables in the datasets. For variable selection performance, when there are multiple categorical variables in the datasets, the selection frequency of RSF seems to be lowest in most cases. MSR-RF and CIF have higher selection rates, and CIF performs well especially with the interaction term. The fact that the degree of correlation among the variables has little effect on the selection frequency indicates that the three forest methods can handle correlated data. 
When there are only continuous variables in the datasets, MSR-RF performs better. When there are only binary variables in the datasets, RSF and MSR-RF have more advantages than CIF. When the variable dimension increases, MSR-RF and RSF seem to be more robust than CIF. Conclusions All three methods show advantages in prediction and variable selection performance under different situations. The recently proposed MSR-RF methodology has practical value and deserves wider use. It is important to identify the appropriate method in real use according to the research aim and the nature of the covariates. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-021-01386-8.
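The c-index used as a discrimination measure in this abstract (and in several others in this list) can be illustrated with a short Python sketch of Harrell's concordance index for right-censored data. This is an illustrative implementation, not the code used in the cited study. It assumes that a higher risk score means shorter expected survival, skips pairs with tied observed times for simplicity, and the function name `concordance_index` is hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's c-index: fraction of comparable pairs ranked correctly.

    A pair is comparable when the subject with the shorter observed time
    experienced the event (otherwise the true ordering is unknown).
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue  # tied times or censored earlier subject: not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk failed first: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: the third subject is censored at t = 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]
print(concordance_index(times, events, risk))  # 0.8 (4 of 5 comparable pairs)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The Integrated Brier Score additionally measures calibration over time and requires censoring weights, which this sketch does not cover.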
|
17
|
Spirko-Burns L, Devarajan K. Supervised Dimension Reduction for Large-Scale "Omics" Data With Censored Survival Outcomes Under Possible Non-Proportional Hazards. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2032-2044. [PMID: 31940547 DOI: 10.1109/tcbb.2020.2965934] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
The past two decades have witnessed significant advances in high-throughput "omics" technologies such as genomics, proteomics, metabolomics, transcriptomics and radiomics. These technologies have enabled simultaneous measurement of the expression levels of tens of thousands of features from individual patient samples and have generated enormous amounts of data that require analysis and interpretation. One specific area of interest has been in studying the relationship between these features and patient outcomes, such as overall and recurrence-free survival, with the goal of developing a predictive "omics" profile. Large-scale studies often suffer from the presence of a large fraction of censored observations and potential time-varying effects of features, and methods for handling them have been lacking. In this paper, we propose supervised methods for feature selection and survival prediction that simultaneously deal with both issues. Our approach utilizes continuum power regression (CPR) - a framework that includes a variety of regression methods - in conjunction with the parametric or semi-parametric accelerated failure time (AFT) model. Both CPR and AFT fall within the linear models framework and, unlike black-box models, the proposed prognostic index has a simple yet useful interpretation. We demonstrate the utility of our methods using simulated and publicly available cancer genomics data.
|
18
|
He X, Sun X, Shao Y. Network-based survival analysis to discover target genes for developing cancer immunotherapies and predicting patient survival. J Appl Stat 2021; 48:1352-1373. [PMID: 35444359 DOI: 10.1080/02664763.2020.1812543] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 02/07/2023]
Abstract
Recently, cancer immunotherapies have been life-savers; however, only a fraction of treated patients have durable responses. Consequently, statistical methods that enable the discovery of target genes for developing new treatments and predicting patient survival are of importance. This paper introduced a network-based survival analysis method and applied it to identify candidate genes as possible targets for developing new treatments. RNA-seq data from a mouse study were used to select differentially expressed genes, which were then translated to those in humans. We constructed a gene network and identified gene clusters using a training set of 310 human gliomas. Then we conducted gene set enrichment analysis to select the gene clusters with significant biological function. A penalized Cox model was built to identify a small set of candidate genes to predict survival. An independent set of 690 human glioma samples was used to evaluate the predictive accuracy of the survival model. The areas under time-dependent ROC curves in both the training and validation sets are more than 90%, indicating a strong association between selected genes and patient survival. Consequently, potential biomedical interventions targeting these genes might be able to alter their expression and prolong patient survival.
|
19
|
Spooner A, Chen E, Sowmya A, Sachdev P, Kochan NA, Trollor J, Brodaty H. A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction. Sci Rep 2020; 10:20410. [PMID: 33230128 PMCID: PMC7683682 DOI: 10.1038/s41598-020-77220-w] [Citation(s) in RCA: 96] [Impact Index Per Article: 19.2] [Received: 01/14/2020] [Accepted: 11/05/2020] [Indexed: 12/22/2022] Open
Abstract
Data collected from clinical trials and cohort studies, such as dementia studies, are often high-dimensional, censored, heterogeneous and contain missing information, presenting challenges to traditional statistical analysis. There is an urgent need for methods that can overcome these challenges to model these complex data. At present there is no cure for dementia and no treatment that can successfully change the course of the disease. Machine learning models that can predict the time until a patient develops dementia are important tools in helping understand dementia risks and can give more accurate results than traditional statistical methods when modelling high-dimensional, heterogeneous, clinical data. This work compares the performance and stability of ten machine learning algorithms, combined with eight feature selection methods, capable of performing survival analysis of high-dimensional, heterogeneous, clinical data. We developed models that predict survival to dementia using baseline data from two different studies. The Sydney Memory and Ageing Study (MAS) is a longitudinal cohort study of 1037 participants, aged 70-90 years, that aims to determine the effects of ageing on cognition. The Alzheimer's Disease Neuroimaging Initiative (ADNI) is a longitudinal study aimed at identifying biomarkers for the early detection and tracking of Alzheimer's disease. Using the concordance index as a measure of performance, our models achieve maximum performance values of 0.82 for MAS and 0.93 for ADNI.
Affiliation(s)
- Annette Spooner
  - School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
- Emily Chen
  - School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
- Arcot Sowmya
  - School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
- Perminder Sachdev
  - School of Psychiatry, UNSW Sydney, Sydney, Australia
  - Centre for Healthy Brain Ageing (CHeBA), UNSW Sydney, Sydney, Australia
- Nicole A Kochan
  - Centre for Healthy Brain Ageing (CHeBA), UNSW Sydney, Sydney, Australia
- Julian Trollor
  - School of Psychiatry, UNSW Sydney, Sydney, Australia
  - Centre for Healthy Brain Ageing (CHeBA), UNSW Sydney, Sydney, Australia
  - Department of Developmental Disability Neuropsychiatry, School of Psychiatry, UNSW Sydney, Sydney, Australia
- Henry Brodaty
  - School of Psychiatry, UNSW Sydney, Sydney, Australia
  - Centre for Healthy Brain Ageing (CHeBA), UNSW Sydney, Sydney, Australia
|
20
|
An efficient algorithm for joint feature screening in ultrahigh-dimensional Cox’s model. Comput Stat 2020. [DOI: 10.1007/s00180-020-01032-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
21
|
Theilhaber J, Chiron M, Dreymann J, Bergstrom D, Pollard J. Construction and optimization of gene expression signatures for prediction of survival in two-arm clinical trials. BMC Bioinformatics 2020; 21:333. [PMID: 32711453 PMCID: PMC7382041 DOI: 10.1186/s12859-020-03655-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/04/2019] [Accepted: 07/13/2020] [Indexed: 11/17/2022] Open
Abstract
Background Gene expression signatures for the prediction of differential survival of patients undergoing anti-cancer therapies are of great interest because they can be used to prospectively stratify patients entering new clinical trials, or to determine optimal treatment for patients in more routine clinical settings. Unlike prognostic signatures, however, predictive signatures require training set data from clinical studies with at least two treatment arms. As two-arm studies with gene expression profiling have been rarer than similar one-arm studies, the methodology for constructing and optimizing predictive signatures has been less prominently explored than for prognostic signatures. Results Focusing on two “use cases” of two-arm clinical trials, one for metastatic colorectal cancer (CRC) patients treated with the anti-angiogenic molecule aflibercept, and the other for triple negative breast cancer (TNBC) patients treated with the small molecule iniparib, we present derivation steps and quantitative and graphical tools for the construction and optimization of signatures for the prediction of progression-free survival based on cross-validated multivariate Cox models. This general methodology is organized around two more specific approaches which we have called subtype correlation (subC) and mechanism-of-action (MOA) modeling, each of which leverages a priori knowledge of molecular subtypes of tumors or drug MOA for a given indication. The tools and concepts presented here include the so-called differential log-hazard ratio, the survival scatter plot, the hazard ratio receiver operating characteristic, the area between curves and the patient selection matrix. In the CRC use case, for instance, the resulting signature stratifies the patient population into “sensitive” and “relatively-resistant” groups, achieving a more than two-fold difference in the aflibercept-to-control hazard ratios across signature-defined patient groups. 
Through cross-validation and resampling the probability of generalization of the signature to similar CRC data sets is predicted to be high. Conclusions The tools presented here should be of general use for building and using predictive multivariate signatures in oncology and in other therapeutic areas.
Affiliation(s)
- Marielle Chiron
  - Sanofi Oncology, Centre de Recherche de Vitry-Alfortville, 13 Quai Jules Guesde, 94400, Vitry-sur-Seine, France
- Jennifer Dreymann
  - Sanofi Oncology, Centre de Recherche de Vitry-Alfortville, 13 Quai Jules Guesde, 94400, Vitry-sur-Seine, France
- Jack Pollard
  - Sanofi Oncology, 270 Albany Street, Cambridge, MA, 02139, USA
|
22
|
Comparison of Variable Selection Methods for Time-to-Event Data in High-Dimensional Settings. Computational and Mathematical Methods in Medicine 2020; 2020:6795392. [PMID: 32670394 PMCID: PMC7350178 DOI: 10.1155/2020/6795392] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/15/2020] [Revised: 05/06/2020] [Accepted: 05/20/2020] [Indexed: 12/01/2022]
Abstract
Over the last decades, molecular signatures have become increasingly important in oncology and are opening up a new area of personalized medicine. Nevertheless, biological relevance and statistical tools necessary for the development of these signatures have been called into question in the literature. Here, we investigate six typical selection methods for high-dimensional settings and survival endpoints, including LASSO and some of its extensions, component-wise boosting, and random survival forests (RSF). A resampling algorithm based on data splitting was used on nine high-dimensional simulated datasets to assess selection stability on training sets and the intersection between selection methods. Prognostic performances were evaluated on respective validation sets. Finally, one application on a real breast cancer dataset has been proposed. The false discovery rate (FDR) was high for each selection method, and the intersection between lists of predictors was very poor. RSF selects many more variables than the other methods and thus becomes less efficient on validation sets. Due to the complex correlation structure in genomic data, stability in the selection procedure is generally poor for selected predictors, but can be improved with a higher training sample size. In a very high-dimensional setting, we recommend the LASSO-pcvl method since it outperforms other methods by reducing the number of selected genes and minimizing FDR in most scenarios. Nevertheless, this method still gives a high rate of false positives. Further work is thus necessary to propose new methods to overcome this issue where numerous predictors are present. Pluridisciplinary discussion between clinicians and statisticians is necessary to ensure both statistical and biological relevance of the predictors included in molecular signatures.
|
23
|
Mubeen S, Hoyt CT, Gemünd A, Hofmann-Apitius M, Fröhlich H, Domingo-Fernández D. The Impact of Pathway Database Choice on Statistical Enrichment Analysis and Predictive Modeling. Front Genet 2019; 10:1203. [PMID: 31824580 PMCID: PMC6883970 DOI: 10.3389/fgene.2019.01203] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 08/23/2019] [Accepted: 10/30/2019] [Indexed: 02/04/2023] Open
Abstract
Pathway-centric approaches are widely used to interpret and contextualize -omics data. However, databases contain different representations of the same biological pathway, which may lead to different results of statistical enrichment analysis and predictive models in the context of precision medicine. We have performed an in-depth benchmarking of the impact of pathway database choice on statistical enrichment analysis and predictive modeling. We analyzed five cancer datasets using three major pathway databases and developed an approach to merge several databases into a single integrative one: MPath. Our results show that equivalent pathways from different databases yield disparate results in statistical enrichment analysis. Moreover, we observed a significant dataset-dependent impact on the performance of machine learning models on different prediction tasks. In some cases, MPath significantly improved prediction performance and also reduced the variance of prediction performances. Furthermore, MPath yielded more consistent and biologically plausible results in statistical enrichment analyses. In summary, this benchmarking study demonstrates that pathway database choice can influence the results of statistical enrichment analysis and predictive modeling. Therefore, we recommend the use of multiple pathway databases or integrative ones.
Affiliation(s)
- Sarah Mubeen
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Charles Tapley Hoyt
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- André Gemünd
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
- Martin Hofmann-Apitius
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Holger Fröhlich
  - Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Daniel Domingo-Fernández
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
|
24
|
Gene expression based survival prediction for cancer patients-A topic modeling approach. PLoS One 2019; 14:e0224446. [PMID: 31730620 PMCID: PMC6857918 DOI: 10.1371/journal.pone.0224446] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 03/25/2019] [Accepted: 10/14/2019] [Indexed: 12/21/2022] Open
Abstract
Cancer is one of the leading causes of death worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high dimensionality of such gene expression data, many projects use some dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional gene expression data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient's cancer, we represent each patient (≈ document) as a mixture over cancer-topics, where each cancer-topic is a mixture over gene expression values (≈ words). This required some extensions to the standard LDA model (e.g., to accommodate the real-valued expression values), leading to our novel discretized Latent Dirichlet Allocation (dLDA) procedure. After using this dLDA to learn these cancer-topics, we can then express each patient as a distribution over a small number of cancer-topics, then use this low-dimensional "distribution vector" as input to a learning algorithm; here, we ran the recent survival prediction algorithm, MTLR, on this representation of the cancer dataset. We initially focus on the METABRIC dataset, which describes each of n = 1,981 breast cancer patients using the r = 49,576 gene expression values, from microarrays. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than standard models, in terms of the standard Concordance measure. 
We then validate this "dLDA+MTLR" approach by running it on the n = 883 Pan-kidney (KIPAN) dataset, over r = 15,529 gene expression values (here using the mRNAseq modality), and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent "D-calibrated" measure. These successes, in two different cancer types and expression modalities, demonstrate the generality and the effectiveness of this approach. The dLDA+MTLR source code is available at https://github.com/nitsanluke/GE-LDA-Survival.
|
25
|
Abstract
This paper introduces the paired lasso: a generalisation of the lasso for paired covariate settings. Our aim is to predict a single response from two high-dimensional covariate sets. We assume a one-to-one correspondence between the covariate sets, with each covariate in one set forming a pair with a covariate in the other set. Paired covariates arise, for example, when two transformations of the same data are available. It is often unknown which of the two covariate sets leads to better predictions, or whether the two covariate sets complement each other. The paired lasso addresses this problem by weighting the covariates to improve the selection from the covariate sets and the covariate pairs. It thereby combines information from both covariate sets and accounts for the paired structure. We tested the paired lasso on more than 2000 classification problems with experimental genomics data, and found that for estimating sparse but predictive models, the paired lasso outperforms the standard and the adaptive lasso. The R package is available from CRAN.
|
26
|
Molstad AJ, Hsu L, Sun W. Gaussian process regression for survival time prediction with genome-wide gene expression. Biostatistics 2019; 22:164-180. [PMID: 31292609 DOI: 10.1093/biostatistics/kxz023] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/24/2018] [Revised: 04/07/2019] [Accepted: 05/13/2019] [Indexed: 11/14/2022] Open
Abstract
Predicting the survival time of a cancer patient based on his/her genome-wide gene expression remains a challenging problem. For certain types of cancer, the effects of gene expression on survival are both weak and abundant, so identifying non-zero effects with reasonable accuracy is difficult. As an alternative to methods that use variable selection, we propose a Gaussian process accelerated failure time model to predict survival time using genome-wide or pathway-wide gene expression data. Using a Monte Carlo expectation-maximization algorithm, we jointly impute censored log-survival time and estimate model parameters. We demonstrate the performance of our method and its advantage over existing methods in both simulations and real data analysis. The real data that we analyze were collected from 513 patients with kidney renal clear cell carcinoma and include survival time, demographic/clinical variables, and expression of more than 20 000 genes. In addition to the right-censored survival time, our method can also accommodate left-censored or interval-censored outcomes; and it provides a natural way to combine multiple types of high-dimensional -omics data. An R package implementing our method is available in the Supplementary material available at Biostatistics online.
Affiliation(s)
- Aaron J Molstad
- Biostatistics Program, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA, USA
| | - Li Hsu
- Biostatistics Program, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA, USA and Department of Biostatistics, University of Washington, 1705 NE Pacific St, Seattle, WA 98195, USA
| | - Wei Sun
- Biostatistics Program, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA, USA and Department of Biostatistics, University of Washington, 1705 NE Pacific St, Seattle, WA 98195, USA and Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Dr, Chapel Hill, NC 27599, USA
|
27
|
Identification and clinical validation of a multigene assay that interrogates the biology of cancer stem cells and predicts metastasis in breast cancer: A retrospective consecutive study. EBioMedicine 2019; 42:352-362. [PMID: 30846393 PMCID: PMC6491379 DOI: 10.1016/j.ebiom.2019.02.036] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2019] [Accepted: 02/18/2019] [Indexed: 11/23/2022] Open
Abstract
Background Breast cancers show variations in the number and biological aggressiveness of cancer stem cells that correlate with their clinico-prognostic and molecular heterogeneity. Thus, prognostic stratification of breast cancers based on cancer stem cells might help guide patient management. Methods We derived a 20-gene stem cell signature from the transcriptional profile of normal mammary stem cells, capable of identifying breast cancers with a homogeneous profile and poor prognosis in in silico analyses. The clinical value of this signature was assessed in a prospective-retrospective cohort of 2,453 breast cancer patients. Models for predicting individual risk of metastasis were developed from expression data of the 20 genes in patients randomly assigned to a training set, using the ridge-penalized Cox regression, and tested in an independent validation set. Findings Analyses revealed that the 20-gene stem cell signature provided prognostic information in Triple-Negative and Luminal breast cancer patients, independently of standard clinicopathological parameters. Through functional studies in individual tumours, we correlated the risk score assigned by the signature with the proliferative and self-renewal potential of the cancer stem cell population. By retraining the 20-gene signature in Luminal patients, we derived the risk model, StemPrintER, which predicted early and late recurrence independently of standard prognostic factors. Interpretation Our findings indicate that the 20-gene stem cell signature, by its unique ability to interrogate the biology of cancer stem cells of the primary tumour, provides a reliable estimate of metastatic risk in Triple-Negative and Luminal breast cancer patients independently of standard clinicopathological parameters.
|
28
|
Emura T, Matsui S, Chen HY. compound.Cox: Univariate feature selection and compound covariate for predicting survival. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 168:21-37. [PMID: 30527130 DOI: 10.1016/j.cmpb.2018.10.020] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/28/2018] [Revised: 09/26/2018] [Accepted: 10/26/2018] [Indexed: 05/15/2023]
Abstract
BACKGROUND AND OBJECTIVE Univariate feature selection is one of the simplest and most commonly used techniques to develop a multigene predictor for survival. Presently, there is no software tailored to perform univariate feature selection and predictor construction. METHODS We develop the compound.Cox R package that implements univariate significance tests (via the Wald tests or score tests) for feature selection. We provide a cross-validation algorithm to measure predictive capability of selected genes and a permutation algorithm to assess the false discovery rate. We also provide three algorithms for constructing a multigene predictor (compound covariate, compound shrinkage, and copula-based methods), which are tailored to the subset of genes obtained from univariate feature selection. We demonstrate our package using survival data on the lung cancer patients. We examine the predictive capability of the developed algorithms by the lung cancer data and simulated data. RESULTS The developed R package, compound.Cox, is available on the CRAN repository. The statistical tools in compound.Cox allow researchers to determine an optimal significance level of the tests, thus providing researchers an optimal subset of genes for prediction. The package also allows researchers to compute the false discovery rate and various prediction algorithms.
Affiliation(s)
- Takeshi Emura
- Graduate Institute of Statistics, National Central University, Zhongda Road, Zhongli District, Taoyuan 32001, Taiwan.
| | - Shigeyuki Matsui
- Department of Biostatistics, Nagoya University Graduate School of Medicine, 65 Tsurumai-cho, Showa-ku, Nagoya, 466-8550, Japan
| | - Hsuan-Yu Chen
- Institute of Statistical Science, Academia Sinica, 128 Academia Road Sec.2, Nankang Taipei 115, Taiwan
|
29
|
Mubeen S, Hoyt CT, Gemünd A, Hofmann-Apitius M, Fröhlich H, Domingo-Fernández D. The Impact of Pathway Database Choice on Statistical Enrichment Analysis and Predictive Modeling. Front Genet 2019. [PMID: 31824580 DOI: 10.3389/fgene.2019.01203] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/06/2023] Open
Abstract
Pathway-centric approaches are widely used to interpret and contextualize -omics data. However, databases contain different representations of the same biological pathway, which may lead to different results of statistical enrichment analysis and predictive models in the context of precision medicine. We have performed an in-depth benchmarking of the impact of pathway database choice on statistical enrichment analysis and predictive modeling. We analyzed five cancer datasets using three major pathway databases and developed an approach to merge several databases into a single integrative one: MPath. Our results show that equivalent pathways from different databases yield disparate results in statistical enrichment analysis. Moreover, we observed a significant dataset-dependent impact on the performance of machine learning models on different prediction tasks. In some cases, MPath significantly improved prediction performance and also reduced the variance of prediction performances. Furthermore, MPath yielded more consistent and biologically plausible results in statistical enrichment analyses. In summary, this benchmarking study demonstrates that pathway database choice can influence the results of statistical enrichment analysis and predictive modeling. Therefore, we recommend the use of multiple pathway databases or integrative ones.
Affiliation(s)
- Sarah Mubeen
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
- Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
| | - Charles Tapley Hoyt
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
- Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
| | - André Gemünd
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
| | - Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
- Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
| | - Holger Fröhlich
- Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
| | - Daniel Domingo-Fernández
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
- Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
|
30
|
Yi M, Zhu R, Stephens RM. GradientScanSurv-An exhaustive association test method for gene expression data with censored survival outcome. PLoS One 2018; 13:e0207590. [PMID: 30517129 PMCID: PMC6281197 DOI: 10.1371/journal.pone.0207590] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2018] [Accepted: 11/03/2018] [Indexed: 12/22/2022] Open
Abstract
Accurate assessment of the association between continuous variables such as gene expression and survival is a critical aspect of precision medicine. In this report, we provide a review of some of the available survival analysis and validation tools by referencing published studies that have utilized these tools. We have identified pitfalls associated with the assumptions inherent in those applications that have the potential to impact scientific research through their potential bias. In order to overcome these pitfalls, we have developed a novel method that enables the logrank test method to handle continuous variables that comprehensively evaluates survival association with derived aggregate statistics. This is accomplished by exhaustively considering all the cutpoints across the full expression gradient. Direct side-by-side comparisons, global ROC analysis, and evaluation of the ability to capture relevant biological themes based on current understanding of RAS biology all demonstrated that the new method shows better consistency between multiple datasets of the same disease, better reproducibility and robustness, and better detection power to uncover biological relevance within the selected datasets over the available survival analysis methods on univariate gene expression and penalized linear model-based methods.
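The cutpoint scan described in this abstract can be illustrated with a minimal sketch. It is not the authors' GradientScanSurv implementation: it only computes a two-group log-rank statistic at every admissible cutpoint of one expression vector, and it omits the aggregate statistics the abstract mentions. The function names `logrank_chi2` and `scan_cutpoints`, the `min_group` parameter, and the toy data are assumptions made for illustration. Taking the maximum statistic over many cutpoints inflates significance, so the scan output should not be read as an ordinary p-value.

```python
import numpy as np

def logrank_chi2(time, event, group):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    obs_minus_exp = 0.0
    var = 0.0
    for t in np.unique(time[event]):           # distinct event times
        at_risk = time >= t
        n = at_risk.sum()                      # subjects at risk just before t
        n1 = (at_risk & group).sum()           # of those, members of the flagged group
        d = (event & (time == t)).sum()        # events at t
        d1 = (event & (time == t) & group).sum()
        obs_minus_exp += d1 - d * n1 / n       # observed minus expected events
        if n > 1:                              # hypergeometric variance term
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

def scan_cutpoints(expr, time, event, min_group=2):
    """Log-rank statistic for the split expr > c at every admissible cutpoint c."""
    expr = np.asarray(expr, dtype=float)
    results = []
    for c in np.unique(expr)[:-1]:             # the largest value would leave an empty group
        high = expr > c
        if min(high.sum(), (~high).sum()) < min_group:
            continue                           # skip splits with a tiny group
        results.append((float(c), float(logrank_chi2(time, event, high))))
    return results

# Hypothetical toy data: higher expression goes with shorter survival.
expr = [1, 2, 3, 4, 5, 6, 7, 8]
time = [10, 9, 8, 7, 3, 2.5, 2, 1]
event = [1, 1, 1, 1, 1, 1, 1, 1]
results = scan_cutpoints(expr, time, event)
best_cut, best_stat = max(results, key=lambda r: r[1])
```

Here `results` holds one `(cutpoint, statistic)` pair per admissible split, so the full profile across the expression gradient can be inspected rather than only the single best cut.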
Affiliation(s)
- Ming Yi
- NCI RAS Initiative, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, MD, United States of America
| | - Ruoqing Zhu
- Department of Statistics, University of Illinois Urbana-Champaign, Champaign, IL, United States of America
| | - Robert M. Stephens
- NCI RAS Initiative, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, MD, United States of America
|
31
|
Bazzoli C, Lambert-Lacroix S. Classification based on extensions of LS-PLS using logistic regression: application to clinical and multiple genomic data. BMC Bioinformatics 2018; 19:314. [PMID: 30189832 PMCID: PMC6127926 DOI: 10.1186/s12859-018-2311-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2018] [Accepted: 08/13/2018] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND To address high-dimensional genomic data, most of the proposed prediction methods make use of genomic data alone without considering clinical data, which are often available and known to have predictive value. Recent studies suggest that combining clinical and genomic information may improve predictions. We consider here methods for classification purposes that simultaneously use both types of variables but apply dimensionality reduction only to the high-dimensional genomic ones. RESULTS Using partial least squares (PLS), we propose some one-step approaches based on three extensions of the least squares (LS)-PLS method for logistic regression. A comparison of their prediction performances via a simulation and on real data sets from cancer studies is conducted. CONCLUSION In general, those methods using only clinical data or only genomic data perform poorly. The advantage of using LS-PLS methods for classification and their performances are shown and then used to analyze clinical and genomic data. The corresponding prediction results are encouraging and stable regardless of the data set and/or number of selected features. These extensions have been implemented in the R package lsplsGlm to enhance their use.
Affiliation(s)
- Caroline Bazzoli
- Laboratoire Jean Kuntzman, Univ. Grenoble-Alpes, 700 avenue centrale, Saint Martin d’Hères, 38401 France
|
32
|
Ow GS, Tang Z, Kuznetsov VA. Big data and computational biology strategy for personalized prognosis. Oncotarget 2018; 7:40200-40220. [PMID: 27229533 PMCID: PMC5130003 DOI: 10.18632/oncotarget.9571] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2015] [Accepted: 05/01/2016] [Indexed: 01/05/2023] Open
Abstract
The era of big data and precision medicine has led to accumulation of massive datasets of gene expression data and clinical information of patients. For a new patient, we propose that identification of a highly similar reference patient from an existing patient database via similarity matching of both clinical and expression data could be useful for predicting the prognostic risk or therapeutic efficacy. Here, we propose a novel methodology to predict disease/treatment outcome via analysis of the similarity between any pair of patients who are each characterized by a certain set of pre-defined biological variables (biomarkers or clinical features) represented initially as a prognostic binary variable vector (PBVV) and subsequently transformed to a prognostic signature vector (PSV). Our analyses revealed that the Euclidean distance, rather than the correlation distance measure, was effective in defining an unbiased similarity measure calculated between two PSVs. We applied our methods to high-grade serous ovarian cancer (HGSC) based on a 36-mRNA predictor that was previously shown to stratify patients into 3 distinct prognostic subgroups. We found that patient's age, when converted into a binary variable, was positively correlated with the overall risk of succumbing to the disease. When applied to an independent testing dataset, the inclusion of age into the molecular predictor provided more robust personalized prognosis of overall survival correlated with the therapeutic response of HGSC and provided benefit for treatment targeting of the tumors in HGSC patients. Finally, our method can be generalized and implemented in many other diseases to accurately predict personalized patients’ outcomes.
Affiliation(s)
- Vladimir A Kuznetsov
- Bioinformatics Institute, Singapore 138671; School of Computer Engineering, Nanyang Technological University, Singapore 639798
|
33
|
Covell DG. A data mining approach for identifying pathway-gene biomarkers for predicting clinical outcome: A case study of erlotinib and sorafenib. PLoS One 2017; 12:e0181991. [PMID: 28792525 PMCID: PMC5549706 DOI: 10.1371/journal.pone.0181991] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2016] [Accepted: 07/10/2017] [Indexed: 12/28/2022] Open
Abstract
A novel data mining procedure is proposed for identifying potential pathway-gene biomarkers from preclinical drug sensitivity data for predicting clinical responses to erlotinib or sorafenib. The analysis applies linear ridge regression modeling to generate a small (N~1000) set of baseline gene expressions that jointly yield quality predictions of preclinical drug sensitivity data and clinical responses. Standard clustering of the pathway-gene combinations from gene set enrichment analysis of this initial gene set, according to their shared appearance in molecular function pathways, yields a reduced (N~300) set of potential pathway-gene biomarkers. A modified method for quantifying pathway fitness is used to determine smaller numbers of over and under expressed genes that correspond with favorable and unfavorable clinical responses. Detailed literature-based evidence is provided in support of the roles of these under and over expressed genes in compound efficacy. RandomForest analysis of potential pathway-gene biomarkers finds average treatment prediction errors of 10% and 22%, respectively, for patients receiving erlotinib or sorafenib that had a favorable clinical response. Higher errors were found for both compounds when predicting an unfavorable clinical response. Collectively these results suggest complementary roles for biomarker genes and biomarker pathways when predicting clinical responses from preclinical data.
Affiliation(s)
- David G. Covell
- Information Technology Branch, Developmental Therapeutics Program, National Cancer Institute, Frederick, MD, United States of America
|
34
|
Rahman MS, Sultana M. Performance of Firth-and logF-type penalized methods in risk prediction for small or sparse binary data. BMC Med Res Methodol 2017; 17:33. [PMID: 28231767 PMCID: PMC5324225 DOI: 10.1186/s12874-017-0313-9] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2016] [Accepted: 02/16/2017] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND When developing risk models for binary data with small or sparse data sets, standard maximum likelihood estimation (MLE) based logistic regression faces several problems, including biased or infinite estimates of the regression coefficients and frequent convergence failure of the likelihood due to separation. Separation occurs commonly even when the sample size is large if there is a sufficient number of strong predictors. In the presence of separation, even if a model can be fitted, it is overfitted and has poor predictive performance. Firth- and logF-type penalized regression methods are popular alternatives to MLE, particularly for solving the separation problem. Despite their attractive advantages, their use in risk prediction is very limited. This paper evaluated these methods in risk prediction in comparison with MLE and other commonly used penalized methods such as ridge. METHODS The predictive performance of the methods was evaluated by assessing calibration, discrimination and overall predictive performance in an extensive simulation study. The methods were further illustrated using a real data example with a low prevalence of the outcome. RESULTS The MLE showed poor performance in risk prediction in small or sparse data sets. All penalized methods offered some improvements in calibration, discrimination and overall predictive performance. Although the Firth- and logF-type methods showed almost equal improvement, Firth-type penalization produced some bias in the average predicted probability, and this bias was even larger than that produced by MLE. Of the logF(1,1) and logF(2,2) penalizations, logF(2,2) introduced slight bias in the estimated regression coefficient of a binary predictor, and logF(1,1) performed better in all aspects. Similarly, ridge performed well in discrimination and overall predictive performance, but it often produced an underfitted model and had a high rate of convergence failure (even higher than that of MLE), probably due to the separation problem. CONCLUSIONS The logF-type penalized method, particularly logF(1,1), could be used in practice when developing risk models for small or sparse data sets.
Affiliation(s)
- M Shafiqur Rahman
- Institute of Statistical Research and Training, University of Dhaka, Dhaka, Bangladesh.
| | - Mahbuba Sultana
- Institute of Statistical Research and Training, University of Dhaka, Dhaka, Bangladesh
|
35
|
Seifert M, Friedrich B, Beyer A. Importance of rare gene copy number alterations for personalized tumor characterization and survival analysis. Genome Biol 2016; 17:204. [PMID: 27716417 PMCID: PMC5046221 DOI: 10.1186/s13059-016-1058-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2016] [Accepted: 09/06/2016] [Indexed: 12/24/2022] Open
Abstract
It has proven exceedingly difficult to ascertain rare copy number alterations (CNAs) that may have strong effects in individual tumors. We show that a regulatory network inferred from gene expression and gene copy number data of 768 human cancer cell lines can be used to quantify the impact of patient-specific CNAs on survival signature genes. A focused analysis of tumors from six tissues reveals that rare patient-specific gene CNAs often have stronger effects on signature genes than frequent gene CNAs. Further comparison to a related network-based approach shows that the integration of indirectly acting gene CNAs significantly improves the survival analysis.
Affiliation(s)
- Michael Seifert
- Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Institute for Medical Informatics and Biometry, Fetscherstr. 74, Dresden, 01307, Germany; National Center for Tumor Diseases (NCT), Dresden, Germany; Cellular Networks and Systems Biology, CECAD, University of Cologne, Joseph-Stelzmann-Str. 26, Cologne, 50931, Germany.
| | - Betty Friedrich
- Institute of Molecular Systems Biology, Auguste-Piccard-Hof 1, Zurich, 8093, Switzerland
| | - Andreas Beyer
- Cellular Networks and Systems Biology, CECAD, University of Cologne, Joseph-Stelzmann-Str. 26, Cologne, 50931, Germany
|
36
|
Pölsterl S, Conjeti S, Navab N, Katouzian A. Survival analysis for high-dimensional, heterogeneous medical data: Exploring feature extraction as an alternative to feature selection. Artif Intell Med 2016; 72:1-11. [PMID: 27664504 DOI: 10.1016/j.artmed.2016.07.004] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2016] [Revised: 06/15/2016] [Accepted: 07/25/2016] [Indexed: 10/21/2022]
Abstract
BACKGROUND In clinical research, the primary interest is often the time until occurrence of an adverse event, i.e., survival analysis. Its application to electronic health records is challenging for two main reasons: (1) patient records are comprised of high-dimensional feature vectors, and (2) feature vectors are a mix of categorical and real-valued features, which implies varying statistical properties among features. To learn from high-dimensional data, researchers can choose from a wide range of methods in the fields of feature selection and feature extraction. Whereas feature selection is well studied, little work focused on utilizing feature extraction techniques for survival analysis. RESULTS We investigate how well feature extraction methods can deal with features having varying statistical properties. In particular, we consider multiview spectral embedding algorithms, which specifically have been developed for these situations. We propose to use random survival forests to accurately determine local neighborhood relations from right censored survival data. We evaluated 10 combinations of feature extraction methods and 6 survival models with and without intrinsic feature selection in the context of survival analysis on 3 clinical datasets. Our results demonstrate that for small sample sizes - less than 500 patients - models with built-in feature selection (Cox model with ℓ1 penalty, random survival forest, and gradient boosted models) outperform feature extraction methods by a median margin of 6.3% in concordance index (inter-quartile range: [-1.2%;14.6%]). CONCLUSIONS If the number of samples is insufficient, feature extraction methods are unable to reliably identify the underlying manifold, which makes them of limited use in these situations. For large sample sizes - in our experiments, 2500 samples or more - feature extraction methods perform as well as feature selection methods.
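The concordance index used as the comparison metric in this abstract can be sketched as follows. This is a minimal illustration of Harrell's concordance index for right-censored data, not the evaluation code used in the study. The function name `concordance_index` and the toy data are assumptions, and tied survival times are simply treated as non-comparable, which is a simplification.

```python
def concordance_index(time, event, risk):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk score.

    A pair (i, j) is comparable when subject i has an observed event and
    time[i] < time[j]; it is concordant when the subject who fails earlier
    has the higher predicted risk. Ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                      # censored subjects cannot anchor a pair
        for j in range(n):
            if time[i] < time[j]:         # i failed while j was still under observation
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical toy data: the third subject is censored.
time = [1, 2, 3, 4]
event = [1, 1, 0, 1]
c_perfect = concordance_index(time, event, [4, 3, 2, 1])   # risk falls as survival rises
c_reversed = concordance_index(time, event, [1, 2, 3, 4])
c_constant = concordance_index(time, event, [0, 0, 0, 0])
```

A value of 0.5 corresponds to a non-informative score and 1.0 to a perfect ranking, which is the scale on which the reported median margin of 6.3% is expressed.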
Affiliation(s)
- Sebastian Pölsterl
- Computer Aided Medical Procedures, Technische Universität München, Boltzmannstraße 3, 85748 Garching bei München, Germany.
| | - Sailesh Conjeti
- Computer Aided Medical Procedures, Technische Universität München, Boltzmannstraße 3, 85748 Garching bei München, Germany.
| | - Nassir Navab
- Computer Aided Medical Procedures, Technische Universität München, Boltzmannstraße 3, 85748 Garching bei München, Germany; Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA.
| | - Amin Katouzian
- IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA.
|
37
|
Dama E, Melocchi V, Dezi F, Pirroni S, Carletti RM, Brambilla D, Bertalot G, Casiraghi M, Maisonneuve P, Barberis M, Viale G, Vecchi M, Spaggiari L, Bianchi F, Di Fiore PP. An Aggressive Subtype of Stage I Lung Adenocarcinoma with Molecular and Prognostic Characteristics Typical of Advanced Lung Cancers. Clin Cancer Res 2016; 23:62-72. [DOI: 10.1158/1078-0432.ccr-15-3005] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2015] [Revised: 05/11/2016] [Accepted: 05/31/2016] [Indexed: 11/16/2022]
|
38
|
Hieke S, Benner A, Schlenk RF, Schumacher M, Bullinger L, Binder H. Identifying Prognostic SNPs in Clinical Cohorts: Complementing Univariate Analyses by Resampling and Multivariable Modeling. PLoS One 2016; 11:e0155226. [PMID: 27159447 PMCID: PMC4861340 DOI: 10.1371/journal.pone.0155226] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2015] [Accepted: 04/26/2016] [Indexed: 11/18/2022] Open
Abstract
Clinical cohorts with time-to-event endpoints are increasingly characterized by measurements of a number of single nucleotide polymorphisms that is an order of magnitude larger than the number of measurements typically considered at the gene level. At the same time, the size of clinical cohorts often is still limited, calling for novel analysis strategies for identifying potentially prognostic SNPs that can help to better characterize disease processes. We propose such a strategy, drawing on univariate testing ideas from epidemiological case-control studies on the one hand, and multivariable regression techniques as developed for gene expression data on the other hand. In particular, we focus on stable selection of a small set of SNPs and corresponding genes for subsequent validation. For univariate analysis, a permutation-based approach is proposed to test at the gene level. We use regularized multivariable regression models for considering all SNPs simultaneously and selecting a small set of potentially important prognostic SNPs. Stability is judged according to resampling inclusion frequencies for both the univariate and the multivariable approach. The overall strategy is illustrated with data from a cohort of acute myeloid leukemia patients and explored in a simulation study. The multivariable approach is seen to automatically focus on a smaller set of SNPs compared to the univariate approach, roughly in line with blocks of correlated SNPs. This more targeted extraction of SNPs results in more stable selection at the SNP as well as at the gene level. Thus, the multivariable regression approach with resampling provides a perspective in the proposed analysis strategy for SNP data in clinical cohorts highlighting what can be added by regularized regression techniques compared to univariate analyses.
Affiliation(s)
- Stefanie Hieke
- Institute for Medical Biometry and Statistics, Medical Center- University Freiburg, Freiburg, Germany
- Freiburg Center for Data Analysis and Modeling, University Freiburg, Freiburg, Germany
| | - Axel Benner
- Division of Biostatistics, German Cancer Research Center, Heidelberg, Germany
| | - Richard F. Schlenk
- Department of Internal Medicine III, University Hospital of Ulm, Ulm, Germany
| | - Martin Schumacher
- Institute for Medical Biometry and Statistics, Medical Center- University Freiburg, Freiburg, Germany
| | - Lars Bullinger
- Department of Internal Medicine III, University Hospital of Ulm, Ulm, Germany
| | - Harald Binder
- Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Center Johannes Gutenberg University Mainz, Mainz, Germany
|
39
|
Network regularised Cox regression and multiplex network models to predict disease comorbidities and survival of cancer. Comput Biol Chem 2015; 59 Pt B:15-31. [DOI: 10.1016/j.compbiolchem.2015.08.010] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2015] [Revised: 08/21/2015] [Accepted: 08/25/2015] [Indexed: 12/17/2022]
|
40
|
Denis M, Tadesse MG. Evaluation of hierarchical models for integrative genomic analyses. Bioinformatics 2015; 32:738-46. [PMID: 26545823 DOI: 10.1093/bioinformatics/btv653] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2015] [Accepted: 11/03/2015] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION Advances in high-throughput technologies have led to the acquisition of various types of -omic data on the same biological samples. Each data type gives independent and complementary information that can explain the biological mechanisms of interest. While several studies performing independent analyses of each dataset have led to significant results, a better understanding of complex biological mechanisms requires an integrative analysis of different sources of data. RESULTS Flexible modeling approaches, based on penalized likelihood methods and expectation-maximization (EM) algorithms, are studied and tested under various biological relationship scenarios between the different molecular features and their effects on a clinical outcome. The models are applied to genomic datasets from two cancer types in the Cancer Genome Atlas project: glioblastoma multiforme and ovarian serous cystadenocarcinoma. The integrative models lead to improved model fit and predictive performance. They also provide a better understanding of the biological mechanisms underlying patients' survival. AVAILABILITY AND IMPLEMENTATION Source code implementing the integrative models is freely available at https://github.com/mgt000/IntegrativeAnalysis along with example datasets and sample R script applying the models to these data. The TCGA datasets used for analysis are publicly available at https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp CONTACT marie.denis@cirad.fr or mgt26@georgetown.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Marie Denis
- UMR AGAP, CIRAD, Montpellier, France; Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA
- Mahlet G Tadesse
- Department of Mathematics and Statistics, Georgetown University, Washington, DC, USA
|
41
|
The Current and Future Use of Ridge Regression for Prediction in Quantitative Genetics. BioMed Research International 2015; 2015:143712. [PMID: 26273586 PMCID: PMC4529984 DOI: 10.1155/2015/143712] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Received: 11/28/2014] [Accepted: 12/24/2014] [Indexed: 01/05/2023]
Abstract
In recent years, there has been a considerable amount of research on the use of regularization methods for inference and prediction in quantitative genetics. Such research mostly focuses on selection of markers and shrinkage of their effects. In this review paper, the use of ridge regression for prediction in quantitative genetics using single-nucleotide polymorphism data is discussed. In particular, we consider (i) the theoretical foundations of ridge regression, (ii) its link to commonly used methods in animal breeding, (iii) the computational feasibility, and (iv) the scope for constructing prediction models with nonlinear effects (e.g., dominance and epistasis). Based on a simulation study we gauge the current and future potential of ridge regression for prediction of human traits using genome-wide SNP data. We conclude that, for outcomes with a relatively simple genetic architecture, given current sample sizes in most cohorts (i.e., N < 10,000) the predictive accuracy of ridge regression is slightly higher than the classical genome-wide association study approach of repeated simple regression (i.e., one regression per SNP). However, both capture only a small proportion of the heritability. Nevertheless, we find evidence that for large-scale initiatives, such as biobanks, sample sizes can be achieved where ridge regression compared to the classical approach improves predictive accuracy substantially.
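The contrast this abstract draws between ridge regression and the one-regression-per-SNP approach can be illustrated with a minimal pure-Python sketch. The toy genotype matrix, the penalty value and all function names below are assumptions made for illustration; none of them come from the review.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge(X, y, lam):
    """Ridge estimate without intercept: solve (X'X + lam*I) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve_linear(xtx, xty)


def single_marker(X, y):
    """One simple regression per column, ignoring all other columns."""
    p = len(X[0])
    return [sum(r[j] * yi for r, yi in zip(X, y)) / sum(r[j] ** 2 for r in X)
            for j in range(p)]


# Hypothetical centred genotype codes: 4 individuals, 2 correlated SNPs.
# Only the first SNP has a true effect (beta = [1, 0]).
X = [[1.0, 1.0], [-1.0, -1.0], [1.0, 0.0], [-1.0, 0.0]]
y = [1.0, -1.0, 1.0, -1.0]

beta_marginal = single_marker(X, y)   # each SNP fitted on its own
beta_ols = ridge(X, y, 0.0)           # lam = 0 reduces to least squares
beta_ridge = ridge(X, y, 1.0)         # joint fit with shrinkage
print(beta_marginal, beta_ols, beta_ridge)
```

In this toy setting the marginal regressions assign the full effect to both SNPs because they are correlated, while the joint ridge fit attributes most of the effect to the first SNP and shrinks both estimates towards zero.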
|
42
|
Enhancing the Lasso Approach for Developing a Survival Prediction Model Based on Gene Expression Data. Computational and Mathematical Methods in Medicine 2015; 2015:259474. [PMID: 26146513 PMCID: PMC4469838 DOI: 10.1155/2015/259474] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 09/18/2014] [Accepted: 12/22/2014] [Indexed: 11/18/2022]
Abstract
In the past decade, researchers in oncology have sought to develop survival prediction models using gene expression data. The least absolute shrinkage and selection operator (lasso) has been widely used to select genes that are truly correlated with a patient's survival. The lasso selects genes for prediction by shrinking a large number of coefficients of the candidate genes towards zero based on a tuning parameter that is often determined by cross-validation (CV). However, this method can pass over (or fail to identify) true positive genes (i.e., it produces false negatives) in certain instances, because the lasso tends to favor the development of a simple prediction model. Here, we attempt to monitor the identification of false negatives by developing a method, which assumes a mixture distribution for the lasso estimates, for estimating the number of true positive (TP) genes over a series of values of the tuning parameter. Using our developed method, we performed a simulation study to examine its precision in estimating the number of TP genes. Additionally, we applied our method to a real gene expression dataset and found that it was able to identify genes correlated with survival that a CV method was unable to detect.
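The shrinkage mechanism described above, and the way a larger tuning parameter can drop weak but real signals, can be sketched with the soft-thresholding operator. This is the textbook lasso solution under an orthonormal design with a squared-error loss, not the authors' mixture-distribution method, and the coefficient values are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding: shrink z towards zero by lam, clipping at zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(ols_coefs, lam):
    """Lasso estimates when the design is orthonormal (a simplifying assumption):
    each least-squares coefficient is soft-thresholded independently."""
    return [soft_threshold(b, lam) for b in ols_coefs]


# Hypothetical least-squares coefficients for four candidate genes.
ols = [2.0, 0.6, -0.3, 0.05]

selected_by_lambda = {}
for lam in (0.1, 0.5, 1.0):
    est = lasso_orthonormal(ols, lam)
    selected_by_lambda[lam] = [j for j, b in enumerate(est) if b != 0.0]
    print(lam, est, selected_by_lambda[lam])
```

As the tuning parameter grows, genes with small but nonzero effects leave the selected set, which is the false-negative behaviour the abstract sets out to monitor.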
|
43
|
Zemmour C, Bertucci F, Finetti P, Chetrit B, Birnbaum D, Filleron T, Boher JM. Prediction of early breast cancer metastasis from DNA microarray data using high-dimensional cox regression models. Cancer Inform 2015; 14:129-38. [PMID: 25983547 PMCID: PMC4426954 DOI: 10.4137/cin.s17284] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 12/03/2014] [Revised: 03/17/2015] [Accepted: 03/20/2015] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND DNA microarray studies identified gene expression signatures predictive of metastatic relapse in early breast cancer. Standard feature selection procedures applied to reduce the set of predictive genes did not take into account the correlation between genes. In this paper, we studied the performances of three high-dimensional regression methods – CoxBoost, LASSO (Least Absolute Shrinkage and Selection Operator), and Elastic net – to identify prognostic signatures in patients with early breast cancer. METHODS We analyzed three public retrospective datasets, including a total of 384 patients with axillary lymph node-negative breast cancer. The Amsterdam van’t Veer’s training set of 78 patients was used to determine the optimal gene sets and classifiers using sensitivity thresholds resulting in mis-classification of no more than 10% of the poor-prognosis group. To ensure the comparability between different methods, an automatic selection procedure was used to determine the number of genes included in each model. The van de Vijver’s and Desmedt’s datasets were used as validation sets to evaluate separately the prognostic performances of our classifiers. The results were compared to the original Amsterdam 70-gene classifier. RESULTS The automatic selection procedure reduced the number of predictive genes up to a minimum of six genes. In the two validation sets, the three models (Elastic net, LASSO, and CoxBoost) led to the definition of genomic classifiers predicting the 5-year metastatic status with similar performances, with respective 59, 56, and 54% accuracy, 83, 75, and 83% sensitivity, and 53, 52, and 48% specificity in the Desmedt’s dataset. In comparison, the Amsterdam 70-gene signature showed 45% accuracy, 97% sensitivity, and 34% specificity. The gene overlap and the classification concordance between the three classifiers were high. 
All the classifiers added significant prognostic information to that provided by the traditional prognostic factors and showed a very high overlap with respect to gene ontologies (GOs) associated with genes overexpressed in the predicted poor-prognosis vs. good-prognosis classes and centred on cell proliferation. Interestingly, all classifiers reported high sensitivity to predict the 4-year status of metastatic disease. CONCLUSIONS High-dimensional regression methods are attractive in prognostic studies because finding a small subset of genes may facilitate the transfer to the clinic, and also because they strengthen the robustness of the model by limiting the selection of false-positive predictive genes. With only six genes, the CoxBoost classifier predicted the 4-year status of metastatic disease with 93% sensitivity. Selecting a few genes related to ontologies other than cell proliferation might further improve the overall sensitivity performance.
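The accuracy, sensitivity and specificity figures quoted in this abstract follow from a standard confusion-matrix calculation, sketched below. The labels are invented for illustration and are not the study's data; the function name is likewise an assumption, and the sketch assumes both classes are present.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # share of true events detected
        "specificity": tn / (tn + fp),   # share of non-events correctly cleared
    }


# Hypothetical 5-year metastasis status (1 = metastasis) and classifier calls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

A classifier tuned to misclassify few poor-prognosis patients, as in the sensitivity thresholds described in the abstract, will typically show this pattern of high sensitivity paired with modest specificity.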
Affiliation(s)
- Christophe Zemmour
- Département de la Recherche Clinique et de l'Innovation, Unité de Biostatistique et de Méthodologie, Institut Paoli-Calmettes, Marseille, France
- François Bertucci
- Département d'Oncologie Moléculaire, Institut Paoli-Calmettes, Centre de Recherche en Cancérologie de Marseille, INSERM, CNRS, Marseille, France; Département d'Oncologie Médicale, Institut Paoli-Calmettes, Centre de Recherche en Cancérologie de Marseille, INSERM, CNRS, Marseille, France
- Pascal Finetti
- Département d'Oncologie Moléculaire, Institut Paoli-Calmettes, Centre de Recherche en Cancérologie de Marseille, INSERM, CNRS, Marseille, France
- Bernard Chetrit
- Centre de Recherche en Cancérologie de Marseille, INSERM, CNRS, Marseille, France
- Daniel Birnbaum
- Département d'Oncologie Moléculaire, Institut Paoli-Calmettes, Centre de Recherche en Cancérologie de Marseille, INSERM, CNRS, Marseille, France
- Thomas Filleron
- Bureau des Essais Cliniques, Cellule Biostatistique, Institut Claudius Regaud, Institut Universitaire du Cancer Toulouse Oncopôle, Toulouse, France
- Jean-Marie Boher
- Département de la Recherche Clinique et de l'Innovation, Unité de Biostatistique et de Méthodologie, Institut Paoli-Calmettes, Marseille, France
|
44
|
Gai LP, Liu H, Cui JH, Ji N, Ding XD, Sun C, Yu LS. Distributions of allele combination in single and cross loci among patients with several kinds of chronic diseases and the normal population. Genomics 2015; 105:168-74. [PMID: 25561352 DOI: 10.1016/j.ygeno.2014.12.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 06/18/2014] [Revised: 12/05/2014] [Accepted: 12/27/2014] [Indexed: 11/27/2022]
Abstract
Genetic research has progressed along with scientific and technological developments. However, it is difficult to identify frequency differences in a particular allele distribution at a single locus. Such differences can be identified by examining the allele combination distribution. We explored different mathematical methods for statistical analyses to assess the association between the genotype and phenotype. We investigated the frequency distributions of alleles, combinations of single-locus genes, and combinations of cross-loci genes at 15 loci using 447 blood samples from 200 normal subjects, 72 patients with chronic obstructive pulmonary resistance, 50 with liver cancer, 75 with stomach cancer and 50 with hematencephalon. We found that each population had a unique gene distribution and that the distribution followed certain rules. The probability of illness followed different rules and had apparent specificity. Differences obtained using statistics of combinations of cross-loci genes are superior to single-locus gene statistics, and combinations of single-locus gene statistics are better than allelic statistics.
Affiliation(s)
- Li-ping Gai
- Department of Physics, Dalian Medical University, Dalian 116044, China
- Hui Liu
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
- Jing-hui Cui
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
- Na Ji
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
- Xiao-dong Ding
- Department of Physics, Dalian Medical University, Dalian 116044, China
- Cui Sun
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
- Lai-shui Yu
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
|
45
|
Predicting the phenotypic values of physiological traits using SNP genotype and gene expression data in mice. PLoS One 2014; 9:e115532. [PMID: 25541966 PMCID: PMC4277360 DOI: 10.1371/journal.pone.0115532] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 06/09/2014] [Accepted: 11/25/2014] [Indexed: 01/22/2023] Open
Abstract
Predicting phenotypes using genome-wide genetic variation and gene expression data is useful in several fields, such as human biology and medicine, as well as in crop and livestock breeding. However, for phenotype prediction using gene expression data for mammals, studies remain scarce, as the available data on gene expression profiling are currently limited. By integrating a few sources of relevant data that are available in mice, this study investigated the accuracy of phenotype prediction for several physiological traits. Gene expression data from two tissues as well as single nucleotide polymorphisms (SNPs) were used. For the studied traits, the variance of the effects of the expression levels was more likely to differ among the genes than were the effects of SNPs. For the glucose concentration, the total cholesterol amount, and the total tidal volume, the accuracy by cross validation tended to be higher when the gene expression data rather than the SNP genotype data were used, and a statistically significant increase in the accuracy was obtained when the gene expression data from the liver were used alone or jointly with the SNP genotype data. For these traits, there were no additional gains in accuracy from using the gene expression data of both the liver and lung compared to that of individual use. The accuracy of prediction using genes that were selected differently was examined; the use of genes with a higher tissue specificity tended to result in an accuracy that was similar to or greater than that associated with the use of all of the available genes for traits such as the glucose concentration and total cholesterol amount. Although relatively few animals were evaluated, the current results suggest that gene expression levels could be used as explanatory variables. However, further studies are essential to confirm our findings using additional animal samples.
|
46
|
Supervised wavelet method to predict patient survival from gene expression data. ScientificWorldJournal 2014; 2014:618412. [PMID: 25538955 PMCID: PMC4235600 DOI: 10.1155/2014/618412] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/28/2014] [Accepted: 10/03/2014] [Indexed: 11/18/2022] Open
Abstract
In microarray studies, the number of samples is relatively small compared to the number of genes per sample. An important aspect of microarray studies is the prediction of patient survival based on their gene expression profile. This naturally calls for the use of a dimension reduction procedure together with the survival prediction model. In this study, a new method based on combining wavelet approximation coefficients and Cox regression was presented. The proposed method was compared with supervised principal component and supervised partial least squares methods. The different fitted Cox models based on supervised wavelet approximation coefficients, the top number of supervised principal components, and partial least squares components were applied to the data. The results showed that the prediction performance of the Cox model based on supervised wavelet feature extraction was superior to the supervised principal components and partial least squares components. The results suggested the possibility of developing new tools based on wavelets for the dimensionality reduction of microarray data sets in the context of survival analysis.
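To show how wavelet approximation coefficients reduce dimension, the sketch below applies a Haar transform and keeps only the approximation part at each level, halving the number of features each time. The choice of the Haar wavelet, the toy expression vector and the function name are assumptions; the abstract does not state which wavelet family the authors used.

```python
import math


def haar_approximation(signal, levels=1):
    """Return Haar approximation coefficients after `levels` decomposition steps.

    Each step replaces neighbouring pairs (a, b) by (a + b) / sqrt(2),
    so the feature count is halved per level.
    """
    coeffs = list(signal)
    for _ in range(levels):
        if len(coeffs) % 2 != 0:
            raise ValueError("length must be divisible by 2 at every level")
        coeffs = [(coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
                  for i in range(0, len(coeffs), 2)]
    return coeffs


# Hypothetical expression profile of 8 genes for one patient.
profile = [1.0, 1.0, 1.0, 1.0, 3.0, 5.0, 2.0, 2.0]

level1 = haar_approximation(profile, levels=1)   # 4 features
level2 = haar_approximation(profile, levels=2)   # 2 features
print(level1, level2)
```

The reduced feature vectors would then serve as covariates in a Cox model; the supervised step that preselects genes before the transform is not shown here.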
|
47
|
Bastien P, Bertrand F, Meyer N, Maumy-Bertrand M. Deviance residuals-based sparse PLS and sparse kernel PLS regression for censored data. Bioinformatics 2014; 31:397-404. [PMID: 25286920 DOI: 10.1093/bioinformatics/btu660] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION A vast literature from the past decade is devoted to relating gene profiles and subject survival or time to cancer recurrence. Biomarker discovery from high-dimensional data, such as transcriptomic or single nucleotide polymorphism profiles, is a major challenge in the search for more precise diagnoses. The proportional hazard regression model suggested by Cox (1972), to study the relationship between the time to event and a set of covariates in the presence of censoring is the most commonly used model for the analysis of survival data. However, like multivariate regression, it supposes that more observations than variables, complete data, and not strongly correlated variables are available. In practice, when dealing with high-dimensional data, these constraints are crippling. Collinearity gives rise to issues of over-fitting and model misidentification. Variable selection can improve the estimation accuracy by effectively identifying the subset of relevant predictors and enhance the model interpretability with parsimonious representation. To deal with both collinearity and variable selection issues, many methods based on least absolute shrinkage and selection operator penalized Cox proportional hazards have been proposed since the reference paper of Tibshirani. Regularization could also be performed using dimension reduction as is the case with partial least squares (PLS) regression. We propose two original algorithms named sPLSDR and its non-linear kernel counterpart DKsPLSDR, by using sparse PLS regression (sPLS) based on deviance residuals. We compared their predicting performance with state-of-the-art algorithms on both simulated and real reference benchmark datasets. RESULTS sPLSDR and DKsPLSDR compare favorably with other methods in their computational time, prediction and selectivity, as indicated by results based on benchmark datasets. 
Moreover, in the framework of PLS regression, they feature other useful tools, including biplots representation, or the ability to deal with missing data. Therefore, we view them as a useful addition to the toolbox of estimation and prediction methods for the widely used Cox's model in the high-dimensional and low-sample size settings. AVAILABILITY AND IMPLEMENTATION The R-package plsRcox is available on the CRAN and is maintained by Frédéric Bertrand. http://cran.r-project.org/web/packages/plsRcox/index.html. CONTACT pbastien@rd.loreal.com or fbertran@math.unistra.fr. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
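The deviance residuals on which sPLSDR and DKsPLSDR are built can be computed from martingale residuals with the standard transformation, sketched below. The cumulative hazard values are hypothetical inputs; how they are estimated (for example, from a Cox model fitted without covariates) is an assumption left outside this sketch, and the function names are illustrative only.

```python
import math


def martingale_residual(event, cum_hazard):
    """Martingale residual: observed event indicator minus estimated cumulative hazard."""
    return event - cum_hazard


def deviance_residual(event, cum_hazard):
    """Deviance residual d = sign(M) * sqrt(-2 * (M + event * log(event - M))).

    `event` is 1 for an observed event and 0 for a censored observation.
    """
    if event and cum_hazard <= 0.0:
        raise ValueError("cumulative hazard must be positive for an observed event")
    m = martingale_residual(event, cum_hazard)
    inner = m + (math.log(event - m) if event else 0.0)
    sign = (m > 0) - (m < 0)
    return sign * math.sqrt(max(-2.0 * inner, 0.0))


# Hypothetical (event indicator, estimated cumulative hazard) pairs.
subjects = [(1, 0.2), (1, 1.0), (1, 2.0), (0, 0.5)]
residuals = [deviance_residual(e, h) for e, h in subjects]
print(residuals)
```

Unlike martingale residuals, which are heavily skewed, deviance residuals are roughly symmetric around zero, which makes them a convenient continuous response for a regression method such as sparse PLS.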
Affiliation(s)
- Philippe Bastien
- L'Oréal Recherche & Innovation, 93601 Aulnay-sous-Bois, IRMA, CNRS UMR 7501, Labex IRMIA, Université de Strasbourg, 67084 Strasbourg Cedex, INSERM EA3430, Laboratoire de Biostatistique, Faculté de Médecine de Strasbourg, Labex IRMIA, Université de Strasbourg, 67085 Strasbourg Cedex, France
- Frédéric Bertrand
- L'Oréal Recherche & Innovation, 93601 Aulnay-sous-Bois, IRMA, CNRS UMR 7501, Labex IRMIA, Université de Strasbourg, 67084 Strasbourg Cedex, INSERM EA3430, Laboratoire de Biostatistique, Faculté de Médecine de Strasbourg, Labex IRMIA, Université de Strasbourg, 67085 Strasbourg Cedex, France
- Nicolas Meyer
- L'Oréal Recherche & Innovation, 93601 Aulnay-sous-Bois, IRMA, CNRS UMR 7501, Labex IRMIA, Université de Strasbourg, 67084 Strasbourg Cedex, INSERM EA3430, Laboratoire de Biostatistique, Faculté de Médecine de Strasbourg, Labex IRMIA, Université de Strasbourg, 67085 Strasbourg Cedex, France
- Myriam Maumy-Bertrand
- L'Oréal Recherche & Innovation, 93601 Aulnay-sous-Bois, IRMA, CNRS UMR 7501, Labex IRMIA, Université de Strasbourg, 67084 Strasbourg Cedex, INSERM EA3430, Laboratoire de Biostatistique, Faculté de Médecine de Strasbourg, Labex IRMIA, Université de Strasbourg, 67085 Strasbourg Cedex, France
|
48
|
Zhao X, Zhou X. Sufficient dimension reduction on the mean and rate functions of recurrent events. Stat Med 2014; 33:3693-709. [PMID: 24687612 DOI: 10.1002/sim.6160] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/03/2013] [Accepted: 03/10/2014] [Indexed: 11/08/2022]
Abstract
The counting process with a Cox-type intensity function has been extensively applied to analyze recurrent event data. This approach assumes that the underlying counting process is a time-transformed Poisson process and that the covariates have multiplicative or additive effects on the mean and rate functions of the counting process. The existing statistical inference, however, often encounters difficulties due to high-dimensional covariates, such as in gene expression and single nucleotide polymorphism data that have revolutionized our understanding of cancer recurrence and other diseases. In this paper, a technique of sufficient dimension reduction is applied to the mean and rate function for the number of occurrences of events over time. A two-step procedure is proposed to estimate the model components: first, a nonparametric estimator is proposed for the baseline, and then the basis of the central subspace and its dimension are estimated through a modified slicing inverse regression. On the basis of the estimated structural dimension and on the basis of the central subspace, we can estimate the regression function by using the local linear regression. A simulation is performed to confirm and assess the theoretical findings, and an application is demonstrated on a set of chronic granulomatous disease data.
Affiliation(s)
- Xiaobing Zhao
- School of Mathematics and Statistics, Zhejiang University of Finance and Economics, Hangzhou, Zhejiang Province, China
|
49
|
Sufficient dimension reduction on marginal regression for gaps of recurrent events. J Multivariate Anal 2014. [DOI: 10.1016/j.jmva.2014.01.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
50
|
Forward Stagewise Shrinkage and Addition for High Dimensional Censored Regression. Statistics in Biosciences 2014; 7:225-244. [PMID: 26904152 DOI: 10.1007/s12561-014-9114-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/25/2022]
Abstract
Despite enormous development in variable selection approaches in recent years, modeling and selection of high dimensional censored regression remains a challenging question. When the number of predictors p far exceeds the number of observational units n and the outcome is censored, computations of existing solutions often become difficult, or even infeasible in some situations, while performances frequently deteriorate. In this article, we aim at simultaneous model estimation and variable selection for Cox proportional hazards models with high dimensional covariates. We propose a forward stage-wise shrinkage and addition approach for that purpose. Our proposal extends a popular statistical learning technique, the boosting method. It inherits the flexible nature of boosting and is straightforward to extend to nonlinear Cox models. Meanwhile it advances the classical boosting method by adding explicit variable selection and substantially reducing the number of iterations needed for the algorithm to converge. Our intensive simulations have shown that the new method enjoys a competitive performance in Cox models in both p < n and p ≥ n scenarios. The new method was also illustrated with analysis of two real microarray survival datasets.
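The classical forward stagewise idea that this proposal extends can be sketched as follows. For brevity the sketch uses a squared-error loss on uncensored toy data in place of the Cox partial likelihood, and it omits the explicit shrinkage and addition steps of the authors' method; the data, step size and function name are all assumptions.

```python
def forward_stagewise(X, y, step=0.01, n_iter=500):
    """Incremental forward stagewise regression with squared-error loss.

    At each iteration, nudge the coefficient of the predictor most
    correlated with the current residual by a small fixed step.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_iter):
        corr = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-9:      # residual is orthogonal to every predictor
            break
        delta = step if corr[j] > 0 else -step
        beta[j] += delta
        for i in range(n):
            resid[i] -= delta * X[i][j]
    return beta


# Hypothetical design with three orthogonal predictors; only the first matters.
X = [[1.0, 1.0, 1.0], [1.0, -1.0, -1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]
y = [2.0, 2.0, -2.0, -2.0]

beta_early = forward_stagewise(X, y, n_iter=50)    # stopped early: shrunken fit
beta_full = forward_stagewise(X, y, n_iter=500)    # run until the residual vanishes
print(beta_early, beta_full)
```

Stopping early yields a shrunken and sparse coefficient vector, which is why the number of iterations acts as the tuning parameter in boosting-type procedures and why reducing it, as the abstract claims for the new method, matters computationally.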
|