1
Gliozzo J, Mesiti M, Notaro M, Petrini A, Patak A, Puertas-Gallardo A, Paccanaro A, Valentini G, Casiraghi E. Heterogeneous data integration methods for patient similarity networks. Brief Bioinform 2022; 23:6604996. [PMID: 35679533 PMCID: PMC9294435 DOI: 10.1093/bib/bbac207] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/07/2021] [Revised: 04/14/2022] [Accepted: 05/04/2022] [Indexed: 12/29/2022] Open
Abstract
Patient similarity networks (PSNs), where patients are represented as nodes and their similarities as weighted edges, are being increasingly used in clinical research. These networks provide an insightful summary of the relationships among patients and can be exploited by inductive or transductive learning algorithms for the prediction of patient outcome, phenotype and disease risk. PSNs can also be easily visualized, thus offering a natural way to inspect complex heterogeneous patient data and providing some level of explainability of the predictions obtained by machine learning algorithms. The advent of high-throughput technologies, enabling us to acquire high-dimensional views of the same patients (e.g. omics data, laboratory data, imaging data), calls for the development of data fusion techniques for PSNs in order to leverage this rich heterogeneous information. In this article, we review existing methods for integrating multiple biomedical data views to construct PSNs, together with the different patient similarity measures that have been proposed. We also review methods that have appeared in the machine learning literature but have not yet been applied to PSNs, thus providing a resource to navigate the vast machine learning literature existing on this topic. In particular, we focus on methods that could be used to integrate very heterogeneous datasets, including multi-omics data as well as data derived from clinical information and medical imaging.
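As a rough illustration of the idea described in this abstract, the sketch below builds one similarity network per data view and then fuses the views into a single PSN. Cosine similarity and the unweighted mean of edge weights are assumptions chosen for brevity, not methods taken from the review, and the toy values and the names `similarity_network` and `fuse_networks` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_network(view):
    """Turn one data view (a list of per-patient vectors) into an n x n weight matrix."""
    n = len(view)
    return [[cosine_similarity(view[i], view[j]) for j in range(n)] for i in range(n)]


def fuse_networks(networks):
    """Fuse several patient networks by averaging edge weights (an assumed, naive scheme)."""
    n = len(networks[0])
    k = len(networks)
    return [[sum(net[i][j] for net in networks) / k for j in range(n)] for i in range(n)]


# Hypothetical toy data: three patients, two views (expression-like and clinical-like).
expression = [[1.0, 0.0, 2.0], [1.1, 0.1, 1.9], [0.0, 3.0, 0.2]]
clinical = [[0.5, 1.0], [0.6, 0.9], [1.0, 0.1]]

fused = fuse_networks([similarity_network(expression), similarity_network(clinical)])
```

In this toy case patients 0 and 1 end up connected by a heavier edge than patients 0 and 2, because they are close in both views. A real pipeline would likely normalise each view and weight the views differently before fusing.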
Affiliation(s)
- Jessica Gliozzo
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; European Commission, Joint Research Centre (JRC), Ispra (VA), Italy; CINI, Infolife National Laboratory, Roma, Italy
- Marco Mesiti
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
- Marco Notaro
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
- Alessandro Petrini
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
- Alex Patak
  - European Commission, Joint Research Centre (JRC), Ispra (VA), Italy
- Alberto Paccanaro
  - Department of Computer Science, Royal Holloway, University of London, Egham, TW20 0EX, UK; School of Applied Mathematics (EMAp), Fundação Getúlio Vargas, Rio de Janeiro, Brazil
- Giorgio Valentini
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy; DSRC UNIMI, Data Science Research Center, Milano, 20135, Italy; ELLIS, European Laboratory for Learning and Intelligent Systems, Berlin, Germany
- Elena Casiraghi
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
2
Wu Y, Sa Y, Guo Y, Li Q, Zhang N. Identification of WHO II/III gliomas by 16 prognostic-related gene signatures using machine learning methods. Curr Med Chem 2021; 29:1622-1639. [PMID: 34455959 DOI: 10.2174/0929867328666210827103049] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/08/2021] [Revised: 05/27/2021] [Accepted: 05/28/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND It is found that the prognosis of gliomas of the same grade has large differences among World Health Organization(WHO) grade II and III in clinical observation. Therefore, a better understanding of the genetics and molecular mechanisms underlying WHO grade II and III gliomas is required, with the aim of developing a classification scheme at the molecular level rather than the conventional pathological morphology level. METHOD We performed survival analysis combined with machine learning methods of Least Absolute Shrinkage and Selection Operator using expression datasets downloaded from the Chinese Glioma Genome Atlas as well as The Cancer Genome Atlas. Risk scores were calculated by the product of expression level of overall survival-related genes and their multivariate Cox proportional hazards regression coefficients. WHO grade II and III gliomas were categorized into the low-risk subgroup, medium-risk subgroup, and high-risk subgroup. We used the 16 prognostic-related genes as input features to build a classification model based on prognosis using a fully connected neural network. Gene function annotations were also performed. RESULTS The 16 genes (AKNAD1, C7orf13, CDK20, CHRFAM7A, CHRNA1, EFNB1, GAS1, HIST2H2BE, KCNK3, KLHL4, LRRK2, NXPH3, PIGZ, SAMD5, ERINC2, and SIX6) related to the glioma prognosis were screened. The 16 selected genes were associated with the development of gliomas and carcinogenesis. The accuracy of an external validation data set of the fully connected neural network model from the two cohorts reached 95.5%. Our method has good potential capability in classifying WHO grade II and III gliomas into low-risk, medium-risk, and high-risk subgroups. The subgroups showed significant (P<0.01) differences in overall survival. CONCLUSION This resulted in the identification of 16 genes that were related to the prognosis of gliomas. 
Here we developed a computational method to discriminate WHO grade II and III gliomas into three subgroups with distinct prognoses. The gene expression-based method provides a reliable alternative to determine the prognosis of gliomas.
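The risk score described in this abstract can be sketched as a weighted sum: each gene's expression level is multiplied by its Cox regression coefficient and the products are added up. In the sketch below the three gene names come from the abstract, but the coefficient values, the patient data and the tertile cut-offs used to form the low-, medium- and high-risk subgroups are all hypothetical assumptions; the abstract does not state how the thresholds were chosen.

```python
def risk_score(expression, coefficients):
    """Sum of (Cox coefficient x expression level) over the signature genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def tertile_groups(scores):
    """Split patients into low/medium/high risk by score tertiles (assumed cut-offs)."""
    ordered = sorted(scores.values())
    n = len(ordered)
    low_cut = ordered[n // 3]
    high_cut = ordered[(2 * n) // 3]
    groups = {}
    for patient, score in scores.items():
        if score < low_cut:
            groups[patient] = "low"
        elif score < high_cut:
            groups[patient] = "medium"
        else:
            groups[patient] = "high"
    return groups


# Hypothetical coefficients for three of the 16 genes; real values come from the Cox fit.
coefficients = {"GAS1": 0.4, "KCNK3": -0.3, "EFNB1": 0.2}

# Hypothetical expression levels for six patients.
patients = {
    "P1": {"GAS1": 1.0, "KCNK3": 2.0, "EFNB1": 0.5},
    "P2": {"GAS1": 2.0, "KCNK3": 1.0, "EFNB1": 1.0},
    "P3": {"GAS1": 3.0, "KCNK3": 0.5, "EFNB1": 2.0},
    "P4": {"GAS1": 0.5, "KCNK3": 3.0, "EFNB1": 0.5},
    "P5": {"GAS1": 1.5, "KCNK3": 1.5, "EFNB1": 1.0},
    "P6": {"GAS1": 4.0, "KCNK3": 0.0, "EFNB1": 1.0},
}

scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
groups = tertile_groups(scores)
```

A positive coefficient raises the score as expression rises, and a negative one lowers it, so the sign of each coefficient indicates whether the gene behaves as a risk or a protective factor in the fitted model.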
Affiliation(s)
- YaMeng Wu
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- Yu Sa
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- Yu Guo
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- QiFeng Li
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- Ning Zhang
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
3
Polewko-Klim A, Mnich K, Rudnicki WR. Robust Data Integration Method for Classification of Biomedical Data. J Med Syst 2021; 45:45. [PMID: 33624190 PMCID: PMC7902598 DOI: 10.1007/s10916-021-01718-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2020] [Accepted: 01/26/2021] [Indexed: 10/26/2022]
Abstract
We present a protocol for integrating two types of biological data - clinical and molecular - for more effective classification of patients with cancer. The proposed approach is a hybrid between early and late data integration strategy. In this hybrid protocol, the set of informative clinical features is extended by the classification results based on molecular data sets. The results are then treated as new synthetic variables. The hybrid protocol was applied to METABRIC breast cancer samples and TCGA urothelial bladder carcinoma samples. Various data types were used for clinical endpoint prediction: clinical data, gene expression, somatic copy number aberrations, RNA-Seq, methylation, and reverse phase protein array. The performance of the hybrid data integration was evaluated with a repeated cross validation procedure and compared with other methods of data integration: early integration and late integration via super learning. The hybrid method gave similar results to those obtained by the best of the tested variants of super learning. What is more, the hybrid method allowed for further sensitivity analysis and recursive feature elimination, which led to compact predictive models for cancer clinical endpoints. For breast cancer, the final model consists of eight clinical variables and two synthetic features obtained from molecular data. For urothelial bladder carcinoma, only two clinical features and one synthetic variable were necessary to build the best predictive model. We have shown that the inclusion of the synthetic variables based on the RNA expression levels and copy number alterations can lead to improved quality of prognostic tests. Thus, it should be considered for inclusion in wider medical practice.
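The hybrid protocol in this abstract can be pictured as a two-step process: a classifier trained on molecular data produces a prediction for each patient, and that prediction is appended to the clinical features as a synthetic variable. The sketch below is a minimal version under stated assumptions: a nearest-centroid scorer stands in for the molecular classifier, leave-one-out prediction stands in for the repeated cross-validation mentioned in the abstract, and all data values and function names are hypothetical.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equally long feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def molecular_score(train_x, train_y, sample):
    """Score in [0, 1]; higher means the sample lies closer to the class-1 centroid."""
    c0 = centroid([x for x, y in zip(train_x, train_y) if y == 0])
    c1 = centroid([x for x, y in zip(train_x, train_y) if y == 1])
    d0, d1 = distance(sample, c0), distance(sample, c1)
    if d0 + d1 == 0.0:
        return 0.5
    return d0 / (d0 + d1)


def synthetic_feature(molecular, labels):
    """Leave-one-out scores, so no patient is scored by a model that saw their label."""
    scores = []
    for i, sample in enumerate(molecular):
        train_x = molecular[:i] + molecular[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        scores.append(molecular_score(train_x, train_y, sample))
    return scores


def hybrid_table(clinical, molecular, labels):
    """Extend each patient's clinical features with the molecular synthetic variable."""
    scores = synthetic_feature(molecular, labels)
    return [row + [score] for row, score in zip(clinical, scores)]


# Hypothetical toy data: six patients, two molecular and two clinical features each.
molecular = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
clinical = [[55, 0], [60, 1], [48, 0], [70, 1], [66, 1], [72, 0]]
labels = [0, 0, 0, 1, 1, 1]

table = hybrid_table(clinical, molecular, labels)
```

The resulting table has one extra column per molecular data set, which keeps the final model small enough for the sensitivity analysis and feature elimination that the abstract mentions. Producing the scores out-of-sample matters here, since in-sample predictions would leak label information into the synthetic variable.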
Affiliation(s)
- Aneta Polewko-Klim
  - Institute of Computer Science, University of Bialystok, Bialystok, Poland
- Krzysztof Mnich
  - Computational Center, University of Bialystok, Bialystok, Poland
- Witold R. Rudnicki
  - Institute of Computer Science, University of Bialystok, Bialystok, Poland
  - Computational Center, University of Bialystok, Bialystok, Poland
4
Gupta M, Gupta B. A novel gene expression test method of minimizing breast cancer risk in reduced cost and time by improving SVM-RFE gene selection method combined with LASSO. J Integr Bioinform 2020; 18:139-153. [PMID: 34171941 PMCID: PMC7856389 DOI: 10.1515/jib-2019-0110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/31/2019] [Accepted: 11/12/2020] [Indexed: 01/26/2023] Open
Abstract
Breast cancer is a leading cause of death in women. It is induced by genetic mutations in breast cancer cells. Genetic testing has become a popular way to detect gene mutations, but the test is relatively expensive for many patients in developing countries such as India. A genetic test takes between 2 and 4 weeks to determine whether cancer is present. This delay worsens the prognosis because some patients have a high rate of cancerous cell growth. In this work, a cost- and time-efficient method is proposed to predict gene expression levels on the basis of the patient's clinical outcomes by using machine learning techniques. An improved SVM-RFE_MI gene selection technique is proposed to find the most significant genes related to breast cancer; afterward, explained-variance statistical analysis is applied to extract the genes with high variance. Least Absolute Shrinkage and Selection Operator (LASSO) and Ridge regression techniques are used to predict the gene expression level. The proposed method predicts the expression of significant genes with reduced Root Mean Square Error and an acceptable adjusted R-square value. According to the study, analysis of these selected genes is beneficial for diagnosing breast cancer at an early stage at reduced cost and time.
Affiliation(s)
- Madhuri Gupta
  - Department of Computer Engineering and Information Technology, ABES Engineering College, Ghaziabad, Uttar Pradesh, India
- Bharat Gupta
  - Department of CS&IT, Jaypee Institute of Information Technology, Noida, Uttar Pradesh, India
5
Rodosthenous T, Shahrezaei V, Evangelou M. Integrating multi-OMICS data through sparse canonical correlation analysis for the prediction of complex traits: a comparison study. Bioinformatics 2020; 36:4616-4625. [PMID: 32437529 PMCID: PMC7750936 DOI: 10.1093/bioinformatics/btaa530] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 12/05/2019] [Revised: 04/22/2020] [Accepted: 05/16/2020] [Indexed: 01/08/2023] Open
Abstract
Motivation Recent developments in technology have enabled researchers to collect multiple OMICS datasets for the same individuals. The conventional approach for understanding the relationships between the collected datasets and the complex trait of interest would be through the analysis of each OMIC dataset separately from the rest, or to test for associations between the OMICS datasets. In this work we show that integrating multiple OMICS datasets together, instead of analysing them separately, improves our understanding of their in-between relationships as well as the predictive accuracy for the tested trait. Several approaches have been proposed for the integration of heterogeneous and high-dimensional (p≫n) data, such as OMICS. The sparse variant of canonical correlation analysis (CCA) approach is a promising one that seeks to penalize the canonical variables for producing sparse latent variables while achieving maximal correlation between the datasets. Over the last years, a number of approaches for implementing sparse CCA (sCCA) have been proposed, where they differ on their objective functions, iterative algorithm for obtaining the sparse latent variables and make different assumptions about the original datasets. Results Through a comparative study we have explored the performance of the conventional CCA proposed by Parkhomenko et al., penalized matrix decomposition CCA proposed by Witten and Tibshirani and its extension proposed by Suo et al. The aforementioned methods were modified to allow for different penalty functions. Although sCCA is an unsupervised learning approach for understanding of the in-between relationships, we have twisted the problem as a supervised learning one and investigated how the computed latent variables can be used for predicting complex traits. The approaches were extended to allow for multiple (more than two) datasets where the trait was included as one of the input datasets. 
Both ways have shown improvement over conventional predictive models that include one or multiple datasets. Availability and implementation https://github.com/theorod93/sCCA. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Vahid Shahrezaei
  - Department of Mathematics, Imperial College London, London SW7 2AZ, UK
- Marina Evangelou
  - Department of Mathematics, Imperial College London, London SW7 2AZ, UK
6
Mohaiminul Islam M, Huang S, Ajwad R, Chi C, Wang Y, Hu P. An integrative deep learning framework for classifying molecular subtypes of breast cancer. Comput Struct Biotechnol J 2020; 18:2185-2199. [PMID: 32952934 PMCID: PMC7473884 DOI: 10.1016/j.csbj.2020.08.005] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 04/06/2020] [Revised: 07/31/2020] [Accepted: 08/03/2020] [Indexed: 12/13/2022] Open
Abstract
Classification of breast cancer subtypes using multi-omics profiles is a difficult problem since the data sets are high-dimensional and highly correlated. Deep neural network (DNN) learning has demonstrated advantages over traditional methods as it does not require any hand-crafted features, but rather automatically extracts features from raw data and efficiently analyzes high-dimensional and correlated data. We aim to develop an integrative deep learning framework for classifying molecular subtypes of breast cancer. We collect copy number alteration and gene expression data measured on the same breast cancer patients from the Molecular Taxonomy of Breast Cancer International Consortium. We propose a deep learning model to integrate the omics datasets for predicting their molecular subtypes. The performance of our proposed DNN model is compared with some baseline models. Furthermore, we evaluate the misclassification of the subtypes using the learned deep features and explore their usefulness for clustering the breast cancer patients. We demonstrate that our proposed integrative deep learning model is superior to other deep learning and non-deep learning based models. In particular, we get the best prediction result among the deep learning-based integration models when we integrate the two data sources using the concatenation layer in the models without sharing the weights. Using the learned deep features, we identify 6 breast cancer subgroups and show that Her2-enriched samples can be classified into more than one tumor subtype. Overall, the integrated model shows better performance than those trained on individual data sources.
Affiliation(s)
- Md. Mohaiminul Islam
  - Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada
- Shujun Huang
  - College of Pharmacy, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada
- Rasif Ajwad
  - Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada
- Chen Chi
  - Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada
- Yang Wang
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada
- Pingzhao Hu
  - Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada
  - Research Institute in Oncology and Hematology, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada
7
Wani N, Raza K. Integrative approaches to reconstruct regulatory networks from multi-omics data: A review of state-of-the-art methods. Comput Biol Chem 2019; 83:107120. [PMID: 31499298 DOI: 10.1016/j.compbiolchem.2019.107120] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 08/03/2018] [Revised: 02/22/2019] [Accepted: 08/27/2019] [Indexed: 02/06/2023]
Abstract
Data generation using high throughput technologies has led to the accumulation of diverse types of molecular data. These data have different types (discrete, real, string, etc.) and occur in various formats and sizes. Datasets including gene expression, miRNA expression, protein-DNA binding data (ChIP-Seq/ChIP-ChIP), mutation data (copy number variation, single nucleotide polymorphisms), annotations, interactions, and association data are some of the commonly used biological datasets to study various cellular mechanisms of living organisms. Each of them provides a unique, complementary and partly independent view of the genome and hence embed essential information about the regulatory mechanisms of genes and their products. Therefore, integrating these data and inferring regulatory interactions from them offer a system level of biological insight in predicting gene functions and their phenotypic outcomes. To study genome functionality through regulatory networks, different methods have been proposed for collective mining of information from an integrated dataset. We survey here integration methods that reconstruct regulatory networks using state-of-the-art techniques to handle multi-omics (i.e., genomic, transcriptomic, proteomic) and other biological datasets.
Affiliation(s)
- Nisar Wani
  - Govt. Degree College Baramulla, J & K, India; Department of Computer Science, Jamia Millia Islamia, New Delhi, India
- Khalid Raza
  - Department of Computer Science, Jamia Millia Islamia, New Delhi, India
8
Aouiche C, Chen B, Shang X. Predicting stage-specific cancer related genes and their dynamic modules by integrating multiple datasets. BMC Bioinformatics 2019; 20:194. [PMID: 31074385 PMCID: PMC6509867 DOI: 10.1186/s12859-019-2740-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND The mechanism of many complex diseases has not been detected accurately in terms of their stage evolution. Previous studies mainly focus on the identification of associations between genes and individual diseases, but less is known about their associations with specific disease stages. Exploring biological modules through different disease stages could provide valuable knowledge to genomic and clinical research. RESULTS In this study, we proposed a powerful and versatile framework to identify stage-specific cancer related genes and their dynamic modules by integrating multiple datasets. The discovered modules and their specific-signature genes were significantly enriched in many relevant known pathways. To further illustrate the dynamic evolution of these clinical stages, a pathway network was built by taking individual pathways as vertices and the overlapping relationship between their annotated genes as edges. CONCLUSIONS The identified pathway network not only helps us to understand the functional evolution of complex diseases, but is also useful for clinical management in selecting the optimum treatment regimens and the appropriate drugs for patients.
Affiliation(s)
- Chaima Aouiche
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China; Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University Ministry of Industry and Information Technology, Xi'an, China
- Bolin Chen
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China; Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University Ministry of Industry and Information Technology, Xi'an, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China; Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University Ministry of Industry and Information Technology, Xi'an, China
9
López de Maturana E, Alonso L, Alarcón P, Martín-Antoniano IA, Pineda S, Piorno L, Calle ML, Malats N. Challenges in the Integration of Omics and Non-Omics Data. Genes (Basel) 2019; 10:genes10030238. [PMID: 30897838 PMCID: PMC6471713 DOI: 10.3390/genes10030238] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Received: 01/02/2019] [Revised: 03/05/2019] [Accepted: 03/14/2019] [Indexed: 11/16/2022] Open
Abstract
Omics data integration is already a reality. However, few omics-based algorithms show enough predictive ability to be implemented into clinics or public health domains. Clinical/epidemiological data tend to explain most of the variation of health-related traits, and its joint modeling with omics data is crucial to increase the algorithm’s predictive ability. Only a small number of published studies performed a “real” integration of omics and non-omics (OnO) data, mainly to predict cancer outcomes. Challenges in OnO data integration regard the nature and heterogeneity of non-omics data, the possibility of integrating large-scale non-omics data with high-throughput omics data, the relationship between OnO data (i.e., ascertainment bias), the presence of interactions, the fairness of the models, and the presence of subphenotypes. These challenges demand the development and application of new analysis strategies to integrate OnO data. In this contribution we discuss different attempts of OnO data integration in clinical and epidemiological studies. Most of the reviewed papers considered only one type of omics data set, mainly RNA expression data. All selected papers incorporated non-omics data in a low-dimensionality fashion. The integrative strategies used in the identified papers adopted three modeling methods: Independent, conditional, and joint modeling. This review presents, discusses, and proposes integrative analytical strategies towards OnO data integration.
Affiliation(s)
- Evangelina López de Maturana
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Lola Alonso
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Pablo Alarcón
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Isabel Adoración Martín-Antoniano
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Silvia Pineda
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Lucas Piorno
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- M Luz Calle
  - Biosciences Department, University of Vic-Central University of Catalonia, Carrer de la Laura 13, 08570 Vic, Spain
- Núria Malats
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
10
Efficient Implementation of Penalized Regression for Genetic Risk Prediction. Genetics 2019; 212:65-74. [PMID: 30808621 PMCID: PMC6499521 DOI: 10.1534/genetics.119.302019] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Received: 10/11/2018] [Accepted: 02/22/2019] [Indexed: 12/14/2022] Open
Abstract
Polygenic risk scores (PRS) combine many single-nucleotide polymorphisms into a score reflecting the genetic risk of developing a disease. Privé, Aschard, and Blum present an efficient implementation of penalized logistic regression... Polygenic Risk Scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify high-genetic risk individuals for a given disease. The “Clumping+Thresholding” (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method for the joint estimation of SNP effects using individual-level data, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. We also provide an implementation of penalized linear regression for quantitative traits. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. Overall, we find that PLR achieves equal or higher predictive performance than C+T in most scenarios considered, while being scalable to biobank data. In particular, we find that improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, in simulations, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. 
We confirm these results in a data analysis of a case-control study for celiac disease where PLR and the standard C+T method achieve AUC values of 89% and of 82.5%. Applying penalized linear regression to 350,000 individuals of the UK Biobank, we predict height with a larger correlation than with the best prediction of C+T (∼65% instead of ∼55%), further demonstrating its scalability and strong predictive power, even for highly polygenic traits. Moreover, using 150,000 individuals of the UK Biobank, we are able to predict breast cancer better than C+T, fitting PLR in a few minutes only. In conclusion, this paper demonstrates the feasibility and relevance of using penalized regression for PRS computation when large individual-level datasets are available, thanks to the efficient implementation available in our R package bigstatsr.
11
Bashiri A, Ghazisaeedi M, Safdari R, Shahmoradi L, Ehtesham H. Improving the Prediction of Survival in Cancer Patients by Using Machine Learning Techniques: Experience of Gene Expression Data: A Narrative Review. Iranian Journal of Public Health 2017; 46:165-172. [PMID: 28451550 PMCID: PMC5402773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
BACKGROUND Today, despite the many advances in early detection of diseases, cancer patients have a poor prognosis and their survival rates are low. Recently, microarray technologies have been used to gather thousands of data points about the gene expression level of cancer cells. These types of data are the main indicators in survival prediction of cancer. This study highlights the improvement of survival prediction based on gene expression data by using machine learning techniques in cancer patients. METHODS This review article was conducted by searching articles published between 2000 and 2016 in scientific databases and e-Journals. We used keywords such as machine learning, gene expression data, survival and cancer. RESULTS Studies have shown the high accuracy and effectiveness of gene expression data in comparison with clinical data in survival prediction. Because of the complexity and high volume of such data, studies have highlighted the importance of machine learning algorithms such as Artificial Neural Networks (ANN) for finding the distinctive signatures of gene expression in cancer patients. These algorithms improve the efficiency of probing and analyzing gene expression in cancer profiles for survival prediction of cancer. CONCLUSION Given the capabilities of machine learning techniques in proteomics and genomics applications, developing clinical decision support systems based on these methods for analyzing gene expression data can prevent potential errors in survival estimation, provide appropriate and individualized treatments to patients and improve the prognosis of cancers.
12
An adaptive version of k-medoids to deal with the uncertainty in clustering heterogeneous data using an intermediary fusion approach. Knowl Inf Syst 2017. [DOI: 10.1007/s10115-016-0930-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 10/22/2022]
|
13
|
Cava C, Colaprico A, Bertoli G, Bontempi G, Mauri G, Castiglioni I. How interacting pathways are regulated by miRNAs in breast cancer subtypes. BMC Bioinformatics 2016; 17:348. [PMID: 28185585 PMCID: PMC5123339 DOI: 10.1186/s12859-016-1196-1] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Indexed: 02/08/2023]
Abstract
BACKGROUND An important challenge in cancer biology is to understand the complex aspects of the disease. It is increasingly evident that genes are not isolated from each other, and understanding how different genes are related to each other could explain the biological mechanisms causing diseases. Biological pathways are important tools to reveal gene interactions and to reduce the large number of genes to be studied by partitioning them into smaller paths. Furthermore, recent scientific evidence has shown that a combination of pathways, rather than a single element of a pathway or a single pathway, could be responsible for pathological changes in a cell. RESULTS In this paper we develop a new method that can reveal miRNAs able to regulate, in a coordinated way, networks of gene pathways. We applied the method to subtypes of breast cancer. The basic idea is the identification of pathways significantly enriched with differentially expressed genes among the different breast cancer subtypes and normal tissue. Looking at the pairs of pathways that were found to be functionally related, we created a network of dependent pathways and we focused on identifying miRNAs that could act as miRNA drivers in a coordinated regulation process. CONCLUSIONS Our approach enables the identification of miRNAs that could have an important role in the development of breast cancer.
Affiliation(s)
- Claudia Cava
  - Institute of Molecular Bioimaging and Physiology (IBFM), National Research Council (CNR), Milan, Italy
- Antonio Colaprico
  - Interuniversity Institute of Bioinformatics in Brussels (IB), Brussels, Belgium
  - Machine Learning Group, ULB, Brussels, Belgium
- Gloria Bertoli
  - Institute of Molecular Bioimaging and Physiology (IBFM), National Research Council (CNR), Milan, Italy
- Gianluca Bontempi
  - Interuniversity Institute of Bioinformatics in Brussels (IB), Brussels, Belgium
  - Machine Learning Group, ULB, Brussels, Belgium
- Giancarlo Mauri
  - Department of Informatics, Systems and Communications, University of Milan–Bicocca, Milan, Italy
- Isabella Castiglioni
  - Institute of Molecular Bioimaging and Physiology (IBFM), National Research Council (CNR), Milan, Italy
|
14
|
Winzer KJ, Buchholz A, Schumacher M, Sauerbrei W. Improving the Prognostic Ability through Better Use of Standard Clinical Data - The Nottingham Prognostic Index as an Example. PLoS One 2016; 11:e0149977. [PMID: 26938061 PMCID: PMC4777365 DOI: 10.1371/journal.pone.0149977] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 09/16/2015] [Accepted: 02/08/2016] [Indexed: 12/17/2022]
Abstract
BACKGROUND Prognostic factors and prognostic models play a key role in medical research and patient management. The Nottingham Prognostic Index (NPI) is a well-established prognostic classification scheme for patients with breast cancer. In a very simple way, it combines the information from tumor size, lymph node stage and tumor grade. For the resulting index, cutpoints are proposed to classify patients into three to six groups with different prognoses. As not all prognostic information from the three and other standard factors is used, we will consider improvement of the prognostic ability using suitable analysis approaches. METHODS AND FINDINGS Reanalyzing overall survival data of 1560 patients from a clinical database by using multivariable fractional polynomials and further modern statistical methods, we illustrate suitable multivariable modelling and methods to derive and assess the prognostic ability of an index. Using a REMARK type profile we summarize relevant steps of the analysis. Adding the information from hormonal receptor status and using the full information from the three NPI components, specifically concerning the number of positive lymph nodes, an extended NPI with improved prognostic ability is derived. CONCLUSIONS The prognostic ability of even one of the best-established prognostic indices in medicine can be improved by using suitable statistical methodology to extract the full information from standard clinical data. This extended version of the NPI can serve as a benchmark to assess the added value of new information, ranging from a new single clinical marker to a derived index from omics data. An established benchmark would also help to harmonize the statistical analyses of such studies and protect against the propagation of many false promises concerning the prognostic value of new measurements. Statistical methods used are generally available and can be used for similar analyses in other diseases.
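The abstract describes the NPI only in words. As a rough illustration, the sketch below uses the commonly quoted form of the classical index, 0.2 × tumor size in cm + lymph node stage (1 to 3) + histological grade (1 to 3), together with the three-group cutpoints 3.4 and 5.4. Neither the formula nor the cutpoints are stated in the abstract; they are assumptions taken from the general literature. The function names are hypothetical, and this is the classical index, not the extended NPI that the authors derive.

```python
def nottingham_prognostic_index(size_cm: float, node_stage: int, grade: int) -> float:
    """Classical NPI (assumed form): 0.2 * size in cm + node stage + grade."""
    if size_cm < 0:
        raise ValueError("tumor size must be non-negative")
    if node_stage not in (1, 2, 3) or grade not in (1, 2, 3):
        raise ValueError("node stage and grade must each be 1, 2 or 3")
    return 0.2 * size_cm + node_stage + grade


def npi_group(npi: float) -> str:
    """Three-group classification with the commonly quoted cutpoints (assumed)."""
    if npi <= 3.4:
        return "good"
    if npi <= 5.4:
        return "moderate"
    return "poor"


if __name__ == "__main__":
    # Hypothetical patient: 2.5 cm tumor, node stage 2, grade 2.
    score = nottingham_prognostic_index(2.5, 2, 2)
    print(score, npi_group(score))
```

The sketch also makes the abstract's criticism concrete: the node stage collapses the number of positive lymph nodes into three levels, so part of the available prognostic information is discarded before the index is computed.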
Affiliation(s)
- Klaus-Jürgen Winzer
  - Charité–Universitätsmedizin Berlin, Klinik für Gynäkologie mit Brustzentrum, Berlin, Germany
- Anika Buchholz
  - Universitätsklinikum Freiburg, Institut für Medizinische Biometrie und Statistik, Department für Medizinische Biometrie und Medizinische Informatik, Freiburg, Germany
  - Universitätsklinikum Hamburg-Eppendorf, Institut für Medizinische Biometrie und Epidemiologie, Hamburg, Germany
- Martin Schumacher
  - Universitätsklinikum Freiburg, Institut für Medizinische Biometrie und Statistik, Department für Medizinische Biometrie und Medizinische Informatik, Freiburg, Germany
- Willi Sauerbrei
  - Universitätsklinikum Freiburg, Institut für Medizinische Biometrie und Statistik, Department für Medizinische Biometrie und Medizinische Informatik, Freiburg, Germany
|
15
|
Gligorijević V, Pržulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015; 12:20150571. [PMID: 26490630 PMCID: PMC4685837 DOI: 10.1098/rsif.2015.0571] [Citation(s) in RCA: 157] [Impact Index Per Article: 15.7] [Received: 06/26/2015] [Accepted: 09/25/2015] [Indexed: 12/17/2022]
Abstract
Rapid technological advances have led to the production of different types of biological data and enabled construction of complex networks with various types of interactions between diverse biological entities. Standard network data analysis methods were shown to be limited in dealing with such heterogeneous networked data and consequently, new methods for integrative data analyses have been proposed. The integrative methods can collectively mine multiple types of biological data and produce more holistic, systems-level biological insights. We survey recent methods for collective mining (integration) of various types of networked biological data. We compare different state-of-the-art methods for data integration and highlight their advantages and disadvantages in addressing important biological problems. We identify the important computational challenges of these methods and provide a general guideline for which methods are suited for specific biological problems, or specific data types. Moreover, we propose that recent non-negative matrix factorization-based approaches may become the integration methodology of choice, as they are well suited and accurate in dealing with heterogeneous data and have many opportunities for further development.
Affiliation(s)
- Nataša Pržulj
  - Department of Computing, Imperial College London, London SW7 2AZ, UK
|
16
|
Taskesen E, Babaei S, Reinders MMJ, de Ridder J. Integration of gene expression and DNA-methylation profiles improves molecular subtype classification in acute myeloid leukemia. BMC Bioinformatics 2015; 16 Suppl 4:S5. [PMID: 25734246 PMCID: PMC4347619 DOI: 10.1186/1471-2105-16-s4-s5] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 12/02/2022]
Abstract
Background Acute Myeloid Leukemia (AML) is characterized by various cytogenetic and molecular abnormalities. Detection of these abnormalities is important in the risk-classification of patients but requires laborious experimentation. Various studies showed that gene expression profiles (GEP), and the gene signatures derived from GEP, can be used for the prediction of subtypes in AML. Similarly, successful prediction was also achieved by exploiting DNA-methylation profiles (DMP). There are, however, no studies that compared classification accuracy and performance between GEP and DMP, nor are there studies that integrated both types of data to determine whether predictive power can be improved. Approach Here, we used 344 well-characterized AML samples for which both gene expression and DNA-methylation profiles are available. We created three different classification strategies including early, late and no integration of these datasets and used them to predict AML subtypes using a logistic regression model with Lasso regularization. Results We illustrate that both gene expression and DNA-methylation profiles contain distinct patterns that contribute to discriminating AML subtypes and that an integration strategy can exploit these patterns to achieve synergy between both data types. We show that concatenation of features from both data sets, i.e. early integration, improves the predictive power compared to classifiers trained on GEP or DMP alone. A more sophisticated strategy, i.e. the late integration strategy, employs a two-layer classifier which outperforms the early integration strategy. Conclusion We demonstrate that prediction of known cytogenetic and molecular abnormalities in AML can be further improved by integrating GEP and DMP profiles.
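The difference between early and late integration can be shown in a few lines. The sketch below is only an illustration under loud assumptions: it swaps the paper's Lasso-regularized logistic regression for a toy nearest-centroid classifier so that it needs no third-party libraries, it trains the second layer on in-sample first-layer scores rather than cross-validated ones, and every class and function name is hypothetical.

```python
from statistics import mean


class NearestCentroid:
    """Toy stand-in classifier (NOT the Lasso logistic regression of the paper)."""

    def fit(self, X, y):
        # One centroid (column-wise mean) per class label.
        self.centroids = {
            label: [mean(col) for col in zip(*[x for x, l in zip(X, y) if l == label])]
            for label in sorted(set(y))
        }
        return self

    def scores(self, x):
        # Negative squared distance: a higher score means a closer centroid.
        return {
            label: -sum((a - b) ** 2 for a, b in zip(x, c))
            for label, c in self.centroids.items()
        }

    def predict(self, x):
        s = self.scores(x)
        return max(s, key=s.get)


def early_integration(gep, dmp, y):
    """Early integration: concatenate both views, then train one classifier."""
    X = [g + d for g, d in zip(gep, dmp)]
    return NearestCentroid().fit(X, y)


def _score_vector(model, x):
    s = model.scores(x)
    return [s[label] for label in sorted(s)]


def late_integration(gep, dmp, y):
    """Late integration: one classifier per view, then a second-layer classifier
    trained on the concatenated per-view scores."""
    m_gep = NearestCentroid().fit(gep, y)
    m_dmp = NearestCentroid().fit(dmp, y)
    meta_X = [_score_vector(m_gep, g) + _score_vector(m_dmp, d) for g, d in zip(gep, dmp)]
    meta = NearestCentroid().fit(meta_X, y)

    def predict(g, d):
        return meta.predict(_score_vector(m_gep, g) + _score_vector(m_dmp, d))

    return predict


if __name__ == "__main__":
    # Made-up toy data: two expression features and one methylation feature per sample.
    gep = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
    dmp = [[5.0], [5.2], [1.0], [0.8]]
    y = ["A", "A", "B", "B"]
    print(early_integration(gep, dmp, y).predict([0.1, 0.0] + [5.1]))
    print(late_integration(gep, dmp, y)([1.0, 1.0], [1.0]))
```

In the early strategy the classifier sees raw features from both views at once; in the late strategy each view is first reduced to class scores, so the second layer only has to learn how much to trust each view.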
|
17
|
Žitnik M, Zupan B. Data Fusion by Matrix Factorization. IEEE Trans Pattern Anal Mach Intell 2015; 37:41-53. [PMID: 26353207 DOI: 10.1109/tpami.2014.2343973] [Citation(s) in RCA: 90] [Impact Index Per Article: 9.0] [Indexed: 06/05/2023]
Abstract
For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about system's constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization (DFMF) that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF for gene function prediction task with eleven different data sources and for prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
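The tri-factorization idea can be written compactly. The notation below is assumed for illustration and is not taken from the abstract: each relation matrix $R_{ij}$ between object types $i$ and $j$ is approximated by non-negative type-specific factors $G_i$ and $G_j$ and a relation-specific middle factor $S_{ij}$, while constraint matrices $\Theta_i$ enter through a penalty term. The exact objective and constraints used in the paper may differ.

```latex
R_{ij} \approx G_i \, S_{ij} \, G_j^{\top},
\qquad
\min_{G_i \ge 0,\; S_{ij}} \;
\sum_{(i,j)} \bigl\lVert R_{ij} - G_i S_{ij} G_j^{\top} \bigr\rVert_F^{2}
\;+\; \sum_{i} \operatorname{tr}\!\bigl( G_i^{\top} \Theta_i \, G_i \bigr)
```

Because each $G_i$ is shared by every relation that involves object type $i$, the individual factorizations are coupled, which is what allows one data source to inform the reconstruction of another.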
|
18
|
Thomas M, De Brabanter K, Suykens JAK, De Moor B. Predicting breast cancer using an expression values weighted clinical classifier. BMC Bioinformatics 2014; 15:411. [PMID: 25551433 PMCID: PMC4308909 DOI: 10.1186/s12859-014-0411-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 06/10/2013] [Accepted: 12/05/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND Clinical data, such as patient history, laboratory analysis and ultrasound parameters, which are the basis of day-to-day clinical decision support, are often used to guide the clinical management of cancer in the presence of microarray data. Several data fusion techniques are available to integrate genomics or proteomics data, but only a few studies have created a single prediction model using both gene expression and clinical data. These studies often remain inconclusive regarding an obtained improvement in prediction performance. To improve clinical management, these data should be fully exploited. This requires efficient algorithms to integrate these data sets and design a final classifier. LS-SVM classifiers and generalized eigenvalue/singular value decompositions are successfully used in many bioinformatics applications for prediction tasks. Building on the benefits of these two techniques, we propose a machine learning approach, a weighted LS-SVM classifier, to integrate two data sources: microarray and clinical parameters. RESULTS We compared and evaluated the proposed methods on five breast cancer case studies. Compared to the LS-SVM classifier on individual data sets, generalized eigenvalue decomposition (GEVD) and kernel GEVD, the proposed weighted LS-SVM classifier offers good prediction performance, in terms of test area under the ROC curve (AUC), on all breast cancer case studies. CONCLUSIONS Thus a clinical classifier weighted with a microarray data set results in significantly improved diagnosis, prognosis and prediction of responses to therapy. The proposed model has been shown to be a promising mathematical framework for both data fusion and non-linear classification problems.
Affiliation(s)
- Minta Thomas
  - KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics/iMinds Future Health Department, Kasteelpark Arenberg 10, Leuven, 3001, Belgium.
- Kris De Brabanter
  - Department of Statistics & Computer Science, Iowa State University, Ames, IA, USA.
- Johan A K Suykens
  - KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics/iMinds Future Health Department, Kasteelpark Arenberg 10, Leuven, 3001, Belgium.
- Bart De Moor
  - KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics/iMinds Future Health Department, Kasteelpark Arenberg 10, Leuven, 3001, Belgium.
|
19
|
Letzkus M, Luesink E, Starck-Schwertz S, Bigaud M, Mirza F, Hartmann N, Gerstmayer B, Janssen U, Scherer A, Schumacher MM, Verles A, Vitaliti A, Nirmala N, Johnson KJ, Staedtler F. Gene expression profiling of immunomagnetically separated cells directly from stabilized whole blood for multicenter clinical trials. Clin Transl Med 2014; 3:36. [PMID: 25984272 PMCID: PMC4424390 DOI: 10.1186/s40169-014-0036-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 06/16/2014] [Accepted: 10/07/2014] [Indexed: 12/12/2022]
Abstract
Background Clinically useful biomarkers for patient stratification and monitoring of disease progression and drug response are in high demand in drug development and for addressing potential safety concerns. Many diseases influence the frequency and phenotype of cells found in the peripheral blood and the transcriptome of blood cells. Changes in cell type composition influence whole blood gene expression analysis results and thus the discovery of true transcript level changes remains a challenge. We propose a robust and reproducible procedure, which includes whole transcriptome gene expression profiling of major subsets of immune cells directly sorted from whole blood. Methods Target cells were enriched using magnetic microbeads and an autoMACS® Pro Separator (Miltenyi Biotec). Flow cytometric analysis for purity was performed before and after magnetic cell sorting. Total RNA was hybridized on HGU133 Plus 2.0 expression microarrays (Affymetrix, USA). CEL file signal intensity values were condensed using RMA and a custom CDF file (EntrezGene-based). Results Positive selection by use of MACS® Technology coupled to transcriptomics was assessed for eight different peripheral blood cell types: CD14+ monocytes, CD3+, CD4+, or CD8+ T cells, CD15+ granulocytes, CD19+ B cells, CD56+ NK cells, and CD45+ pan leukocytes. RNA quality from enriched cells was above a RIN of eight. GeneChip analysis confirmed cell type specific transcriptome profiles. Storing whole blood collected in an EDTA Vacutainer® tube at 4°C followed by MACS does not activate sorted cells. Gene expression analysis supports cell enrichment measurements by MACS. Conclusions The proposed workflow generates reproducible cell-type specific transcriptome data which can be translated to clinical settings and used to identify clinically relevant gene expression biomarkers from whole blood samples. This procedure enables the integration of transcriptomics of relevant immune cell subsets sorted directly from whole blood in clinical trial protocols.
Affiliation(s)
- Martin Letzkus
  - Biomarker Development, Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland
- Evert Luesink
  - Biomarker Development, Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland
- Marc Bigaud
  - Biomarker Development, Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland
- Fareed Mirza
  - Scientific Capability Development, Pharma-Development, Novartis Pharma AG, Basel, Switzerland
- Nicole Hartmann
  - Biomarker Development, Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland
- Uwe Janssen
  - Miltenyi Biotec GmbH, Bergisch Gladbach, Germany
- Martin M Schumacher
  - Biomarker Development, Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland
- Aurelie Verles
  - Biomarker Development, Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland
- Alessandra Vitaliti
  - Biomarker Development, Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland
- Nanguneri Nirmala
  - Biomarker Development, Novartis Institutes for BioMedical Research (NIBR), Cambridge, MA, USA
- Keith J Johnson
  - Biomarker Development, Novartis Institutes for BioMedical Research (NIBR), Cambridge, MA, USA
- Frank Staedtler
  - Biomarker Development, Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland
|
20
|
Farhadian M, Mahjub H, Poorolajal J, Moghimbeigi A, Mansoorizadeh M. Predicting 5-Year Survival Status of Patients with Breast Cancer based on Supervised Wavelet Method. Osong Public Health Res Perspect 2014; 5:324-32. [PMID: 25562040 PMCID: PMC4281603 DOI: 10.1016/j.phrp.2014.09.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/04/2014] [Revised: 09/15/2014] [Accepted: 09/22/2014] [Indexed: 12/16/2022]
Abstract
OBJECTIVES Classification of breast cancer patients into different risk classes is very important in clinical applications. It is estimated that the advent of high-dimensional gene expression data could improve patient classification. In this study, a new method for transforming high-dimensional gene expression data into a low-dimensional space based on the wavelet transform (WT) is presented. METHODS The proposed method was applied to three publicly available microarray data sets. After dimensionality reduction using the supervised wavelet method, a predictive support vector machine (SVM) model was built upon the reduced dimensional space. In addition, the proposed method was compared with supervised principal component analysis (PCA). RESULTS The performance of supervised wavelet and supervised PCA based on selected genes was better than that of the signature genes identified in other studies. Furthermore, the supervised wavelet method generally performed better than supervised PCA for predicting the 5-year survival status of patients with breast cancer based on microarray data. In addition, the proposed method had a relatively acceptable performance compared with the other studies. CONCLUSION The results suggest the possibility of developing a new tool using wavelets for the dimension reduction of microarray data sets in the classification framework.
Affiliation(s)
- Maryam Farhadian
  - Department of Epidemiology and Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
- Hossein Mahjub
  - Research Center for Health Sciences and Department of Epidemiology and Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
- Jalal Poorolajal
  - Modeling of Noncommunicable Diseases Research Center, Department of Epidemiology and Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
- Abbas Moghimbeigi
  - Modeling of Noncommunicable Disease Research Center, Department of Biostatistics and Epidemiology, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
- Muharram Mansoorizadeh
  - Department of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamadan, Iran
|
21
|
Žitnik M, Zupan B. Matrix factorization-based data fusion for drug-induced liver injury prediction. Systems Biomedicine 2014. [DOI: 10.4161/sysb.29072] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 12/13/2022]
|
22
|
Huang S, Yee C, Ching T, Yu H, Garmire LX. A novel model to combine clinical and pathway-based transcriptomic information for the prognosis prediction of breast cancer. PLoS Comput Biol 2014; 10:e1003851. [PMID: 25233347 PMCID: PMC4168973 DOI: 10.1371/journal.pcbi.1003851] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.4] [Received: 03/17/2014] [Accepted: 08/08/2014] [Indexed: 01/19/2023]
Abstract
Breast cancer is the most common malignancy in women worldwide. With the increasing awareness of heterogeneity in breast cancers, better prediction of breast cancer prognosis is much needed for more personalized treatment and disease management. Towards this goal, we have developed a novel computational model for breast cancer prognosis by combining the Pathway Deregulation Score (PDS) based pathifier algorithm, Cox regression and the L1-LASSO penalization method. We trained the model on a set of 236 patients with gene expression data and clinical information, and validated the performance on three diversified testing data sets of 606 patients. To evaluate the performance of the model, we conducted survival analysis of the dichotomized groups, and compared the areas under the curve based on the binary classification. The resulting prognosis genomic model is composed of fifteen pathways (e.g. P53 pathway) that had previously reported cancer relevance, and it successfully differentiated relapse in the training set (log rank p-value = 6.25e-12) and three testing data sets (log rank p-value<0.0005). Moreover, the pathway-based genomic models consistently performed better than gene-based models on all four data sets. We also find strong evidence that combining genomic information with clinical information improved the p-values of prognosis prediction by at least three orders of magnitude in comparison to using either genomic or clinical information alone. In summary, we propose a novel prognosis model that harnesses the pathway-based dysregulation as well as valuable clinical information. The selected pathways in our prognosis model are promising targets for therapeutic intervention.
With the increasing awareness of heterogeneity in breast cancers, better prediction of breast cancer prognosis is much needed early on for more personalized treatment and management. Towards this goal we propose in this study a novel pathway-based prognosis prediction model, which emphasizes individualized pathway-based risk measurement using the pathway dysregulation score (PDS). In combination with the L1-LASSO penalized feature selection and the Cox proportional hazards regression model, we have identified fifteen cancer-relevant pathways using the pathway-based genomic model that successfully differentiated relapse in the training set as well as in three diversified test sets. Moreover, given the debate over whether higher-order representative features, such as GO sets, pathways and network modules, are superior to gene-level features in genomic models, we demonstrate that pathway-based genomic models consistently performed better than gene-based models in all four data sets. Last but not least, we show strong evidence that models that combine genomic information with clinical information improve the prognosis prediction significantly, in comparison to models that use either genomic or clinical information alone.
Affiliation(s)
- Sijia Huang
  - Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, Hawaii, United States of America
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii, United States of America
- Cameron Yee
  - Neurobiology Program of Biology Department, University of Washington, Seattle, Washington, United States of America
- Travers Ching
  - Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, Hawaii, United States of America
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii, United States of America
- Herbert Yu
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii, United States of America
- Lana X. Garmire
  - Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, Hawaii, United States of America
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii, United States of America
|
23
|
Identification of a prognostic signature for old-age mortality by integrating genome-wide transcriptomic data with the conventional predictors: the Vitality 90+ Study. BMC Med Genomics 2014; 7:54. [PMID: 25213707 PMCID: PMC4167306 DOI: 10.1186/1755-8794-7-54] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 06/12/2014] [Accepted: 09/08/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND Prediction models for old-age mortality have generally relied upon conventional markers such as plasma-based factors and biophysiological characteristics. However, it is unknown whether the existing markers are able to provide the most relevant information in terms of old-age survival or whether predictions could be improved through the integration of whole-genome expression profiles. METHODS We assessed the predictive abilities of survival models containing only conventional markers, only gene expression data or both types of data together in a Vitality 90+ study cohort consisting of n = 151 nonagenarians. The all-cause death rate was 32.5% (49 of 151 individuals), and the median follow-up time was 2.55 years. RESULTS Three different feature selection models, the penalized Lasso and Ridge regressions and the C-index boosting algorithm, were used to test the genomic data. The Ridge regression model incorporating both the conventional markers and transcripts outperformed the other models. The multivariate Cox regression model was used to adjust for the conventional mortality prediction markers, i.e., the body mass index, frailty index and cell-free DNA level, revealing that 331 transcripts were independently associated with survival. The final mortality-predicting transcriptomic signature derived from the Ridge regression model was mapped to a network that identified nuclear factor kappa B (NF-κB) as a central node. CONCLUSIONS Together with the loss of physiological reserves, the transcriptomic predictors centered around NF-κB underscored the role of immunoinflammatory signaling, the control of the DNA damage response and cell cycle, and mitochondrial functions as the key determinants of old-age mortality.
|
24
|
Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013; 3:3202. [PMID: 24232732 PMCID: PMC3828568 DOI: 10.1038/srep03202] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.8] [Received: 09/10/2013] [Accepted: 10/23/2013] [Indexed: 12/12/2022]
Abstract
The advent of genome-scale genetic and genomic studies allows new insight into disease classification. Recently, a shift was made from linking diseases simply based on their shared genes towards systems-level integration of molecular data. Here, we aim to find relationships between diseases based on evidence from fusing all available molecular interaction and ontology data. We propose a multi-level hierarchy of disease classes that significantly overlaps with existing disease classification. In it, we find 14 disease-disease associations currently not present in Disease Ontology and provide evidence for their relationships through comorbidity data and literature curation. Interestingly, even though the number of known human genetic interactions is currently very small, we find they are the most important predictor of a link between diseases. Finally, we show that omission of any one of the included data sources reduces prediction quality, further highlighting the importance of the paradigm shift towards systems-level data fusion.
|