1. Kundu P, Beura S, Mondal S, Das AK, Ghosh A. Machine learning for the advancement of genome-scale metabolic modeling. Biotechnol Adv 2024; 74:108400. [PMID: 38944218 DOI: 10.1016/j.biotechadv.2024.108400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 05/13/2024] [Accepted: 06/23/2024] [Indexed: 07/01/2024]
Abstract
Constraint-based modeling (CBM) has evolved as the core systems biology tool to map the interrelations between genotype, phenotype, and external environment. The recent advancement of high-throughput experimental approaches and multi-omics strategies has generated a plethora of new and precise information from wide-ranging biological domains. On the other hand, the continuously growing field of machine learning (ML) and its specialized branch of deep learning (DL) provide essential computational architectures for decoding complex and heterogeneous biological data. In recent years, both multi-omics and ML have assisted in the escalation of CBM. Condition-specific omics data, such as transcriptomics and proteomics, helped contextualize the model prediction while analyzing a particular phenotypic signature. At the same time, the advanced ML tools have eased the model reconstruction and analysis to increase the accuracy and prediction power. However, the development of these multi-disciplinary methodological frameworks mainly occurs independently, which limits the concatenation of biological knowledge from different domains. Hence, we have reviewed the potential of integrating multi-disciplinary tools and strategies from various fields, such as synthetic biology, CBM, omics, and ML, to explore the biochemical phenomenon beyond the conventional biological dogma. How the integrative knowledge of these intersected domains has improved bioengineering and biomedical applications has also been highlighted. We categorically explained the conventional genome-scale metabolic model (GEM) reconstruction tools and their improvement strategies through ML paradigms. Further, the crucial role of ML and DL in omics data restructuring for GEM development has also been briefly discussed. Finally, the case-study-based assessment of the state-of-the-art method for improving biomedical and metabolic engineering strategies has been elaborated. 
Therefore, this review demonstrates how integrating experimental and in silico strategies can help map the ever-expanding knowledge of biological systems driven by condition-specific cellular information. This multiview approach will elevate the application of ML-based CBM in the biomedical and bioengineering fields for the betterment of society and the environment.
Affiliation(s)
- Pritam Kundu
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Satyajit Beura
  - Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Suman Mondal
  - P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Amit Kumar Das
  - Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Amit Ghosh
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
  - P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
2. Zapperi S, La Porta CAM. The Response of Triple-Negative Breast Cancer to Neoadjuvant Chemotherapy and the Epithelial–Mesenchymal Transition. Int J Mol Sci 2023; 24:ijms24076422. [PMID: 37047393 PMCID: PMC10094549 DOI: 10.3390/ijms24076422] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/28/2023] [Revised: 03/25/2023] [Accepted: 03/28/2023] [Indexed: 04/03/2023] Open
Abstract
It would be highly desirable to find prognostic and predictive markers for triple-negative breast cancer (TNBC), a strongly heterogeneous and invasive breast cancer subtype often characterized by a high recurrence rate and a poor outcome. Here, we investigated the prognostic and predictive capabilities of ARIADNE, a recently developed transcriptomic test focusing on the epithelial–mesenchymal transition. We first compared the stratification of TNBC patients obtained by ARIADNE with that based on other common pathological indicators, such as grade, stage and nodal status, and found that ARIADNE was more effective than the other methods in dividing patients into groups with different disease-free survival statistics. Next, we considered the response to neoadjuvant chemotherapy and found that the classification provided by ARIADNE led to statistically significant differences in the rates of pathological complete response within the groups.
3. La Porta CAM, Zapperi S. Artificial intelligence in breast cancer diagnostics. Cell Rep Med 2022; 3:100851. [PMID: 36543102 PMCID: PMC9798018 DOI: 10.1016/j.xcrm.2022.100851] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/29/2022] [Revised: 11/10/2022] [Accepted: 11/14/2022] [Indexed: 12/24/2022]
Abstract
Since breast cancer deaths are mainly due to metastasis, predicting the risk that a primary tumor will develop metastasis after a first diagnosis is a central issue that could be addressed by artificial intelligence. To overcome the problem posed by limited availability of standardized datasets, algorithms should include biological insight.
Affiliation(s)
- Caterina AM. La Porta (corresponding author)
  - Department of Environmental Science and Policy, Center for Complexity & Biosystems, University of Milan, via Celoria 10, 20133 Milan, Italy
  - CNR - Consiglio Nazionale delle Ricerche, Istituto di Biofisica, via Celoria 10, 20133 Milan, Italy
- Stefano Zapperi
  - Department of Physics, Center for Complexity & Biosystems, University of Milan, via Celoria 16, 20133 Milan, Italy
  - CNR - Consiglio Nazionale delle Ricerche, Istituto di Chimica della Materia Condensata e di Tecnologie per l'Energia, Via R. Cozzi 53, 20125 Milano, Italy
4. Classification of triple negative breast cancer by epithelial mesenchymal transition and the tumor immune microenvironment. Sci Rep 2022; 12:9651. [PMID: 35688895 PMCID: PMC9187759 DOI: 10.1038/s41598-022-13428-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 02/22/2022] [Accepted: 05/09/2022] [Indexed: 11/22/2022] Open
Abstract
Triple-negative breast cancer (TNBC) accounts for about 15–20% of all breast cancers and differs from other invasive breast cancer types because it grows and spreads rapidly, it has limited treatment options and typically worse prognosis. Since TNBC does not express estrogen or progesterone receptors and little or no human epidermal growth factor receptor (HER2) proteins are present, hormone therapy and drugs targeting HER2 are not helpful, leaving chemotherapy only as the main systemic treatment option. In this context, it would be important to find molecular signatures able to stratify patients into high and low risk groups. This would allow oncologists to suggest the best therapeutic strategy in a personalized way, avoiding unnecessary toxicity and reducing the high costs of treatment. Here we compare two independent patient stratification strategies for TNBC based on gene expression data: The first is focusing on the epithelial mesenchymal transition (EMT) and the second on the tumor immune microenvironment. Our results show that the two stratification strategies are not directly related, suggesting that the aggressiveness of the tumor can be due to a multitude of unrelated factors. In particular, the EMT stratification is able to identify a high-risk population with high immune markers that is, however, not properly classified by the tumor immune microenvironment based strategy.
5. Narykov O, Johnson NT, Korkin D. Predicting protein interaction network perturbation by alternative splicing with semi-supervised learning. Cell Rep 2021; 37:110045. [PMID: 34818539 DOI: 10.1016/j.celrep.2021.110045] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/01/2020] [Revised: 07/21/2021] [Accepted: 11/02/2021] [Indexed: 10/19/2022] Open
Abstract
Alternative splicing introduces an additional layer of protein diversity and complexity in regulating cellular functions that can be specific to the tissue and cell type, physiological state of a cell, or disease phenotype. Recent high-throughput experimental studies have illuminated the functional role of splicing events through rewiring protein-protein interactions; however, the extent to which the macromolecular interactions are affected by alternative splicing has yet to be fully understood. In silico methods provide a fast and cheap alternative to interrogating functional characteristics of thousands of alternatively spliced isoforms. Here, we develop an accurate feature-based machine learning approach that predicts whether a protein-protein interaction carried out by a reference isoform is perturbed by an alternatively spliced isoform. Our method, called the alternatively spliced interactions prediction (ALT-IN) tool, is compared with the state-of-the-art PPI prediction tools and shows superior performance, achieving 0.92 in precision and recall values.
Affiliation(s)
- Oleksandr Narykov
  - Department of Computer Science, and Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute, Worcester, MA, USA
- Nathan T Johnson
  - Department of Computer Science, and Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute, Worcester, MA, USA
  - Harvard Program in Therapeutic Sciences, Harvard Medical School, and Breast Tumor Immunology Laboratory, Dana-Farber Cancer Institute, Boston, MA, USA
- Dmitry Korkin
  - Department of Computer Science, and Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute, Worcester, MA, USA
6. Tran KA, Kondrashova O, Bradley A, Williams ED, Pearson JV, Waddell N. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med 2021; 13:152. [PMID: 34579788 PMCID: PMC8477474 DOI: 10.1186/s13073-021-00968-x] [Citation(s) in RCA: 301] [Impact Index Per Article: 100.3] [Received: 12/13/2020] [Accepted: 09/12/2021] [Indexed: 12/13/2022] Open
Abstract
Deep learning is a subdiscipline of artificial intelligence that uses a machine learning technique called artificial neural networks to extract patterns and make predictions from large data sets. The increasing adoption of deep learning across healthcare domains together with the availability of highly characterised cancer datasets has accelerated research into the utility of deep learning in the analysis of the complex biology of cancer. While early results are promising, this is a rapidly evolving field with new knowledge emerging in both cancer biology and deep learning. In this review, we provide an overview of emerging deep learning techniques and how they are being applied to oncology. We focus on the deep learning applications for omics data types, including genomic, methylation and transcriptomic data, as well as histopathology-based genomic inference, and provide perspectives on how the different data types can be integrated to develop decision support tools. We provide specific examples of how deep learning may be applied in cancer diagnosis, prognosis and treatment management. We also assess the current limitations and challenges for the application of deep learning in precision oncology, including the lack of phenotypically rich data and the need for more explainable deep learning models. Finally, we conclude with a discussion of how current obstacles can be overcome to enable future clinical utilisation of deep learning.
Affiliation(s)
- Khoa A. Tran
  - Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, 4006 Australia
  - School of Biomedical Sciences, Faculty of Health, Queensland University of Technology (QUT), Brisbane, 4059 Australia
- Olga Kondrashova
  - Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, 4006 Australia
- Andrew Bradley
  - Faculty of Engineering, Queensland University of Technology (QUT), Brisbane, 4000 Australia
- Elizabeth D. Williams
  - School of Biomedical Sciences, Faculty of Health, Queensland University of Technology (QUT), Brisbane, 4059 Australia
  - Australian Prostate Cancer Research Centre - Queensland (APCRC-Q) and Queensland Bladder Cancer Initiative (QBCI), Brisbane, 4102 Australia
- John V. Pearson
  - Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, 4006 Australia
- Nicola Waddell
  - Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, 4006 Australia
7. Albaradei S, Thafar M, Alsaedi A, Van Neste C, Gojobori T, Essack M, Gao X. Machine learning and deep learning methods that use omics data for metastasis prediction. Comput Struct Biotechnol J 2021; 19:5008-5018. [PMID: 34589181 PMCID: PMC8450182 DOI: 10.1016/j.csbj.2021.09.001] [Citation(s) in RCA: 77] [Impact Index Per Article: 25.7] [Received: 01/18/2021] [Revised: 08/16/2021] [Accepted: 09/02/2021] [Indexed: 12/14/2022] Open
Abstract
Knowing that metastasis is the primary cause of cancer-related deaths has incentivized research directed towards unraveling the complex cellular processes that drive metastasis. Advancement in technology, and specifically the advent of high-throughput sequencing, provides knowledge of such processes. This knowledge led to the development of therapeutic and clinical applications, and is now being used to predict the onset of metastasis to improve diagnostics and disease therapies. In this regard, predicting metastasis onset has also been explored using artificial intelligence approaches based on machine learning and, more recently, deep learning. This review summarizes the different machine learning and deep learning-based metastasis prediction methods developed to date. We also detail the different types of molecular data used to build the models and the critical signatures derived from the different methods. We further highlight the challenges associated with using machine learning and deep learning methods, and provide suggestions to improve the predictive performance of such methods.
Key Words
- AE, autoencoder
- ANN, Artificial Neural Network
- AUC, area under the curve
- Acc, Accuracy
- Artificial intelligence
- BC, Betweenness centrality
- BH, Benjamini-Hochberg
- BioGRID, Biological General Repository for Interaction Datasets
- CCP, compound covariate predictor
- CEA, Carcinoembryonic antigen
- CNN, convolution neural networks
- CV, cross-validation
- Cancer
- DBN, deep belief network
- DDBN, discriminative deep belief network
- DEGs, differentially expressed genes
- DIP, Database of Interacting Proteins
- DNN, Deep neural network
- DT, Decision Tree
- Deep learning
- EMT, epithelial-mesenchymal transition
- FC, fully connected
- GA, Genetic Algorithm
- GANs, generative adversarial networks
- GEO, Gene Expression Omnibus
- HCC, hepatocellular carcinoma
- HPRD, Human Protein Reference Database
- KNN, K-nearest neighbor
- L-SVM, linear SVM
- LIMMA, linear models for microarray data
- LOOCV, Leave-one-out cross-validation
- LR, Logistic Regression
- MCCV, Monte Carlo cross-validation
- MLP, multilayer perceptron
- Machine learning
- Metastasis
- NPV, negative predictive value
- PCA, Principal component analysis
- PPI, protein-protein interaction
- PPV, positive predictive value
- RC, ridge classifier
- RF, Random Forest
- RFE, recursive feature elimination
- RMA, robust multi‐array average
- RNN, recurrent neural networks
- SGD, stochastic gradient descent
- SMOTE, synthetic minority over-sampling technique
- SVM, Support Vector Machine
- Se, sensitivity
- Sp, specificity
- TCGA, The Cancer Genome Atlas
- k-CV, k-fold cross validation
- mRMR, minimum redundancy maximum relevance
Affiliation(s)
- Somayah Albaradei
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - King Abdulaziz University, Faculty of Computing and Information Technology, Jeddah, Saudi Arabia
- Maha Thafar
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - Taif University, College of Computers and Information Technology, Taif, Saudi Arabia
- Asim Alsaedi
  - King Saud bin Abdulaziz University for Health Sciences, Jeddah, Saudi Arabia
  - King Abdulaziz Medical City, Jeddah, Saudi Arabia
- Christophe Van Neste
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Takashi Gojobori
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - Biological and Environmental Sciences and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Magbubah Essack
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Xin Gao
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
8. Manjang K, Yli-Harja O, Dehmer M, Emmert-Streib F. Limitations of Explainability for Established Prognostic Biomarkers of Prostate Cancer. Front Genet 2021; 12:649429. [PMID: 34367234 PMCID: PMC8340016 DOI: 10.3389/fgene.2021.649429] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/04/2021] [Accepted: 06/01/2021] [Indexed: 11/28/2022] Open
Abstract
High-throughput technologies do not only provide novel means for basic biological research but also for clinical applications in hospitals. For instance, the usage of gene expression profiles as prognostic biomarkers for predicting cancer progression has found widespread interest. Aside from predicting the progression of patients, it is generally believed that such prognostic biomarkers also provide valuable information about disease mechanisms and the underlying molecular processes that are causal for a disorder. However, the latter assumption has been challenged. In this paper, we study this problem for prostate cancer. Specifically, we investigate a large number of previously published prognostic signatures of prostate cancer based on gene expression profiles and show that none of these can provide unique information about the underlying disease etiology of prostate cancer. Hence, our analysis reveals that none of the studied signatures has a sensible biological meaning. Overall, this shows that all studied prognostic signatures are merely black-box models allowing sensible predictions of prostate cancer outcome but are not capable of providing causal explanations to enhance the understanding of prostate cancer.
Affiliation(s)
- Kalifa Manjang
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Olli Yli-Harja
  - Computational Systems Biology, Tampere University, Tampere, Finland
  - Institute for Systems Biology, Seattle, WA, United States
  - Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
- Matthias Dehmer
  - Department of Computer Science, Swiss Distance University of Applied Sciences, Brig, Switzerland
  - Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall, Austria
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Frank Emmert-Streib
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
  - Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
9. Classification of triple-negative breast cancers through a Boolean network model of the epithelial-mesenchymal transition. Cell Syst 2021; 12:457-462.e4. [PMID: 33961788 DOI: 10.1016/j.cels.2021.04.007] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/07/2020] [Revised: 03/25/2021] [Accepted: 04/13/2021] [Indexed: 12/29/2022]
Abstract
Predicting the metastasis risk in patients with a primary breast cancer tumor is of fundamental importance to decide the best therapeutic strategy in the framework of personalized medicine. Here, we present ARIADNE, a general algorithmic strategy to assess the risk of metastasis from transcriptomic data of patients with triple-negative breast cancer, a subtype of breast cancer with poorer prognosis with respect to the other subtypes. ARIADNE identifies hybrid epithelial/mesenchymal phenotypes by mapping gene expression data into the states of a Boolean network model of the epithelial-mesenchymal pathway. Using this mapping, it is possible to stratify patients according to their prognosis, as we show by validating the strategy with three independent cohorts of triple-negative breast cancer patients. Our strategy provides a prognostic tool that could be applied to other biologically relevant pathways, in order to estimate the metastatic risk for other breast cancer subtypes or other tumor types. A record of this paper's transparent peer review process is included in the supplemental information.
10. Seifert S, Gundlach S, Junge O, Szymczak S. Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study. Bioinformatics 2021; 36:4301-4308. [PMID: 32399562 PMCID: PMC7520048 DOI: 10.1093/bioinformatics/btaa483] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/09/2019] [Revised: 03/13/2020] [Accepted: 05/05/2020] [Indexed: 12/12/2022] Open
Abstract
MOTIVATION: High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets.
RESULTS: The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate.
AVAILABILITY AND IMPLEMENTATION: An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Stephan Seifert
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
- Sven Gundlach
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
- Olaf Junge
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
- Silke Szymczak
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
11. Yoosuf N, Navarro JF, Salmén F, Ståhl PL, Daub CO. Identification and transfer of spatial transcriptomics signatures for cancer diagnosis. Breast Cancer Res 2020; 22:6. [PMID: 31931856 PMCID: PMC6958738 DOI: 10.1186/s13058-019-1242-9] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 08/24/2019] [Accepted: 12/27/2019] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND: Distinguishing ductal carcinoma in situ (DCIS) from invasive ductal carcinoma (IDC) regions in clinical biopsies constitutes a diagnostic challenge. Spatial transcriptomics (ST) is an in situ capturing method, which allows quantification and visualization of transcriptomes in individual tissue sections. In the past, studies have shown that breast cancer samples can be used to study their transcriptomes with spatial resolution in individual tissue sections. Previously, supervised machine learning methods were used in clinical studies to predict the clinical outcomes for cancer types.
METHODS: We used four publicly available ST breast cancer datasets from breast tissue sections annotated by pathologists as non-malignant, DCIS, or IDC. We trained and tested a machine learning method (support vector machine) based on the expert annotation as well as based on automatic selection of cell types by their transcriptome profiles.
RESULTS: We identified expression signatures for expert annotated regions (non-malignant, DCIS, and IDC) and built machine learning models. Classification results for 798 expression signature transcripts showed high coincidence with the expert pathologist annotation for DCIS (100%) and IDC (96%). Extending our analysis to include all 25,179 expressed transcripts resulted in an accuracy of 99% for DCIS and 98% for IDC. Further, classification based on an automatically identified expression signature covering all ST spots of tissue sections resulted in prediction accuracy of 95% for DCIS and 91% for IDC.
CONCLUSIONS: This concept study suggests that the ST signatures learned from expert selected breast cancer tissue sections can be used to identify breast cancer regions in whole tissue sections including regions not trained on. Furthermore, the identified expression signatures can classify cancer regions in tissue sections not used for training with high accuracy. Expert-generated, and even automatically generated, cancer signatures from ST data might be able to classify breast cancer regions and provide clinical decision support for pathologists in the future.
Affiliation(s)
- Niyaz Yoosuf
  - Department of Biosciences and Nutrition, Karolinska Institutet, 141 83, Huddinge, Sweden
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden
- José Fernández Navarro
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden
- Fredrik Salmén
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden
  - Hubrecht Institute-KNAW (Royal Netherlands Academy of Arts and Sciences) and University Medical Center Utrecht, Cancer Genomics Netherlands, Utrecht, the Netherlands
- Patrik L Ståhl
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden
- Carsten O Daub
  - Department of Biosciences and Nutrition, Karolinska Institutet, 141 83, Huddinge, Sweden
12. DeepCC: a novel deep learning-based framework for cancer molecular subtype classification. Oncogenesis 2019; 8:44. [PMID: 31420533 PMCID: PMC6697729 DOI: 10.1038/s41389-019-0157-8] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 12/19/2018] [Revised: 05/13/2019] [Accepted: 06/27/2019] [Indexed: 12/20/2022] Open
Abstract
Molecular subtyping of cancer is a critical step towards more individualized therapy and provides important biological insights into cancer heterogeneity. Although gene expression signature-based classification has been widely demonstrated to be an effective approach in the last decade, the widespread implementation has long been limited by platform differences, batch effects, and the difficulty of classifying individual patient samples. Here, we describe a novel supervised cancer classification framework, deep cancer subtype classification (DeepCC), based on deep learning of functional spectra quantifying activities of biological pathways. In two case studies about colorectal and breast cancer classification, DeepCC classifiers and DeepCC single sample predictors both achieved overall higher sensitivity, specificity, and accuracy compared with other widely used classification methods such as random forests (RF), support vector machine (SVM), gradient boosting machine (GBM), and multinomial logistic regression algorithms. Simulation analysis based on random subsampling of genes demonstrated the robustness of DeepCC to missing data. Moreover, deep features learned by DeepCC captured biological characteristics associated with distinct molecular subtypes, enabling more compact within-subtype distribution and between-subtype separation of patient samples, and therefore greatly reducing the number of previously unclassifiable samples. In summary, DeepCC provides a novel cancer classification framework that is platform independent, robust to missing data, and can be used for single sample prediction, facilitating clinical implementation of cancer molecular subtyping.
13
|
Wang W, Kandimalla R, Huang H, Zhu L, Li Y, Gao F, Goel A, Wang X. Molecular subtyping of colorectal cancer: Recent progress, new challenges and emerging opportunities. Semin Cancer Biol 2019; 55:37-52. [PMID: 29775690 PMCID: PMC6240404 DOI: 10.1016/j.semcancer.2018.05.002] [Citation(s) in RCA: 115] [Impact Index Per Article: 23.0] [Received: 11/27/2017] [Revised: 05/13/2018] [Accepted: 05/14/2018] [Indexed: 12/13/2022]
Abstract
Colorectal cancer (CRC) is one of the leading causes of cancer-related deaths worldwide. Similar to many other malignancies, CRC is a heterogeneous disease, making it a clinical challenge for optimization of treatment modalities in reducing the morbidity and mortality associated with this disease. A more precise understanding of the biological properties that distinguish patients with colorectal tumors, especially in terms of their clinical features, is a key requirement towards a more robust, targeted-drug design, and implementation of individualized therapies. In recent decades, extensive studies have reported distinct CRC subtypes, with a mutation-centered view of tumor heterogeneity. However, more recently, the paradigm has shifted towards transcriptome-based classifications, represented by six independent CRC taxonomies. In 2015, the colorectal cancer subtyping consortium reported the identification of four consensus molecular subtypes (CMSs), providing thus far the most robust classification system for CRC. In this review, we summarize the historical timeline of CRC classification approaches and discuss their salient features and potential limitations that may require further refinement in the near future. In other words, in spite of the recent encouraging progress, several major challenges prevent translation of molecular knowledge gleaned from CMSs into the clinic. Herein, we summarize some of these potential challenges and discuss exciting new opportunities currently emerging in related fields. We believe close collaborations between basic researchers, bioinformaticians and clinicians are imperative for addressing these challenges, and eventually paving the path for CRC subtyping into routine clinical practice as we enter the era of personalized medicine.
Affiliation(s)
- Wei Wang
  - Department of Biomedical Sciences, City University of Hong Kong, Hong Kong
- Raju Kandimalla
  - Center for Gastrointestinal Research, Center for Translational Genomics and Oncology, Baylor Scott & White Research Institute and Charles A Sammons Cancer Center, Baylor Research Institute and Sammons Cancer Center, Baylor University Medical Center, 3410 Worth Street, Suite 610, Dallas, TX 75246, USA
- Hao Huang
  - College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong
- Lina Zhu
  - College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong
- Ying Li
  - Department of Biomedical Sciences, City University of Hong Kong, Hong Kong
- Feng Gao
  - College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong
- Ajay Goel
  - Center for Gastrointestinal Research, Center for Translational Genomics and Oncology, Baylor Scott & White Research Institute and Charles A Sammons Cancer Center, Baylor Research Institute and Sammons Cancer Center, Baylor University Medical Center, 3410 Worth Street, Suite 610, Dallas, TX 75246, USA
- Xin Wang
  - Department of Biomedical Sciences, City University of Hong Kong, Hong Kong
|
14
|
Giraud P, Giraud P, Gasnier A, El Ayachy R, Kreps S, Foy JP, Durdux C, Huguet F, Burgun A, Bibault JE. Radiomics and Machine Learning for Radiotherapy in Head and Neck Cancers. Front Oncol 2019; 9:174. [PMID: 30972291 PMCID: PMC6445892 DOI: 10.3389/fonc.2019.00174] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 12/12/2018] [Accepted: 02/28/2019] [Indexed: 12/13/2022] Open
Abstract
Introduction: An increasing number of parameters can be considered when making decisions in oncology. Tumor characteristics can also be extracted from imaging through the use of radiomics and add to this wealth of clinical data. Machine learning can encompass these parameters and thus enhance clinical decision-making as well as the radiotherapy workflow. Methods: We performed a description of machine learning applications at each step of treatment by radiotherapy in head and neck cancers. We then performed a systematic review on radiomics and machine learning outcome prediction models in head and neck cancers. Results: Machine learning has several promising applications in treatment planning, with improvements in automatic organ-at-risk delineation and automation of the adaptive radiotherapy workflow. It may also provide new approaches for Normal Tissue Complication Probability models. Radiomics may provide additional data on tumors for improved machine learning-powered predictive models, not only on survival, but also on risk of distant metastasis, in-field recurrence, HPV status and extranodal spread. However, most studies provide preliminary data requiring further validation. Conclusion: Promising perspectives arise from machine learning applications and radiomics-based models, yet further data are necessary for their implementation in daily care.
Affiliation(s)
- Paul Giraud
  - Radiation Oncology Department, Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
  - Cancer Research and Personalized Medicine-Integrated Cancer Research Center (SIRIC), Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
- Philippe Giraud
  - Radiation Oncology Department, Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
  - Cancer Research and Personalized Medicine-Integrated Cancer Research Center (SIRIC), Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
- Anne Gasnier
  - Radiation Oncology Department, Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
  - Cancer Research and Personalized Medicine-Integrated Cancer Research Center (SIRIC), Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
- Radouane El Ayachy
  - Radiation Oncology Department, Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
  - Cancer Research and Personalized Medicine-Integrated Cancer Research Center (SIRIC), Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
- Sarah Kreps
  - Radiation Oncology Department, Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
  - Cancer Research and Personalized Medicine-Integrated Cancer Research Center (SIRIC), Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
- Jean-Philippe Foy
  - Department of Oral and Maxillo-Facial Surgery, Sorbonne University, Pitié-Salpêtrière Hospital, Paris, France
  - Univ Lyon, Université Claude Bernard Lyon 1, INSERM 1052, CNRS 5286, Centre Léon Bérard, Centre de Recherche en Cancérologie de Lyon, Lyon, France
- Catherine Durdux
  - Radiation Oncology Department, Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
  - Cancer Research and Personalized Medicine-Integrated Cancer Research Center (SIRIC), Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
- Florence Huguet
  - Department of Radiation Oncology, Tenon University Hospital, Hôpitaux Universitaires Est Parisien, Sorbonne University Medical Faculty, Paris, France
- Anita Burgun
  - Cancer Research and Personalized Medicine-Integrated Cancer Research Center (SIRIC), Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
  - INSERM UMR 1138 Team 22: Information Sciences to support Personalized Medicine, Paris Descartes University, Sorbonne Paris Cité, Paris, France
- Jean-Emmanuel Bibault
  - Radiation Oncology Department, Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
  - Cancer Research and Personalized Medicine-Integrated Cancer Research Center (SIRIC), Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris Descartes University, Paris Sorbonne Cité, Paris, France
  - INSERM UMR 1138 Team 22: Information Sciences to support Personalized Medicine, Paris Descartes University, Sorbonne Paris Cité, Paris, France
|
15
|
Degenhardt F, Seifert S, Szymczak S. Evaluation of variable selection methods for random forests and omics data sets. Brief Bioinform 2019; 20:492-503. [PMID: 29045534 PMCID: PMC6433899 DOI: 10.1093/bib/bbx124] [Citation(s) in RCA: 262] [Impact Index Per Article: 52.4] [Received: 04/28/2017] [Revised: 09/06/2017] [Indexed: 12/28/2022] Open
Abstract
Machine learning methods and in particular random forests are promising approaches for prediction based on high dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the objective is the identification of involved variables to find active networks and pathways, approaches that aim to select all relevant variables should be preferred. We evaluated several variable selection procedures based on simulated data as well as publicly available experimental methylation and gene expression data. Our comparison included the Boruta algorithm, the Vita method, recurrent relative variable importance, a permutation approach and its parametric variant (Altmann) as well as recursive feature elimination (RFE). In our simulation studies, Boruta was the most powerful approach, followed closely by the Vita method. Both approaches demonstrated similar stability in variable selection, while Vita was the most robust approach under a pure null model without any predictor variables related to the outcome. In the analysis of the different experimental data sets, Vita demonstrated slightly better stability in variable selection and was less computationally intensive than Boruta. In conclusion, we recommend the Boruta and Vita approaches for the analysis of high-dimensional data sets. Vita is considerably faster than Boruta and thus more suitable for large data sets, but only Boruta can also be applied in low-dimensional settings.
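The permutation approach named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. As a clearly labeled assumption, the absolute Pearson correlation stands in for a random forest importance score, and the helper names (`abs_pearson`, `permutation_p_value`) are hypothetical. Only the general idea is shown: shuffle the outcome to build a null distribution of importance scores, then report how often the null matches or exceeds the observed score.

```python
import random


def abs_pearson(x, y):
    """Absolute Pearson correlation; a stand-in (assumption) for an RF importance score."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no signal
    return abs(cov) / (var_x * var_y) ** 0.5


def permutation_p_value(feature, outcome, n_permutations=1000, seed=0):
    """Empirical p-value of the observed score under outcome permutation."""
    rng = random.Random(seed)
    observed = abs_pearson(feature, outcome)
    shuffled = list(outcome)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between feature and outcome
        if abs_pearson(feature, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_permutations + 1)
```

For a feature that tracks the outcome closely, almost no permutation reaches the observed score, so the estimate approaches 1 / (n_permutations + 1).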
Affiliation(s)
- Stephan Seifert
  - Institute of Medical Informatics and Statistics, Kiel University, Germany
- Silke Szymczak
  - Institute of Medical Informatics and Statistics, Kiel University, Germany
|
16
|
La Porta CAM, Zapperi S. Explaining the dynamics of tumor aggressiveness: At the crossroads between biology, artificial intelligence and complex systems. Semin Cancer Biol 2018; 53:42-47. [PMID: 30017637 DOI: 10.1016/j.semcancer.2018.07.003] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 05/14/2018] [Revised: 06/28/2018] [Accepted: 07/09/2018] [Indexed: 01/08/2023]
Abstract
Facing metastasis is the most pressing challenge of cancer research. In this review, we discuss recent advances in understanding phenotypic plasticity of cancer cells, highlighting the kinetics of cancer stem cells and the role of the epithelial-mesenchymal transition in metastasis. It appears that the tumor micro-environment plays a crucial role in triggering phenotypic transitions, as we illustrate by discussing the challenges posed by macrophages and cancer-associated fibroblasts. To disentangle the complexity of environmentally induced phenotypic transitions, there is a growing need for novel advanced algorithms such as those proposed in our recent work combining single cell data analysis and numerical simulations of gene regulatory networks. We conclude by discussing recent developments in artificial intelligence and its applications to personalized cancer treatment.
Affiliation(s)
- Caterina A M La Porta
  - Center for Complexity and Biosystems, University of Milan, via Celoria 16, 20133 Milano, Italy
  - Department of Environmental Science and Policy, University of Milan, via Celoria 26, 20133 Milano, Italy
- Stefano Zapperi
  - Center for Complexity and Biosystems, University of Milan, via Celoria 16, 20133 Milano, Italy
  - Department of Physics, University of Milan, via Celoria 16, 20133 Milano, Italy
  - CNR - Consiglio Nazionale delle Ricerche, ICMATE, Via R. Cozzi 53, 20125 Milano, Italy
|
17
|
Dorman SN, Baranova K, Knoll JHM, Urquhart BL, Mariani G, Carcangiu ML, Rogan PK. Genomic signatures for paclitaxel and gemcitabine resistance in breast cancer derived by machine learning. Mol Oncol 2015; 10:85-100. [PMID: 26372358 DOI: 10.1016/j.molonc.2015.07.006] [Citation(s) in RCA: 87] [Impact Index Per Article: 9.7] [Received: 07/20/2015] [Accepted: 07/31/2015] [Indexed: 12/21/2022] Open
Abstract
Increasingly, the effectiveness of adjuvant chemotherapy agents for breast cancer has been related to changes in the genomic profile of tumors. We investigated correspondence between growth inhibitory concentrations of paclitaxel and gemcitabine (GI50) and gene copy number, mutation, and expression first in breast cancer cell lines and then in patients. Genes encoding direct targets of these drugs, metabolizing enzymes, transporters, and those previously associated with chemoresistance to paclitaxel (n = 31 genes) or gemcitabine (n = 18) were analyzed. A multi-factorial, principal component analysis (MFA) indicated expression was the strongest indicator of sensitivity for paclitaxel, and copy number and expression were informative for gemcitabine. The factors were combined using support vector machines (SVM). Expression of 15 genes (ABCC10, BCL2, BCL2L1, BIRC5, BMF, FGF2, FN1, MAP4, MAPT, NFKB2, SLCO1B3, TLR6, TMEM243, TWIST1, and CSAG2) predicted cell line sensitivity to paclitaxel with 82% accuracy. Copy number profiles of 3 genes (ABCC10, NT5C, TYMS) together with expression of 7 genes (ABCB1, ABCC10, CMPK1, DCTD, NME1, RRM1, RRM2B), predicted gemcitabine response with 85% accuracy. Expression and copy number studies of two independent sets of patients with known responses were then analyzed with these models. These included tumor blocks from 21 patients that were treated with both paclitaxel and gemcitabine, and 319 patients on paclitaxel and anthracycline therapy. A new paclitaxel SVM was derived from an 11-gene subset since data for 4 of the original genes was unavailable. The accuracy of this SVM was similar in cell lines and tumor blocks (70-71%). The gemcitabine SVM exhibited 62% prediction accuracy for the tumor blocks due to the presence of samples with poor nucleic acid integrity. Nevertheless, the paclitaxel SVM predicted sensitivity in 84% of patients with no or minimal residual disease.
Affiliation(s)
- Stephanie N Dorman
  - Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
- Katherina Baranova
  - Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
- Joan H M Knoll
  - Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
  - Molecular Diagnostics Division, Laboratory Medicine Program, London Health Sciences Centre, ON, Canada
  - Cytognomix Inc., London, ON, Canada
- Brad L Urquhart
  - Department of Physiology and Pharmacology, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
- Gabriella Mariani
  - Department of Medical Oncology, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Maria Luisa Carcangiu
  - Department of Diagnostic and Laboratory Pathology, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Peter K Rogan
  - Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
  - Cytognomix Inc., London, ON, Canada
  - Department of Computer Science, University of Western Ontario, London, ON, Canada
  - Department of Oncology, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
|
18
|
Veytsman B, Wang L, Cui T, Bruskin S, Baranova A. Distance-based classifiers as potential diagnostic and prediction tools for human diseases. BMC Genomics 2015; 15 Suppl 12:S10. [PMID: 25563076 PMCID: PMC4303935 DOI: 10.1186/1471-2164-15-s12-s10] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
Typically, gene expression biomarkers are discovered in the course of high-throughput experiments, for example, RNAseq or microarray profiling. Analytic pipelines that extract so-called signatures suffer from the "Dimensionality curse": the number of genes expressed exceeds the number of patients we can enroll in the study and use to train the discriminator algorithm. Hence, problems with the reproducibility of gene signatures are more common than not; when the algorithm is executed using a different training set, the resulting diagnostic signature may turn out to be completely different. In this paper we propose an alternative novel approach which takes into account quantifiable expression levels of all genes assayed. In our analysis, the cumulative gene expression pattern of an individual patient is represented as a point in the multidimensional space formed by all gene expression profiles assayed in a given system, where the clusters of "normal samples" and "affected samples" are defined. The degree of separation of the given sample from the space occupied by "normal samples" reflects the drift of the sample away from homeostasis in the course of development of the pathophysiological process that underlies the disease. The outlined approach was validated using the publicly available glioma dataset deposited in Rembrandt and associated with survival data. Additionally, the applicability of the distance analysis to the classification of non-malignant samples was tested using psoriatic lesions and non-lesional matched controls as a model. Keywords: biomarkers; clustering; human diseases; RNA
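The geometric idea in this abstract (a patient as a point in expression space, scored by its separation from the "normal" cluster) can be sketched as follows. The abstract does not state which distance or cluster summary the authors use, so this sketch assumes Euclidean distance to the cluster centroid. The function names are hypothetical and the numbers in the usage note are made-up toy values.

```python
import math


def centroid(samples):
    """Component-wise mean of a list of equal-length expression profiles."""
    count = len(samples)
    return [sum(values) / count for values in zip(*samples)]


def distance_to_cluster(sample, cluster_samples):
    """Euclidean distance from one profile to a cluster centroid (assumed metric)."""
    return math.dist(sample, centroid(cluster_samples))


def nearest_cluster(sample, normal_samples, affected_samples):
    """Label a sample by the closer centroid; ties go to 'normal'.

    Also returns the distance from the normal centroid, which plays the
    role of the "drift away from homeostasis" described in the abstract.
    """
    d_normal = distance_to_cluster(sample, normal_samples)
    d_affected = distance_to_cluster(sample, affected_samples)
    label = "normal" if d_normal <= d_affected else "affected"
    return label, d_normal
```

With toy two-gene profiles, `nearest_cluster([1.0, 1.0], [[0.0, 0.0], [2.0, 0.0]], [[10.0, 10.0], [12.0, 10.0]])` labels the sample as normal, because it lies one unit from the normal centroid and far from the affected one.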
|
19
|
Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J 2014; 13:8-17. [PMID: 25750696 PMCID: PMC4348437 DOI: 10.1016/j.csbj.2014.11.005] [Citation(s) in RCA: 1160] [Impact Index Per Article: 116.0] [Indexed: 02/06/2023] Open
Abstract
Cancer has been characterized as a heterogeneous disease consisting of many different subtypes. The early diagnosis and prognosis of a cancer type have become a necessity in cancer research, as it can facilitate the subsequent clinical management of patients. The importance of classifying cancer patients into high or low risk groups has led many research teams, from the biomedical and the bioinformatics field, to study the application of machine learning (ML) methods. Therefore, these techniques have been utilized as an aim to model the progression and treatment of cancerous conditions. In addition, the ability of ML tools to detect key features from complex datasets reveals their importance. A variety of these techniques, including Artificial Neural Networks (ANNs), Bayesian Networks (BNs), Support Vector Machines (SVMs) and Decision Trees (DTs) have been widely applied in cancer research for the development of predictive models, resulting in effective and accurate decision making. Even though it is evident that the use of ML methods can improve our understanding of cancer progression, an appropriate level of validation is needed in order for these methods to be considered in the everyday clinical practice. In this work, we present a review of recent ML approaches employed in the modeling of cancer progression. The predictive models discussed here are based on various supervised ML techniques as well as on different input features and data samples. Given the growing trend on the application of ML methods in cancer research, we present here the most recent publications that employ these techniques as an aim to model cancer risk or patient outcomes.
Key Words
- ANN, Artificial Neural Network
- AUC, Area Under Curve
- BCRSVM, Breast Cancer Support Vector Machine
- BN, Bayesian Network
- CFS, Correlation based Feature Selection
- Cancer recurrence
- Cancer survival
- Cancer susceptibility
- DT, Decision Tree
- ES, Early Stopping algorithm
- GEO, Gene Expression Omnibus
- HTT, High-throughput Technologies
- LCS, Learning Classifying Systems
- ML, Machine Learning
- Machine learning
- NCI caArray, National Cancer Institute Array Data Management System
- NSCLC, Non-small Cell Lung Cancer
- OSCC, Oral Squamous Cell Carcinoma
- PPI, Protein–Protein Interaction
- Predictive models
- ROC, Receiver Operating Characteristic
- SEER, Surveillance, Epidemiology and End results Database
- SSL, Semi-supervised Learning
- SVM, Support Vector Machine
- TCGA, The Cancer Genome Atlas Research Network
Affiliation(s)
- Konstantina Kourou
  - Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina, Greece
- Themis P Exarchos
  - Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina, Greece
  - IMBB - FORTH, Dept. of Biomedical Research, Ioannina, Greece
- Konstantinos P Exarchos
  - Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina, Greece
- Michalis V Karamouzis
  - Molecular Oncology Unit, Department of Biological Chemistry, Medical School, University of Athens, Athens, Greece
- Dimitrios I Fotiadis
  - Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina, Greece
  - IMBB - FORTH, Dept. of Biomedical Research, Ioannina, Greece
|
20
|
Rivas-Perea P, Baker E, Hamerly G, Shaw BF. Detection of leukocoria using a soft fusion of expert classifiers under non-clinical settings. BMC Ophthalmol 2014; 14:110. [PMID: 25204762 PMCID: PMC4167153 DOI: 10.1186/1471-2415-14-110] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 03/31/2014] [Accepted: 08/21/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Leukocoria is defined as a white reflection and its manifestation is symptomatic of several ocular pathologies, including retinoblastoma (Rb). Early detection of recurrent leukocoria is critical for improved patient outcomes and can be accomplished via the examination of recreational photography. To date, there exists a paucity of methods to automate leukocoria detection within such a dataset. METHODS: This research explores a novel classification scheme that uses fuzzy logic theory to combine a number of classifiers that are experts in performing multichannel detection of leukocoria from recreational photography. The proposed scheme extracts features aided by the discrete cosine transform and the Karhunen-Loeve transformation. RESULTS: The soft fusion of classifiers is significantly better than other methods of combining classifiers with p = 1.12 × 10⁻⁵. The proposed methodology performs at a 92% accuracy rate, with an 89% true positive rate, and an 11% false positive rate. Furthermore, the results produced by our methodology exhibit the lowest average variance. CONCLUSIONS: The proposed methodology overcomes non-ideal conditions of image acquisition, presenting a competent approach for the detection of leukocoria. Results suggest that recreational photography can be used in combination with the fusion of individual experts in multichannel classification and preprocessing tools such as the discrete cosine transform and the Karhunen-Loeve transformation.
Affiliation(s)
- Pablo Rivas-Perea
  - Department of Computer Science, Baylor University, One Bear Place #97356, Waco, TX 76798-7356, USA
|
21
|
Endesfelder D, Burrell R, Kanu N, McGranahan N, Howell M, Parker PJ, Downward J, Swanton C, Kschischo M. Chromosomal instability selects gene copy-number variants encoding core regulators of proliferation in ER+ breast cancer. Cancer Res 2014; 74:4853-4863. [PMID: 24970479 PMCID: PMC4167338 DOI: 10.1158/0008-5472.can-13-2664] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Indexed: 12/20/2022]
Abstract
Chromosomal instability (CIN) is associated with poor outcome in epithelial malignancies, including breast carcinomas. Evidence suggests that prognostic signatures in estrogen receptor-positive (ER(+)) breast cancer define tumors with CIN and high proliferative potential. Intriguingly, CIN induction in lower eukaryotic cells and human cells is context dependent, typically resulting in a proliferation disadvantage but conferring a fitness benefit under strong selection pressures. We hypothesized that CIN permits accelerated genomic evolution through the generation of diverse DNA copy-number events that may be selected during disease development. In support of this hypothesis, we found evidence for selection of gene amplification of core regulators of proliferation in CIN-associated cancer genomes. Stable DNA copy-number amplifications of the core regulators TPX2 and UBE2C were associated with expression of a gene module involved in proliferation. The module genes were enriched within prognostic signature gene sets for ER(+) breast cancer, providing a logical connection between CIN and prognostic signature expression. Our results provide a framework to decipher the impact of intratumor heterogeneity on key cancer phenotypes, and they suggest that CIN provides a permissive landscape for selection of copy-number alterations that drive cancer proliferation.
Affiliation(s)
- David Endesfelder
  - Department of Mathematics and Technology, RheinAhrCampus, University of Applied Sciences Koblenz, 53424 Remagen, Germany
  - Helmholtz Zentrum München, German Research Center for Environmental Health (GmbH), Institute of Biomathematics and Biometry, Scientific Computing Research Unit, Neuherberg, Germany
- Rebecca Burrell
  - Translational Cancer Therapeutics Laboratory, Cancer Research UK London Research Institute, 44 Lincoln’s Inn Fields, London WC2A 3PX, UK
- Nnennaya Kanu
  - UCL Cancer Institute, Paul O’Gorman Building, 72 Huntley Street, WC1E 6BT London
- Nicholas McGranahan
  - Translational Cancer Therapeutics Laboratory, Cancer Research UK London Research Institute, 44 Lincoln’s Inn Fields, London WC2A 3PX, UK
- Mike Howell
  - High Throughput Screening Laboratory, Cancer Research UK London Research Institute, London, United Kingdom
- Peter J. Parker
  - Protein Phosphorylation Laboratory, Cancer Research UK London Research Institute, 44 Lincoln’s Inn Fields, London WC2A 3PX, UK
  - Division of Cancer Studies, King’s College London, London SE1 1UL
- Julian Downward
  - Signal Transduction Laboratory, Cancer Research UK London Research Institute, London, United Kingdom
- Charles Swanton
  - Translational Cancer Therapeutics Laboratory, Cancer Research UK London Research Institute, 44 Lincoln’s Inn Fields, London WC2A 3PX, UK
- Maik Kschischo
  - Department of Mathematics and Technology, RheinAhrCampus, University of Applied Sciences Koblenz, 53424 Remagen, Germany
|
22
|
Domany E. Using High-Throughput Transcriptomic Data for Prognosis: A Critical Overview and Perspectives. Cancer Res 2014; 74:4612-21. [DOI: 10.1158/0008-5472.can-13-3338] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Indexed: 11/16/2022]
|
23
|
Tian F, Wang Y, Seiler M, Hu Z. Functional characterization of breast cancer using pathway profiles. BMC Med Genomics 2014; 7:45. [PMID: 25041817 PMCID: PMC4113668 DOI: 10.1186/1755-8794-7-45] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 12/24/2013] [Accepted: 07/09/2014] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND The molecular characteristics of human diseases are often represented by a list of genes termed "signature genes". A significant challenge facing this approach is that of reproducibility: signatures developed on a set of patients may fail to perform well on different sets of patients. As diseases result from perturbed cellular functions, irrespective of the particular genes that contribute to the function, it may be more appropriate to characterize diseases based on these perturbed cellular functions. METHODS We proposed a profile-based approach to characterize a disease using a binary vector whose elements indicate whether a given function is perturbed based on the enrichment analysis of expression data between normal and tumor tissues. Using breast cancer and its four primary clinically relevant subtypes as examples, this approach is evaluated based on the reproducibility, accuracy and resolution of the resulting pathway profiles. RESULTS Pathway profiles for breast cancer and its subtypes are constructed based on data obtained from microarray and RNA-Seq data sets provided by The Cancer Genome Atlas (TCGA), and an additional microarray data set provided by The European Genome-phenome Archive (EGA). An average reproducibility of 68% is achieved between different data sets (TCGA microarray vs. EGA microarray data) and 67% average reproducibility is achieved between different technologies (TCGA microarray vs. TCGA RNA-Seq data). Among the enriched pathways, 74% are known to be associated with breast cancer or other cancers. About 40% of the identified pathways are enriched in all four subtypes, with 4, 2, 4, and 7 pathways enriched only in luminal A, luminal B, triple-negative, and HER2+ subtypes, respectively. Comparison of profiles between subtypes, as well as other diseases, shows that luminal A and luminal B subtypes are more similar to the HER2+ subtype than to the triple-negative subtype, and subtypes of breast cancer are more likely to be closer to each other than to other diseases. CONCLUSIONS Our results demonstrate that pathway profiles can successfully characterize both common and distinct functional characteristics of four subtypes of breast cancer and other related diseases, with acceptable reproducibility, high accuracy and reasonable resolution.
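As a rough illustration of the profile-based idea described in the abstract above, the sketch below encodes each data set as a binary vector over a shared pathway list and then compares two such profiles. The pathway names, the p-values, the 0.05 cutoff, and the use of Jaccard overlap as an agreement measure are all assumptions made for illustration; the abstract does not state how the authors computed reproducibility.

```python
from typing import Dict, List

ALPHA = 0.05  # assumed significance cutoff, not taken from the paper


def pathway_profile(p_values: Dict[str, float], pathways: List[str],
                    alpha: float = ALPHA) -> List[int]:
    """Binary vector: 1 if the pathway is enriched (p < alpha), else 0."""
    return [1 if p_values.get(name, 1.0) < alpha else 0 for name in pathways]


def jaccard(a: List[int], b: List[int]) -> float:
    """Share of pathways enriched in both profiles among those enriched in either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


# Hypothetical enrichment p-values for two platforms (toy numbers).
pathways = ["cell_cycle", "p53_signaling", "dna_replication", "ppar_signaling"]
microarray = {"cell_cycle": 0.001, "p53_signaling": 0.03,
              "dna_replication": 0.20, "ppar_signaling": 0.04}
rnaseq = {"cell_cycle": 0.002, "p53_signaling": 0.01,
          "dna_replication": 0.04, "ppar_signaling": 0.30}

profile_a = pathway_profile(microarray, pathways)
profile_b = pathway_profile(rnaseq, pathways)
overlap = jaccard(profile_a, profile_b)
```

A binary profile discards effect sizes, which is one plausible reason such profiles can be compared across platforms whose raw expression scales differ.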
Affiliation(s)
- Feng Tian
  - Center for Advanced Genomic Technology, Boston University, Boston, MA 02215, USA
- Yajie Wang
  - Core Laboratory for Clinical Medical Research, Beijing Tiantan Hospital, Capital Medical University, Beijing, P. R. China
  - Department of Clinical Laboratory Diagnosis, Beijing Tiantan Hospital, Capital Medical University, Beijing, P. R. China
- Michael Seiler
  - Center for Advanced Genomic Technology, Boston University, Boston, MA 02215, USA
- Zhenjun Hu
  - Center for Advanced Genomic Technology, Boston University, Boston, MA 02215, USA
24
Zhao X, Rødland EA, Sørlie T, Vollan HKM, Russnes HG, Kristensen VN, Lingjærde OC, Børresen-Dale AL. Systematic assessment of prognostic gene signatures for breast cancer shows distinct influence of time and ER status. BMC Cancer 2014; 14:211. [PMID: 24645668 PMCID: PMC4000128 DOI: 10.1186/1471-2407-14-211] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 06/18/2013] [Accepted: 02/21/2014] [Indexed: 11/06/2022] Open
Abstract
Background The aim was to assess and compare the prognostic power of nine breast cancer gene signatures (Intrinsic, PAM50, 70-gene, 76-gene, Genomic-Grade-Index, 21-gene-Recurrence-Score, EndoPredict, Wound-Response and Hypoxia) in relation to ER status and follow-up time. Methods A gene expression dataset from 947 breast tumors was used to evaluate the signatures for prediction of Distant Metastasis Free Survival (DMFS). A total of 912 patients had available DMFS status. The recently published METABRIC cohort was used as an additional validation set. Results Survival predictions were fairly concordant across most signatures. Prognostic power declined with follow-up time. During the first 5 years of follow-up, all signatures except for Hypoxia were predictive for DMFS in ER-positive disease, and 76-gene, Hypoxia and Wound-Response were prognostic in ER-negative disease. After 5 years, the signatures had little prognostic power. Gene signatures provide significant prognostic information beyond tumor size, node status and histological grade. Conclusions Generally, these signatures performed better for ER-positive disease, indicating that risk within each ER stratum is driven by distinct underlying biology. Most of the signatures were strong risk predictors for DMFS during the first 5 years of follow-up. Combining gene signatures with histological grade or tumor size could improve the prognostic power, perhaps also of long-term survival.
Affiliation(s)
- Xi Zhao
- Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Montebello 0310 Oslo, Norway.
25
Abstract
We introduce Pathifier, an algorithm that infers pathway deregulation scores for each tumor sample on the basis of expression data. This score is determined, in a context-specific manner, for every particular dataset and type of cancer that is being investigated. The algorithm transforms gene-level information into pathway-level information, generating a compact and biologically relevant representation of each sample. We demonstrate the algorithm's performance on three colorectal cancer datasets and two glioblastoma multiforme datasets and show that our multipathway-based representation is reproducible, preserves much of the original information, and allows inference of complex biologically significant information. We discovered several pathways that were significantly associated with survival of glioblastoma patients and two whose scores are predictive of survival in colorectal cancer: CXCR3-mediated signaling and oxidative phosphorylation. We also identified a subclass of proneural and neural glioblastoma with significantly better survival, and an EGF receptor-deregulated subclass of colon cancers.
26
Shi M, Beauchamp RD, Zhang B. A network-based gene expression signature informs prognosis and treatment for colorectal cancer patients. PLoS One 2012; 7:e41292. [PMID: 22844451 PMCID: PMC3402487 DOI: 10.1371/journal.pone.0041292] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Received: 03/03/2012] [Accepted: 06/19/2012] [Indexed: 01/08/2023] Open
Abstract
Background Several studies have reported gene expression signatures that predict recurrence risk in stage II and III colorectal cancer (CRC) patients with minimal gene membership overlap and undefined biological relevance. The goal of this study was to investigate biological themes underlying these signatures, to infer genes of potential mechanistic importance to the CRC recurrence phenotype and to test whether accurate prognostic models can be developed using mechanistically important genes. Methods and Findings We investigated eight published CRC gene expression signatures and found no functional convergence in Gene Ontology enrichment analysis. Using a random walk-based approach, we integrated these signatures and publicly available somatic mutation data on a protein-protein interaction network and inferred 487 genes that were plausible candidate molecular underpinnings for the CRC recurrence phenotype. We named the list of 487 genes a NEM signature because it integrated information from Network, Expression, and Mutation. The signature showed significant enrichment in four biological processes closely related to cancer pathophysiology and provided good coverage of known oncogenes, tumor suppressors, and CRC-related signaling pathways. A NEM signature-based Survival Support Vector Machine prognostic model was trained using a microarray gene expression dataset and tested on an independent dataset. The model-based scores showed a 75.7% concordance with the real survival data and separated patients into two groups with significantly different relapse-free survival (p = 0.002). Similar results were obtained with reversed training and testing datasets (p = 0.007). Furthermore, adjuvant chemotherapy was significantly associated with prolonged survival of the high-risk patients (p = 0.006), but not beneficial to the low-risk patients (p = 0.491). Conclusions The NEM signature not only reflects CRC biology but also informs patient prognosis and treatment response. Thus, the network-based data integration method provides a convergence between biological relevance and clinical usefulness in gene signature development.
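The abstract above says only that a "random walk-based approach" was used to propagate signature and mutation information over a protein-protein interaction network. As a minimal sketch of one common variant, random walk with restart, the code below scores every node of a toy undirected network by its proximity to a set of seed genes. The gene names, the network, the restart probability of 0.5, and the choice of this particular variant are assumptions for illustration and are not taken from the paper.

```python
from typing import Dict, List, Set


def random_walk_with_restart(adjacency: Dict[str, List[str]], seeds: Set[str],
                             restart: float = 0.5, tol: float = 1e-10,
                             max_iter: int = 1000) -> Dict[str, float]:
    """Iterate p <- restart * p0 + (1 - restart) * W p until the scores settle."""
    nodes = sorted(adjacency)
    # Start (and restart) distribution: uniform over the seed genes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                continue  # isolated node: nothing to pass on
            share = (1.0 - restart) * p[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical four-gene chain: GENE_A - GENE_B - GENE_C - GENE_D.
network = {
    "GENE_A": ["GENE_B"],
    "GENE_B": ["GENE_A", "GENE_C"],
    "GENE_C": ["GENE_B", "GENE_D"],
    "GENE_D": ["GENE_C"],
}
scores = random_walk_with_restart(network, seeds={"GENE_A"})
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy chain the score falls off with distance from the seed, which is the property that lets such a walk nominate non-seed genes lying close to many signature or mutated genes.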
Affiliation(s)
- Mingguang Shi
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
- R. Daniel Beauchamp
  - Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
  - Department of Surgery, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
  - Department of Cell and Developmental Biology, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
  - Vanderbilt-Ingram Cancer Center, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
- Bing Zhang
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
  - Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
  - Vanderbilt-Ingram Cancer Center, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
27
Mefford D, Mefford J. Stromal genes add prognostic information to proliferation and histoclinical markers: a basis for the next generation of breast cancer gene signatures. PLoS One 2012; 7:e37646. [PMID: 22719844 PMCID: PMC3377707 DOI: 10.1371/journal.pone.0037646] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 12/17/2011] [Accepted: 04/26/2012] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND First-generation gene signatures that identify breast cancer patients at risk of recurrence are confined to estrogen-positive cases and are driven by genes involved in the cell cycle and proliferation. Previously we induced sets of stromal genes that are prognostic for both estrogen-positive and estrogen-negative samples. Creating risk-management tools that incorporate these stromal signatures, along with existing proliferation-based signatures and established clinicopathological measures such as lymph node status and tumor size, should better identify women at greatest risk for metastasis and death. METHODOLOGY/PRINCIPAL FINDINGS To investigate the strength and independence of the stromal and proliferation factors in estrogen-positive and estrogen-negative patients we constructed multivariate Cox proportional hazards models along with tree-based partitions of cancer cases for four breast cancer cohorts. Two sets of stromal genes, one consisting of DCN and FBLN1, and the other containing LAMA2, add substantial prognostic value to the proliferation signal and to clinical measures. For estrogen receptor-positive patients, the stromal-decorin set adds prognostic value independent of proliferation for three of the four datasets. For estrogen receptor-negative patients, the stromal-laminin set significantly adds prognostic value in two datasets, and marginally in a third. The stromal sets are most prognostic for the unselected population studies and may depend on the age distribution of the cohorts. CONCLUSION The addition of stromal genes would measurably improve the performance of proliferation-based first-generation gene signatures, especially for older women. Incorporating indicators of the state of stromal cell types would mark a conceptual shift from epithelial-centric risk assessment to assessment based on the multiple cell types in the cancer-altered tissue.
28
Cun Y, Fröhlich HF. Prognostic gene signatures for patient stratification in breast cancer: accuracy, stability and interpretability of gene selection approaches using prior knowledge on protein-protein interactions. BMC Bioinformatics 2012; 13:69. [PMID: 22548963 PMCID: PMC3436770 DOI: 10.1186/1471-2105-13-69] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Received: 08/08/2011] [Accepted: 05/01/2012] [Indexed: 01/17/2023] Open
Abstract
Background Stratification of patients according to their clinical prognosis is a desirable goal in cancer treatment in order to achieve better personalized medicine. Reliable predictions on the basis of gene signatures could support medical doctors in selecting the right therapeutic strategy. However, in recent years the low reproducibility of many published gene signatures has been criticized. It has been suggested that incorporation of network or pathway information into prognostic biomarker discovery could improve prediction performance. Meanwhile, a large number of different approaches have been suggested for the same purpose. Methods We found that on average incorporation of pathway information or protein interaction data did not significantly enhance prediction performance, but did greatly enhance the interpretability of gene signatures. Some methods (specifically network-based SVMs) could greatly enhance gene selection stability, but revealed only a comparatively low prediction accuracy, whereas Reweighted Recursive Feature Elimination (RRFE) and average pathway expression led to very clearly interpretable signatures. In addition, average pathway expression, together with elastic net SVMs, showed the highest prediction performance here. Results The results indicated that no single algorithm performed best with respect to all three categories in our study. Incorporating network-based prior knowledge into gene selection methods in general did not significantly improve classification accuracy, but greatly improved the interpretability of gene signatures compared to classical algorithms.
Affiliation(s)
- Yupeng Cun
- Algorithmic Bioinformatics, Bonn-Aachen International Center for IT, Dahlmannstraße, Bonn, Germany
29
Yeung KY, Gooley TA, Zhang A, Raftery AE, Radich JP, Oehler VG. Predicting relapse prior to transplantation in chronic myeloid leukemia by integrating expert knowledge and expression data. Bioinformatics 2012; 28:823-30. [PMID: 22296787 DOI: 10.1093/bioinformatics/bts059] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Abstract
MOTIVATION Selecting a small number of signature genes for accurate classification of samples is essential for the development of diagnostic tests. However, many genes are highly correlated in gene expression data, and hence, many possible sets of genes are potential classifiers. Because treatment outcomes are poor in advanced chronic myeloid leukemia (CML), we hypothesized that expression of classifiers of advanced phase CML when detected in early CML [chronic phase (CP) CML], correlates with subsequent poorer therapeutic outcome. RESULTS We developed a method that integrates gene expression data with expert knowledge and predicted functional relationships using iterative Bayesian model averaging. Applying our integrated method to CML, we identified small sets of signature genes that are highly predictive of disease phases and that are more robust and stable than using expression data alone. The accuracy of our algorithm was evaluated using cross-validation on the gene expression data. We then tested the hypothesis that gene sets associated with advanced phase CML would predict relapse after allogeneic transplantation in 176 independent CP CML cases. Our gene signatures of advanced phase CML are predictive of relapse even after adjustment for known risk factors associated with transplant outcomes.
Affiliation(s)
- K Y Yeung
- Department of Microbiology, University of Washington, Seattle, WA 98195, USA.